Foundation models are deep learning models that are pre-trained on large amounts of data and are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful for pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets dataset, consisting of approximately 178M high-$p_T$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-$\alpha$ foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.