The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing innovation and hindering reproducible research in learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism) to a ground-truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.
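To make the dataset's structure concrete, the sketch below shows how one such labeled data point might be represented: a program paired with a transformation sequence and its measured execution time. The field names and transformation syntax here are purely illustrative assumptions, not LOOPerSet's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a LOOPerSet-style data point.
# All field names and the schedule string format are assumptions
# for illustration; they are not the dataset's real schema.
@dataclass
class DataPoint:
    program_id: str           # identifier of a synthetic polyhedral program
    schedule: list[str]       # sequence of semantics-preserving transformations
    execution_time_ms: float  # ground-truth label: measured execution time

point = DataPoint(
    program_id="synthetic_000042",
    schedule=[
        "Fusion(L0,L1)",        # fuse two loop nests
        "Skewing(L1,1,2)",      # skew loop L1 with factors (1, 2)
        "Tiling(L1,32,32)",     # tile loop L1 with a 32x32 tile
        "Parallelize(L0)",      # mark the outer loop parallel
    ],
    execution_time_ms=12.7,
)
```

A learned cost model would then be trained to predict `execution_time_ms` (or a speedup derived from it) from the program and schedule features.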