The effectiveness of Machine Learning (ML) methods depends on access to large, suitable datasets. In this article, we present how we built the LS-CAT (Large-Scale CUDA AutoTuning) dataset, sourced from GitHub, for the purpose of training NLP-based ML models. Our dataset includes 19 683 CUDA kernels focused on linear algebra. In addition to the CUDA code, the LS-CAT dataset contains 5 028 536 associated runtimes, covering different combinations of kernels, block sizes, and matrix sizes. The runtimes were benchmarked on GPUs in both Nvidia GTX 980 and Nvidia T4 systems. This information creates a foundation upon which NLP-based models can find correlations between source-code features and the optimal choice of thread block size. Several results can be drawn from the LS-CAT database. For example, our experimental results show that choosing the optimal thread block size yields an average performance gain of 6% over the average case. We also analyze how much performance increase can be achieved in general, finding that in 10% of the cases, more than 20% performance increase can be achieved by using the optimal block size. A description of current and future work is also included.
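The "gain over the average case" figure above can be understood as the speedup of the best-performing block size relative to the mean runtime across all benchmarked block sizes for a kernel. A minimal sketch of that computation, using hypothetical runtimes (the numbers below are illustrative, not taken from the LS-CAT dataset):

```python
# Hypothetical runtimes (in ms) for one kernel across candidate block sizes.
# Real LS-CAT entries pair each kernel with many (block size, matrix size) runs.
runtimes = {32: 1.30, 64: 1.10, 128: 0.95, 256: 1.00, 512: 1.25}

def gain_over_average(runtimes):
    """Relative runtime reduction of the optimal block size vs. the average case."""
    avg = sum(runtimes.values()) / len(runtimes)
    best = min(runtimes.values())
    return (avg - best) / avg

best_block = min(runtimes, key=runtimes.get)  # block size with lowest runtime
gain = gain_over_average(runtimes)            # fraction saved vs. the mean
```

Averaging this quantity over all kernels in the dataset gives an aggregate figure of the kind reported in the abstract.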