A universal synthetic dataset for machine learning on spectroscopic data

To assist in the development of machine learning methods for automated classification of spectroscopic data, we have generated a universal synthetic dataset that can be used for model validation. This dataset contains artificial spectra designed to represent experimental measurements from techniques including X-ray diffraction, nuclear magnetic resonance, and Raman spectroscopy. The dataset generation process features customizable parameters, such as scan length and peak count, which can be adjusted to fit the problem at hand. As an initial benchmark, we simulated a dataset containing 35,000 spectra based on 500 unique classes. To automate the classification of this data, eight different machine learning architectures were evaluated. From the results, we shed light on which factors are most critical to achieving optimal performance for the classification task. The scripts used to generate synthetic spectra, as well as our benchmark dataset and evaluation routines, are made publicly available to aid in the development of improved machine learning models for spectroscopic analysis.
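As a rough illustration of what a generator with adjustable scan length and peak count might look like, the sketch below builds a small labeled set of artificial spectra with NumPy. It is not the authors' released script. The function names, the Gaussian peak shape, the jitter and noise parameters, and the default sizes are all hypothetical choices made for this example. Each class is a random set of peak positions and heights, and each sample is a perturbed, noisy rendering of its class template.

```python
import numpy as np


def make_class_templates(n_classes, scan_length, max_peaks, rng):
    """Draw random peak positions and heights per class (hypothetical scheme)."""
    templates = []
    for _ in range(n_classes):
        n_peaks = rng.integers(1, max_peaks + 1)  # between 1 and max_peaks peaks
        positions = rng.uniform(0, scan_length - 1, size=n_peaks)
        heights = rng.uniform(0.2, 1.0, size=n_peaks)
        templates.append((positions, heights))
    return templates


def render_spectrum(positions, heights, scan_length, rng,
                    shift_std=1.0, height_jitter=0.1, width=2.0, noise_std=0.01):
    """Render one noisy spectrum from a class template (assumed Gaussian peaks)."""
    x = np.arange(scan_length)
    # Perturb peak positions and heights so samples of one class differ.
    pos = positions + rng.normal(0.0, shift_std, size=positions.shape)
    hts = heights * (1.0 + rng.normal(0.0, height_jitter, size=heights.shape))
    hts = np.clip(hts, 0.0, None)
    # Sum of Gaussians: (scan_length, 1) broadcast against (n_peaks,).
    profile = (hts * np.exp(-0.5 * ((x[:, None] - pos) / width) ** 2)).sum(axis=1)
    # Additive Gaussian noise on every point of the scan.
    profile += rng.normal(0.0, noise_std, size=scan_length)
    return profile


def make_dataset(n_classes=5, per_class=4, scan_length=500, max_peaks=10, seed=0):
    """Return (X, y): spectra of shape (n_classes * per_class, scan_length) and labels."""
    rng = np.random.default_rng(seed)
    templates = make_class_templates(n_classes, scan_length, max_peaks, rng)
    X = np.empty((n_classes * per_class, scan_length))
    y = np.empty(n_classes * per_class, dtype=int)
    for c, (positions, heights) in enumerate(templates):
        for j in range(per_class):
            i = c * per_class + j
            X[i] = render_spectrum(positions, heights, scan_length, rng)
            y[i] = c
    return X, y


X, y = make_dataset()
print(X.shape, y.shape)
```

If the 35,000 benchmark spectra were spread evenly over the 500 classes (an assumption, since the abstract does not state the split), that would correspond to `n_classes=500, per_class=70` in this sketch.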
