Extremely high data rates at modern synchrotron and X-ray free-electron laser light source beamlines motivate the use of machine learning methods for data reduction, feature detection, and other purposes. Regardless of the application, the basic concept is the same: data collected in the early stages of an experiment, data from similar past experiments, and/or data simulated for the upcoming experiment are used to train machine learning models that, in effect, learn specific characteristics of those data; these models are then used to process subsequent data more efficiently than general-purpose models that lack knowledge of the specific dataset or data class. A key challenge is therefore to train models rapidly enough that they can be deployed and used within useful timescales. We describe here how specialized data center AI (DCAI) systems can be used for this purpose through a geographically distributed workflow. Experiments show that although using remote DCAI systems for deep neural network (DNN) training incurs data movement costs and service overhead, the turnaround time is still less than 1/30 of that obtained with a locally deployable GPU.