Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, Parik Pol

From arXiv. In The 49th Annual International Symposium on Computer Architecture (ISCA 2022)

Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.
