We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.49x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
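To illustrate the core idea behind batch-level feature deduplication, the following is a minimal sketch in plain PyTorch. It is not the paper's IKJT implementation or API; the variable names (`values`, `lengths`, `unique_values`, `inverse_indices`) are hypothetical. It shows how per-sample sparse feature value lists that repeat within a batch can be stored once, with an inverse index per sample used to expand results (e.g., pooled embeddings) back to per-sample order.

```python
import torch

# Hypothetical example: one sparse feature in a batch of 4 samples, where
# samples from the same user session repeat the same value list.
# Jagged layout: flattened values + one length per sample.
values = torch.tensor([10, 11, 10, 11, 10, 11, 42])  # flattened feature IDs
lengths = torch.tensor([2, 2, 2, 1])                  # per-sample list lengths

# Split into per-sample value lists, then deduplicate them while recording
# an "inverse" index per sample pointing at its unique value list.
per_sample = [tuple(t.tolist()) for t in torch.split(values, lengths.tolist())]
uniques = {}
inverse = []
for sample in per_sample:
    inverse.append(uniques.setdefault(sample, len(uniques)))

unique_values = [torch.tensor(u) for u in uniques]    # stored once per batch
inverse_indices = torch.tensor(inverse)               # one index per sample

# Downstream, embedding lookups need only run over unique_values; results are
# expanded back to per-sample order with inverse_indices, e.g.:
#   pooled_unique = embedding_bag(...)        # over unique value lists only
#   pooled = pooled_unique[inverse_indices]   # restore per-sample order
print(inverse_indices)  # tensor([0, 0, 0, 1]) -> 3 duplicates collapsed to 1
```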