The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Two recent trends stand out: 1) recommender systems must be retrained in a timely manner to cope with ever-growing new products and ever-changing user interests arising from online marketing and social networks; 2) state-of-the-art (SOTA) recommendation models introduce deep neural network (DNN) modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot keep up with these two trends, and GPU-centric training has become a popular approach. However, we observe that GPU devices are underutilized when training recommender systems, and they do not attain the throughput improvement that GPUs have achieved in computer vision (CV) and natural language processing (NLP). This issue can be explained by two characteristics of these recommendation models: first, they contain up to a thousand input feature fields, which introduce fragmentary and memory-intensive operations; second, their multiple constituent feature interaction submodules introduce a substantial number of small compute kernels. To remove this roadblock to the development of recommender systems, we propose a novel framework named PICASSO to accelerate the training of recommendation models on commodity hardware. Specifically, we conduct a systematic analysis to reveal the bottlenecks encountered in training recommendation models. We then leverage the model structure and data distribution to unleash the potential of the hardware through our packing, interleaving, and caching optimizations. Experiments show that PICASSO increases hardware utilization by an order of magnitude over SOTA baselines and brings up to 6x throughput improvement for a variety of industrial recommendation models. Using the same hardware budget in production, PICASSO on average shortens the wall time of daily training tasks by 7 hours, significantly reducing the delay of continuous delivery.
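To make the idea of packing concrete, the following is a minimal sketch of how many small per-field embedding lookups could be fused into a single lookup over one concatenated table. It is an illustration under stated assumptions, not PICASSO's actual implementation: the helper names (`make_table`, `lookup`, `pack_tables`, `packed_lookup`) are hypothetical, all tables are assumed to share one embedding dimension, and a plain Python function call stands in for a GPU kernel launch.

```python
from itertools import accumulate

def make_table(rows, dim, seed):
    # Deterministic toy embedding table (hypothetical helper): rows x dim floats.
    return [[float(seed * 1000 + r * dim + d) for d in range(dim)]
            for r in range(rows)]

def lookup(table, ids):
    # Stand-in for one gather kernel: each call models one kernel launch.
    return [table[i] for i in ids]

def per_field_lookup(tables, ids_per_field):
    # Baseline: one small gather per feature field.
    return [lookup(t, ids) for t, ids in zip(tables, ids_per_field)]

def pack_tables(tables):
    # Concatenate the tables and record the starting row of each field.
    sizes = [len(t) for t in tables]
    offsets = [0] + list(accumulate(sizes))[:-1]
    packed = [row for t in tables for row in t]
    return packed, offsets

def packed_lookup(packed, offsets, ids_per_field):
    # Shift each field's ids into the packed row space, then gather once.
    flat_ids = [off + i
                for off, ids in zip(offsets, ids_per_field)
                for i in ids]
    flat = lookup(packed, flat_ids)
    # Split the flat result back into one list of rows per field.
    out, pos = [], 0
    for ids in ids_per_field:
        out.append(flat[pos:pos + len(ids)])
        pos += len(ids)
    return out

# Three toy feature fields with 5, 3, and 7 rows, embedding dimension 4.
tables = [make_table(rows, 4, seed) for seed, rows in enumerate([5, 3, 7])]
ids_per_field = [[0, 4], [2], [6, 1, 3]]

packed, offsets = pack_tables(tables)
baseline = per_field_lookup(tables, ids_per_field)   # three gathers
fused = packed_lookup(packed, offsets, ids_per_field)  # one gather
```

In this sketch the baseline issues one gather per field, while the packed variant issues a single gather and returns the same rows. With up to a thousand feature fields, this kind of consolidation is one plausible way to cut down on fragmentary operations.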