Modern recommender systems operate in a fully server-based fashion. To cater to millions of users, frequent model maintenance and high-speed processing of concurrent user requests are required, which comes at the cost of a huge carbon footprint. Meanwhile, users must upload their behavior data, including even their immediate environmental context, to the server, raising public concerns about privacy. On-device recommender systems circumvent these two issues with cost-conscious settings and local inference. However, due to limited memory and computing resources, on-device recommender systems face two fundamental challenges: (1) how can regular models be shrunk to fit edge devices, and (2) how can the original model capacity be retained? Previous research mostly adopts tensor decomposition techniques to compress regular recommendation models, but only at limited compression ratios, so as to avoid drastic performance degradation. In this paper, we explore ultra-compact models for next-item recommendation by loosening the constraint of dimensionality consistency in tensor decomposition. Meanwhile, to compensate for the capacity loss caused by compression, we develop a self-supervised knowledge distillation framework that enables the compressed model (student) to distill the essential information in the raw data, and improves long-tail item recommendation through an embedding-recombination strategy with the original model (teacher). Extensive experiments on two benchmarks demonstrate that, with a 30x reduction in model size, the compressed model suffers almost no accuracy loss and even outperforms its uncompressed counterpart in most cases.
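To make the compression setting concrete, the following is a minimal NumPy sketch of a standard tensor-train embedding table, the kind of factorization this line of work builds on. It is an illustration under assumed sizes, not the paper's method: the relaxed dimensionality-consistency constraint is not reproduced here, and the factor shapes, ranks, and the `tt_embedding` helper are all hypothetical.

```python
import numpy as np

# Hypothetical sizes: a vocabulary of 10,000 items with 64-dim embeddings.
# The virtual N x D table is factored so that N = 20*25*20 and D = 4*4*4.
N_FACTORS = (20, 25, 20)   # product covers the number of items
D_FACTORS = (4, 4, 4)      # product equals the embedding dimension
RANKS = (1, 8, 8, 1)       # TT-ranks; boundary ranks are 1 by convention

rng = np.random.default_rng(0)
# One 4-D core per factor pair, shaped (r_{k-1}, n_k, d_k, r_k).
cores = [
    rng.standard_normal((RANKS[k], N_FACTORS[k], D_FACTORS[k], RANKS[k + 1])) * 0.1
    for k in range(3)
]

def tt_embedding(item_id: int) -> np.ndarray:
    """Reconstruct one row of the virtual N x D embedding table on the fly."""
    # Mixed-radix decomposition of item_id into one index per core.
    idx, rem = [], item_id
    for n in reversed(N_FACTORS):
        idx.append(rem % n)
        rem //= n
    idx.reverse()
    # A chain of small contractions replaces a direct table lookup.
    v = cores[0][:, idx[0], :, :]                   # (1, d_1, r_1)
    for k in range(1, len(cores)):
        g = cores[k][:, idx[k], :, :]               # (r_k, d_{k+1}, r_{k+1})
        v = np.einsum('adr,res->ades', v, g)        # contract over the shared rank
        v = v.reshape(v.shape[0], -1, v.shape[-1])  # merge the embedding axes
    return v.reshape(-1)                            # final shape: (64,)

print(tt_embedding(1234).shape)  # (64,)
full_params = 10_000 * 64
tt_params = sum(c.size for c in cores)
print(full_params / tt_params)   # ~83x fewer parameters in this toy setup
```

In this toy configuration the cores hold roughly 83x fewer parameters than the full table; the factor shapes and ranks control the trade-off between compression ratio and capacity, which is exactly the tension the abstract describes.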
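As a companion sketch, the distillation side of such a teacher-student setup is often built on the classic soft-label objective below (Hinton-style logit distillation). This is an assumed baseline formulation, not the paper's self-supervised objective or its embedding-recombination strategy; the temperature value and function names are illustrative.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable tempered softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """Cross-entropy between the teacher's and student's tempered
    next-item distributions, scaled by T^2 as is conventional."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean() * temperature ** 2)

# Toy usage: one session, three candidate next items.
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.5, 0.2, -0.5]])
print(distillation_loss(student, teacher))  # a small positive scalar
```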