With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from the cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preferences, it imposes an additional computational burden through local retraining. To address these challenges, we propose a framework for \underline{\textbf{C}}ustomizing \underline{\textbf{H}}ybrid-precision \underline{\textbf{O}}n-device model for sequential \underline{\textbf{R}}ecommendation with \underline{\textbf{D}}evice-cloud collaboration (\textbf{CHORD}), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling a precise mapping from user profiles to quantization strategies. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies with only 2 bits per channel instead of transmitting 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
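To make the 2-bit-per-channel strategy encoding concrete, the sketch below is a minimal illustration rather than the authors' implementation: the four candidate bit-widths, the symmetric uniform fake-quantizer, and all function names (e.g., pack_strategy, quantize_channelwise) are hypothetical choices assumed for exposition. It packs per-channel 2-bit codes four to a byte for cloud-to-device transfer and quantizes each output channel at its assigned precision.

```python
import numpy as np

# Hypothetical mapping from a 2-bit code to a per-channel bit-width;
# code 3 is assumed to mean "keep the channel in full precision".
CODE_TO_BITS = {0: 2, 1: 4, 2: 8, 3: 32}

def pack_strategy(codes: np.ndarray) -> bytes:
    """Pack per-channel 2-bit codes (values 0..3) into bytes, 4 codes per byte."""
    codes = codes.astype(np.uint8)
    pad = (-len(codes)) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    grouped = codes.reshape(-1, 4)
    packed = (grouped[:, 0] | (grouped[:, 1] << 2)
              | (grouped[:, 2] << 4) | (grouped[:, 3] << 6))
    return packed.astype(np.uint8).tobytes()

def unpack_strategy(blob: bytes, num_channels: int) -> np.ndarray:
    """Inverse of pack_strategy: recover the per-channel 2-bit codes."""
    packed = np.frombuffer(blob, dtype=np.uint8)
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1)[:num_channels]

def quantize_channelwise(W: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Symmetric uniform fake-quantization of W (out_channels x in_dim),
    with the bit-width of each output channel chosen by its 2-bit code."""
    W_q = W.copy()
    for c, code in enumerate(codes):
        bits = CODE_TO_BITS[int(code)]
        if bits >= 32:  # sensitive channels stay at full precision
            continue
        qmax = 2 ** (bits - 1) - 1
        scale = max(np.abs(W[c]).max(), 1e-12) / qmax
        W_q[c] = np.clip(np.round(W[c] / scale), -qmax - 1, qmax) * scale
    return W_q

# Usage: a 64-channel layer whose quantization strategy is sent from cloud to device.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
codes = rng.integers(0, 4, size=64)   # stand-in for the hypernetwork's per-channel output
blob = pack_strategy(codes)           # 64 channels -> 16 bytes of strategy metadata
W_q = quantize_channelwise(W, unpack_strategy(blob, 64))
```

Under these assumptions, transmitting the strategy for a 64-channel layer costs 16 bytes, versus 32 KB if the 64x128 float32 weights themselves were resent, which is the communication saving the abstract refers to.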