On-device machine learning (ML) inference enables the use of private user data on user devices without sending it to remote servers. However, a purely on-device solution to private ML inference is impractical for many applications that rely on embedding tables too large to store on-device. To overcome this barrier, we propose using private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we (1) develop a novel algorithm for accelerating PIR on GPUs, and (2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our co-design techniques obtain over $5 \times$ additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a naively implemented system -- while maintaining model accuracy, and limiting inference communication and response latency to within $300$KB and $<100$ms respectively.
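To make the setting concrete, the sketch below is a toy illustration of the PIR primitive in the on-device inference flow, not the paper's scheme: it uses a classic two-server XOR PIR (assuming two non-colluding servers, each holding a replica of the embedding table) so the device can fetch an embedding row without either server learning which index was requested. All names (`client_make_queries`, `server_answer`, `client_decode`) and the table dimensions are hypothetical.

```python
# Toy sketch (not the paper's scheme): two-server XOR PIR used to fetch one
# embedding row without either server learning the requested index.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: 100K rows x 64 dims, viewed as raw bytes.
NUM_ROWS, DIM = 100_000, 64
table = rng.standard_normal((NUM_ROWS, DIM)).astype(np.float32)
table_bytes = table.view(np.uint8)             # shape: (NUM_ROWS, DIM * 4)

def client_make_queries(index: int, num_rows: int):
    """Split the selection vector for `index` into two random XOR shares."""
    share0 = rng.integers(0, 2, size=num_rows, dtype=np.uint8)
    share1 = share0.copy()
    share1[index] ^= 1                          # shares differ only at `index`
    return share0, share1                       # each share alone looks uniformly random

def server_answer(db_bytes: np.ndarray, share: np.ndarray) -> np.ndarray:
    """XOR together the rows selected by the share (oblivious to the index)."""
    selected = db_bytes[share.astype(bool)]
    return np.bitwise_xor.reduce(selected, axis=0)

def client_decode(ans0: np.ndarray, ans1: np.ndarray) -> np.ndarray:
    """XOR of the two answers cancels all rows except the requested one."""
    row_bytes = np.bitwise_xor(ans0, ans1)
    return row_bytes.view(np.float32)

# On-device inference would call this in place of a local table lookup,
# then feed `embedding` into the rest of the model running on-device.
q0, q1 = client_make_queries(index=42, num_rows=NUM_ROWS)
ans0 = server_answer(table_bytes, q0)
ans1 = server_answer(table_bytes, q1)
embedding = client_decode(ans0, ans1)
assert np.array_equal(embedding, table[42])
```

Note that each server's work in this toy protocol is linear in the table size, which is why the paper's contributions (GPU acceleration of PIR and co-designing the PIR scheme with the downstream model) are what make such private embedding retrieval practical at inference latency budgets.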