In modern machine learning, users often must collaborate to learn the distribution of their data. Communication can be a significant bottleneck. Prior work has studied homogeneous users -- i.e., users whose data follow the same discrete distribution -- and has provided optimal communication-efficient methods for estimating that distribution. However, these methods rely heavily on homogeneity and are less applicable in the common case where users' discrete distributions are heterogeneous. Here we consider a natural and tractable model of heterogeneity, in which users' discrete distributions vary only sparsely, on a small number of entries. We propose a novel two-stage method named SHIFT: first, the users collaborate by communicating with the server to learn a central distribution, relying on methods from robust statistics; then, the learned central distribution is fine-tuned to estimate each user's individual distribution. We show that SHIFT is minimax optimal in our model of heterogeneity and under communication constraints. Further, we provide experimental results on both synthetic data and $n$-gram frequency estimation in the text domain, which corroborate its efficiency.
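To make the two-stage structure concrete, here is a minimal illustrative sketch, not the paper's actual SHIFT protocol: it stands in a coordinate-wise median for the robust aggregation step, a simple thresholding rule for the fine-tuning step, and ignores the communication compression that the paper addresses (users here send their full empirical distributions). All function names, the threshold, and the toy parameters are hypothetical.

```python
import numpy as np

def robust_central_estimate(empirical_dists):
    """Stage 1 (sketch): aggregate users' empirical distributions with a
    coordinate-wise median, a standard robust-statistics stand-in. Sparse
    per-user shifts act like outliers on a few coordinates, so the median
    largely ignores them."""
    central = np.median(empirical_dists, axis=0)
    return central / central.sum()  # renormalize to a distribution

def fine_tune(own_dist, central, threshold):
    """Stage 2 (sketch): keep the central estimate except on coordinates
    where this user's empirical distribution deviates strongly, i.e. the
    (assumed sparse) entries where the user truly differs."""
    est = central.copy()
    shifted = np.abs(own_dist - central) > threshold
    est[shifted] = own_dist[shifted]
    return est / est.sum()

# Toy usage: n users over k symbols; each perturbs s entries of a shared
# central distribution, then draws m samples from the result.
rng = np.random.default_rng(0)
n_users, k, s, m = 50, 100, 3, 5000
central_true = rng.dirichlet(np.ones(k))
emp = []
for _ in range(n_users):
    p = central_true.copy()
    idx = rng.choice(k, size=s, replace=False)
    p[idx] += 0.05            # sparse heterogeneity
    p /= p.sum()
    counts = rng.multinomial(m, p)
    emp.append(counts / m)    # each user's empirical distribution
emp = np.asarray(emp)

central_hat = robust_central_estimate(emp)
user0_hat = fine_tune(emp[0], central_hat, threshold=0.02)
```

The design intuition, under these assumptions, is that heterogeneity confined to a few entries looks like per-coordinate contamination, so robust aggregation recovers the central distribution from pooled data, and each user then only needs to estimate a sparse correction locally.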