In Federated Learning (FL), accessing private client data incurs communication and privacy costs. As a result, FL deployments commonly prefinetune pretrained foundation models on a (large, possibly public) dataset held by the central server; they then FL-finetune the model on a private, federated dataset held by clients. Reliably and privately evaluating the quality of candidate prefinetuning datasets is therefore important. To this end, we propose FreD (Federated Private Fr\'echet Distance) -- a privately computed distance between a prefinetuning dataset and federated datasets. Intuitively, FreD computes the Fr\'echet distance between the distributions of embeddings that a large language model produces on the central (public) dataset and on the federated private client data. To make this computation privacy-preserving, we use distributed, differentially private mean and covariance estimators. We show empirically that FreD accurately predicts the best prefinetuning dataset at minimal privacy cost. Altogether, FreD demonstrates a proof-of-concept for a new approach to private FL training: (1) customize a prefinetuning dataset to better match user data, (2) prefinetune, and (3) perform FL-finetuning.
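For intuition, here is a minimal sketch of this recipe in Python. It assumes a simple Gaussian-mechanism privatization of the client-side statistics; the helper names and parameters (dp_mean_cov, clip_norm, sigma_mean, sigma_cov) are illustrative assumptions, and the distributed estimators used in the paper may differ.

```python
# Sketch of the FreD idea (not the authors' implementation):
# fit Gaussians to two embedding sets, privatize the client-side mean
# and covariance with the Gaussian mechanism, then compute the Frechet
# distance between the fitted Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def dp_mean_cov(X, clip_norm, sigma_mean, sigma_cov, rng):
    """Differentially private mean/covariance via per-row norm clipping
    plus Gaussian noise (a standard construction; illustrative only)."""
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))  # clip rows
    mu = Xc.mean(axis=0) + rng.normal(0.0, sigma_mean * clip_norm / n, d)
    C = (Xc - mu).T @ (Xc - mu) / n
    noise = rng.normal(0.0, sigma_cov * clip_norm**2 / n, (d, d))
    C = C + (noise + noise.T) / 2.0  # keep the noisy covariance symmetric
    return mu, C

def frechet_distance(mu1, C1, mu2, C2):
    """Frechet distance between two Gaussians (same formula as FID)."""
    covmean = sqrtm(C1 @ C2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard small numerical imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(C1 + C2 - 2.0 * covmean))

# Usage with placeholder embeddings (in practice, LLM embeddings of the
# server's prefinetuning dataset and of the pooled private client data):
rng = np.random.default_rng(0)
E_server = rng.normal(size=(1000, 16))   # public prefinetuning embeddings
E_clients = rng.normal(size=(1000, 16))  # private client embeddings
mu_s, C_s = E_server.mean(axis=0), np.cov(E_server, rowvar=False)
mu_c, C_c = dp_mean_cov(E_clients, clip_norm=5.0,
                        sigma_mean=1.0, sigma_cov=1.0, rng=rng)
print(frechet_distance(mu_s, C_s, mu_c, C_c))
```

Only the client-side statistics are noised here, since the server's dataset is public; a smaller distance would indicate a prefinetuning dataset that better matches the private client distribution.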