The use of mutual information in private data sharing has remained an open challenge because it is difficult to estimate in practice. In this paper, we propose InfoShape, a task-based encoder that aims to remove unnecessary sensitive information from training data while preserving enough relevant information for a particular ML training task. We achieve this by using neural-network-based mutual information estimators to measure two performance metrics: privacy and utility. Combining these in a Lagrangian optimization, we train a separate neural network as a lossy encoder. We empirically show that InfoShape is capable of shaping the encoded samples to be informative for a specific downstream task while eliminating unnecessary sensitive information. Moreover, we demonstrate that the classification accuracy of downstream models is meaningfully connected to our utility and privacy measures.
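As a rough illustration of how the two metrics could be combined, the sketch below shows a Donsker-Varadhan-style lower bound on mutual information (the basis of common neural MI estimators such as MINE) together with a Lagrangian objective that trades utility against privacy. This is a minimal sketch under our own assumptions, not the paper's implementation: the critic `T`, the sample lists, and the weight `lam` are all illustrative placeholders.

```python
import numpy as np

def dv_bound(T, joint_samples, marginal_samples):
    """Donsker-Varadhan lower bound on mutual information:
    I(X; Y) >= E_joint[T(x, y)] - log E_marginal[exp(T(x, y))].
    In MINE-style estimators, T is a trained neural critic; here it is
    any callable scoring an (x, y) pair."""
    joint_term = np.mean([T(x, y) for x, y in joint_samples])
    marginal_term = np.log(np.mean([np.exp(T(x, y)) for x, y in marginal_samples]))
    return joint_term - marginal_term

def lagrangian_loss(utility_mi, privacy_mi, lam):
    """Encoder training objective (hypothetical form): maximize the MI
    between encodings and task labels (utility) while penalizing the MI
    between encodings and sensitive attributes (privacy), weighted by lam."""
    return -utility_mi + lam * privacy_mi
```

In an actual training loop, the two `dv_bound` estimates (one for utility, one for privacy) would be computed from mini-batches of encoded samples, and the encoder's weights updated by descending `lagrangian_loss`.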