In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible hardness assumption and assuming the distributions generating the public and private data satisfy certain properties. We show that the answer to this appears to be quite subtle and closely related to the average-case complexity of a new multi-task, missing-data version of the classic problem of phase retrieval. Motivated by this connection, we design a provable algorithm that can recover private vectors using only the public vectors and synthetic vectors generated by InstaHide, under the assumption that the private and public vectors are isotropic Gaussian.
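The encoding step described above — mixing a few private feature vectors via a convex combination, then independently flipping the sign of each coordinate with probability 1/2 — can be sketched as follows. This is a minimal illustration of the mechanism as stated in the abstract, not the authors' implementation; the function name, the mixing parameter `k`, and the uniform choice of mixture weights are assumptions made for the sketch.

```python
import numpy as np

def instahide_encode(private_vecs, k=2, rng=None):
    """Illustrative sketch of InstaHide's encoding step:
    take a convex combination of k private feature vectors,
    then flip the sign of each entry independently with
    probability 1/2. Names and defaults are hypothetical."""
    rng = np.random.default_rng(rng)
    n, d = private_vecs.shape
    idx = rng.choice(n, size=k, replace=False)  # pick k private vectors
    weights = rng.random(k)
    weights /= weights.sum()                    # normalize: convex combination
    mixed = weights @ private_vecs[idx]         # weighted mixture, shape (d,)
    signs = rng.choice([-1.0, 1.0], size=d)     # independent +/-1, prob 1/2 each
    return signs * mixed
```

Note that the sign flips destroy each coordinate's sign but not its magnitude, which is what links recovery of the private vectors to a missing-data variant of phase retrieval.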