数据库配对及对抗性列删除 (Database Matching Under Adversarial Column Deletions)

The de-anonymization of users from anonymized microdata through matching or aligning with publicly-available correlated databases has been of scientific interest recently. While most of the rigorous analyses of database matching have focused on random-distortion models, the adversarial-distortion models have been wanting in the relevant literature. In this work, motivated by synchronization errors in the sampling of time-indexed microdata, matching (alignment) of random databases under adversarial column deletions is investigated. It is assumed that a constrained adversary, which observes the anonymized database, can delete up to a $\delta$ fraction of the columns (attributes) to hinder matching and preserve privacy. Column histograms of the two databases are utilized as permutation-invariant features to detect the column deletion pattern chosen by the adversary. The detection of the column deletion pattern is then followed by an exact row (user) matching scheme. The worst-case analysis of this two-phase scheme yields a sufficient condition for the successful matching of the two databases, under the near-perfect recovery condition. A more detailed investigation of the error probability leads to a tight necessary condition on the database growth rate, and in turn, to a single-letter characterization of the adversarial matching capacity. This adversarial matching capacity is shown to be significantly lower than the random matching capacity, where the column deletions occur randomly. Overall, our results analytically demonstrate the privacy-wise advantages of adversarial mechanisms over random ones during the publication of anonymized time-indexed data.

翻译：最近，通过将匿名化的微观数据与公开可用的相关数据库进行匹配或对齐来去匿名化用户已成为科学研究的焦点。尽管数据库匹配的大部分严格分析都集中在随机扭曲模型上，但对于对抗性扭曲模型，相关文献一直比较缺乏。本文通过同步误差来自于时序数据采样，研究了在对抗性列删除下随机数据库的匹配（对齐）。假设一个受限制的攻击者可以删除最多 $\delta$ 小数列（属性），以阻碍匹配并保护隐私。两个数据库的列直方图被用作不变置换特征，以检测攻击者选择的列删除模式。然后，对列删除模式的检测进行了实际行（用户）匹配方案。该二阶段方案的最坏情况分析得出了满足近乎完美恢复条件下成功匹配两个数据库的充分条件。更详细地研究误差概率导致了对数据库增长速率的严格必要条件，并进而得到对抗性匹配容量的单字母特征化。该对抗性匹配容量显示出显著低于随机匹配容量的情况，其中列删除是随机的。总的来说，我们的结果在分析上证明了在发布匿名化的时序数据时，对抗性机制在隐私方面优于随机机制。