Matching entries of correlated shuffled databases have practical applications ranging from privacy to biology. In this paper, motivated by synchronization errors in the sampling of time-indexed databases, matching of random databases under random column repetitions and deletions is investigated. It is assumed that for each entry (row) in the database, the attributes (columns) are correlated, which is modeled as a Markov process. Column histograms are proposed as a permutation-invariant feature to detect the repetition pattern, whose asymptotic-uniqueness is proved using information-theoretic tools. Repetition detection is then followed by a typicality-based row matching scheme. Considering this overall scheme, sufficient conditions for successful matching of databases in terms of the database growth rate are derived. A modified version of Fano's inequality leads to a tight necessary condition for successful matching, establishing the matching capacity under column repetitions. This capacity is equal to the erasure bound, which assumes the repetition locations are known a-priori. Overall, our results provide insights on privacy-preserving publication of anonymized time-indexed data.
翻译:匹配相关折叠数据库的条目具有从隐私到生物学的实用应用。 在本文中,由时间索引数据库抽样的同步错误推动,对随机数据库进行随机列重复和删除的匹配进行调查。 假设数据库中每个条目( 滚) 的属性( 列) 相关属性( 列) 具有关联性, 以 Markov 进程为模型。 列直方图作为一种变异性特征被提出, 以检测重复模式, 其无症状被证明是使用信息理论工具证明的。 然后, 以典型性为主的行匹配方案。 考虑到这一总体方案, 得出了在数据库增长率方面成功匹配数据库的充分条件。 修改后的法诺的不平等性导致成功匹配的严格必要条件, 建立列重复下的匹配能力。 此能力等于取消约束, 假设重复位置为已知的首要位置。 总体而言, 我们的结果提供了对保留隐私的旧时间索引数据出版的洞察力。