Traditional approaches to data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joint, privacy-preserving version of the heterogeneous input data. We introduce the concept of redundant sensitive information to anonymize the heterogeneous data consistently. To control the influence of anonymization on unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter $\lambda$ allows different weights to be assigned to the relational and textual attributes during the anonymization process. We evaluate our approach on two real-world datasets using a Normalized Certainty Penalty score adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach can reduce information loss by using the tuning parameter to control the Mondrian partitioning, while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended with other anonymization algorithms, privacy models, and textual similarity metrics.
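To illustrate how a weighting parameter such as $\lambda$ could steer Mondrian-style partitioning, the following is a minimal sketch, not the rx-anon implementation. It assumes that each attribute has a normalized span in [0, 1], that relational attributes are weighted by $\lambda$ and textual attributes by $1 - \lambda$, and that the attribute with the largest weighted span is chosen for the next split. The function names, the weighting rule, and the example attribute names are all hypothetical.

```python
from typing import Dict, List, Set


def choose_split_attribute(spans: Dict[str, float],
                           textual: Set[str],
                           lam: float) -> str:
    """Pick the attribute to split on (hypothetical weighting rule).

    spans:   normalized span per attribute, each in [0, 1]
    textual: names of attributes derived from sensitive terms in text
    lam:     weight in [0, 1]; relational attributes are scaled by lam,
             textual attributes by (1 - lam). This is an assumption,
             not necessarily how rx-anon applies its parameter.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")

    def score(attr: str) -> float:
        weight = (1.0 - lam) if attr in textual else lam
        return weight * spans[attr]

    return max(spans, key=score)


def is_k_anonymous(partitions: List[List[dict]], k: int) -> bool:
    """A partitioning satisfies k-anonymity if every partition has >= k records."""
    return all(len(part) >= k for part in partitions)


# Usage with made-up spans: one textual attribute and two relational ones.
spans = {"age": 0.6, "zip": 0.4, "text_terms": 0.9}
textual = {"text_terms"}
balanced_choice = choose_split_attribute(spans, textual, lam=0.5)
relational_choice = choose_split_attribute(spans, textual, lam=0.9)
```

Under these assumptions, a larger `lam` biases the splits toward relational attributes, so the textual terms tend to stay more general, and a smaller `lam` does the opposite.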