Flexible and scalable privacy assessment for very large datasets, with an application to official governmental microdata

We present a systematic refactoring of the conventional treatment of privacy analyses, basing it on mathematical concepts from the framework of Quantitative Information Flow (QIF). The approach we suggest brings three principal advantages: it is flexible, allowing for precise quantification and comparison of privacy risks for attacks both known and novel; it can be computationally tractable for very large, longitudinal datasets; and its results are explainable both to politicians and to the general public. We apply our approach to a very large case study: the Educational Censuses of Brazil, curated by the governmental agency INEP, which comprise over 90 attributes of approximately 50 million individuals released longitudinally every year since 2007. These datasets have only very recently (2018-2021) attracted legislation to regulate their privacy -- while at the same time continuing to maintain the openness that had been sought in Brazilian society. INEP's reaction to that legislation was the genesis of our project with them. In our conclusions here we share the scientific, technical, and communication lessons we learned in the process.
