Collecting personally identifiable information (PII) on data subjects has become big business. Data brokers and data processors are part of a multi-billion-dollar industry that profits from collecting, buying, and selling consumer data. Yet there is little transparency in the data collection industry, which makes it difficult to understand what types of data are being collected, used, and sold, and thus the risk to individual data subjects. In this study, we examine a large textual dataset of privacy policies from 1997 to 2019 in order to investigate the data collection activities of data brokers and data processors. We also develop an original lexicon of PII-related terms representing PII data types curated from legislative texts. This mesoscale analysis examines privacy policies at the word, topic, and network levels to understand how their stability, complexity, and sensitivity change over time. We find that (1) privacy legislation correlates with changes in the stability and turbulence of PII data types in privacy policies; (2) privacy policies become less complex and more regularized over time; and (3) sensitivity rises over time and shows spikes that coincide with the introduction of new privacy legislation.