Data minimization is a legal obligation, defined in the European Union's General Data Protection Regulation (GDPR), to process personal data that is adequate, relevant, and limited to what is necessary in relation to a processing purpose. However, unlike fairness or transparency, the principle has seen little adoption in machine learning systems because it lacks a computational interpretation. In this paper, we build on the machine learning and legal literature to propose the first learning framework for limiting data collection, based on an interpretation that ties the data collection purpose to system performance. We formalize a data minimization criterion based on performance curve derivatives, and provide an effective, interpretable piecewise power law technique that models the distinct stages of an algorithm's performance throughout data collection. Results from our empirical investigation offer deeper insights into the relevant considerations when designing a data minimization framework, including the choice of feature acquisition algorithm and initialization conditions, as well as impacts on individuals that hint at tensions between data minimization and fairness.
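To make the derivative-based criterion concrete, the following is a minimal sketch of the underlying idea: fit a power law to an observed performance curve and stop collecting data once the marginal gain falls below a threshold. It models only a single power-law segment rather than the full piecewise technique described above, and the function names, the threshold `eps`, and the log-log fitting procedure are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fit_power_law(sizes, scores):
    """Fit scores ~ a * sizes**b via linear regression in log-log space.

    Assumes strictly positive sizes and scores (e.g. error rates).
    Returns the coefficient a and exponent b.
    """
    b, log_a = np.polyfit(np.log(sizes), np.log(scores), 1)
    return np.exp(log_a), b

def minimization_point(a, b, sizes, eps=1e-4):
    """Return the smallest dataset size at which the fitted curve's
    derivative |d/dn (a * n**b)| drops below eps, or None if it never does.
    """
    deriv = np.abs(a * b * sizes ** (b - 1))
    idx = np.argmax(deriv < eps)  # first index satisfying the criterion
    return sizes[idx] if deriv[idx] < eps else None
```

In this sketch, a flattening performance curve (exponent b close to 0, or small |a*b|) triggers an early stopping point, operationalizing "limited to what is necessary" as the point where additional data no longer meaningfully improves the model.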