Modern machine learning systems are increasingly characterized by extensive personal data collection, despite the diminishing returns and increasing societal costs of such practices. Yet data minimization is one of the core data protection principles enshrined in the European Union's General Data Protection Regulation ('GDPR'), and it requires that only personal data that is adequate, relevant and limited to what is necessary be processed. The principle has nevertheless seen limited adoption because it lacks a technical interpretation. In this work, we build on literature in machine learning and law to propose FIDO, a Framework for Inhibiting Data Overcollection. FIDO learns to limit data collection based on an interpretation of data minimization tied to system performance. Concretely, FIDO provides a data collection stopping criterion by iteratively updating an estimate of the performance curve, that is, the relationship between dataset size and performance, as data is acquired. FIDO estimates the performance curve via a piecewise power law technique that separately models the distinct phases of an algorithm's performance throughout data collection. Experiments show that the framework produces accurate performance curves and data collection stopping criteria across datasets and feature acquisition algorithms. We further demonstrate that many other families of curves systematically overestimate the return on additional data. Results and analysis from our investigation offer deeper insights into the relevant considerations when designing a data minimization framework, including the impacts of active feature acquisition on individual users and the feasibility of user-specific data minimization. We conclude with practical recommendations for the implementation of data minimization.
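The idea of a performance-based stopping criterion can be illustrated with a minimal sketch. This is not the FIDO implementation. It assumes the error follows a single power law, error ≈ a · n^(−b), and it approximates the "piecewise" aspect by fitting only the most recent window of observations. The function names (`fit_power_law`, `should_stop`) and the parameters `step`, `min_gain` and `window` are hypothetical and chosen only for illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error ~ a * n**(-b) in log-log space.

    Assumes at least two distinct, positive dataset sizes and positive errors.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # b is the negated log-log slope


def should_stop(sizes, errors, step, min_gain, window=4):
    """Return True when the predicted error reduction is too small.

    Only the last `window` observations are fitted (an assumption standing in
    for a piecewise fit of the latest phase). Collection stops when acquiring
    `step` more samples is predicted to reduce the error by less than
    `min_gain`.
    """
    if len(sizes) < window:
        return False  # not enough observations to estimate the curve yet
    a, b = fit_power_law(sizes[-window:], errors[-window:])
    n = sizes[-1]
    predicted_gain = a * n ** (-b) - a * (n + step) ** (-b)
    return predicted_gain < min_gain


# Synthetic example: errors generated from an exact power law.
sizes = [100, 200, 400, 800]
errors = [2 * n ** -0.5 for n in sizes]
stop_now = should_stop(sizes, errors, step=100, min_gain=0.01)
```

In practice the estimate would be refreshed each time a new batch of data is acquired, and the threshold `min_gain` would encode how much performance improvement justifies further collection.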