The utilisation of large and diverse datasets for machine learning (ML) at scale is required to promote scientific insight into many meaningful problems. However, due to data governance regulations such as the GDPR as well as ethical concerns, the aggregation of personal and sensitive data is problematic, which has prompted the development of alternative strategies such as distributed ML (DML). Techniques such as Federated Learning (FL) allow data owners to maintain data governance and perform model training locally without having to share their data. FL and related techniques are often described as privacy-preserving. We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind. We further provide recommendations and examples of how such algorithms can be augmented to provide guarantees of governance, security, privacy and verifiability for a general ML audience without prior exposure to formal privacy techniques.
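To make the FL setting described above concrete, the following is a minimal, self-contained sketch of federated averaging (FedAvg) on a toy linear-regression task. The simulated clients, the `local_update` helper, and all hyperparameters are hypothetical illustrations for this sketch, not the protocol of any specific system. Only model parameters cross the trust boundary; yet, as the abstract argues, this alone is not a formal privacy guarantee, because shared updates can still leak information about the underlying data.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's local training: plain gradient descent on mean squared error."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Three simulated clients, each holding a private shard drawn from the
# same underlying relationship y = 3x + noise. Raw shards never move.
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 1))
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

w_global = np.zeros(1)
for _ in range(10):  # communication rounds
    # Each client trains locally starting from the current global parameters...
    local_weights = [local_update(w_global, X, y) for X, y in clients]
    # ...and the server averages the returned parameters (FedAvg).
    w_global = np.mean(local_weights, axis=0)

print(f"Global weight after federated training: {w_global[0]:.3f}")  # approx. 3.0
```

Plain parameter averaging is the simplest possible aggregation rule; hardening this sketch would require additions such as secure aggregation or differentially private noise on the updates, which is the kind of augmentation with formal guarantees that the abstract recommends.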