Data privacy and data ownership are central concerns in social data science and raise both legal and ethical questions. Sharing and analyzing data is difficult when different parties own different parts of it. One approach to this challenge is to de-identify or anonymize the data before collecting it for analysis. However, this can reduce data utility while still leaving a risk of re-identification. To address these limitations, we present PADME, a distributed analytics tool that federates model implementation and training. In PADME, the participating parties jointly implement and deploy the model, which then visits each data location in turn and is trained incrementally. This enables analysis across locations while the model is trained as if all data were held in a single location. Because the model is trained on the data where it resides, data ownership is preserved. Furthermore, results are withheld until the analysis has been completed at every data location, which protects privacy and avoids bias from partial results.
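The incremental training described above can be illustrated with a minimal sketch. It is not PADME's implementation or API: the names `Station` and `TravellingModel`, the linear model, the learning rate, and the toy data are all assumptions chosen for illustration. The sketch shows only the two ideas from the abstract, namely that the model travels to the data rather than the reverse, and that results are released only after every location has been visited.

```python
from dataclasses import dataclass


@dataclass
class Station:
    """Hypothetical data holder; its records are only read inside this class."""
    name: str
    records: list  # list of (x, y) pairs that stay at this location

    def train_locally(self, weights, lr=0.05, epochs=200):
        """Continue gradient descent for y ~ w*x + b using local records only."""
        w, b = weights
        n = len(self.records)
        for _ in range(epochs):
            grad_w = sum((w * x + b - y) * x for x, y in self.records) / n
            grad_b = sum((w * x + b - y) for x, y in self.records) / n
            w -= lr * grad_w
            b -= lr * grad_b
        return (w, b)


class TravellingModel:
    """Hypothetical model that visits each station in turn (assumed design)."""

    def __init__(self, stations):
        self._stations = list(stations)
        self._weights = (0.0, 0.0)
        self._complete = False

    def run(self, rounds=5):
        """Visit every station sequentially; only the weights move between them."""
        self._complete = False
        for _ in range(rounds):
            for station in self._stations:
                self._weights = station.train_locally(self._weights)
        self._complete = True

    def result(self):
        """Release the trained weights only after all stations were visited."""
        if not self._complete:
            raise RuntimeError("analysis has not finished at all stations")
        return self._weights


# Toy data following y = 2x + 1, split between two owners (illustrative only).
stations = [
    Station("A", [(0, 1), (1, 3), (2, 5)]),
    Station("B", [(3, 7), (4, 9), (5, 11)]),
]
model = TravellingModel(stations)
model.run()
w, b = model.result()
```

In this sketch the only state that leaves a station is the pair of weights, and calling `result()` before `run()` has finished raises an error. A real deployment would also need authentication, transport security, and checks on what the weights themselves might reveal, none of which is modeled here.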