Institutions collect massive amounts of learning traces, but they may not disclose them because of privacy concerns. Synthetic data generation opens new opportunities for research in education. In this paper, we present a generative model for educational data that can preserve the privacy of participants, and an evaluation framework for comparing synthetic data generators. We show how naive pseudonymization can lead to re-identification threats and suggest techniques to guarantee privacy. We evaluate our method on existing massive open educational datasets.