Machine learning, and deep learning in particular, plays an important role in information management, but privacy protection remains a pressing concern for current models. Users produce large volumes of text containing personal information every day, and training models on this text risks invading their privacy. Many methods have therefore been proposed to prevent models from learning and memorizing sensitive data in raw text. In this paper, we take a more linguistic approach: we distort the text while preserving its semantics. Specifically, we use our recently proposed metric, Neighboring Distribution Divergence, to evaluate how well semantics are preserved during distortion. Based on this metric, we propose two frameworks for semantics-preserved distortion, a generative one and a substitutive one. We conduct experiments on named entity recognition, constituency parsing, and machine reading comprehension. The results show that our distortion is a plausible and efficient method for protecting personal privacy. We also evaluate attribute attacks on three privacy-related tasks in current natural language processing, and the results show that our data-based approach is simple and effective compared with the structural improvement approach. Finally, we investigate privacy protection in the specific setting of medical information management and show that a medical pre-training model trained with our approach effectively reduces its memorization of patient and symptom information, which demonstrates the practicality of our approach.