A Multi-input Multi-output Transformer-based Hybrid Neural Network for Multi-class Privacy Disclosure Detection

Concern regarding users' data privacy has risen to its highest level due to the massive increase in communication platforms and social networking sites and to greater user participation in online public discourse. An increasing number of people exchange private information via emails, text messages, and social media without being aware of the risks and implications. Because a substantial amount of data is exchanged in textual form, researchers in the field of Natural Language Processing (NLP) have concentrated on creating tools and strategies to identify, categorize, and sanitize private information in text data. However, most detection methods rely solely on the presence of pre-identified keywords in the text and disregard the underlying meaning of an utterance in its specific context. Hence, in some situations, these tools and algorithms fail to detect disclosure or misclassify it. In this paper, we propose a multi-input, multi-output hybrid neural network that utilizes transfer learning, linguistics, and metadata to learn the hidden patterns. Our goal is to better classify disclosure/non-disclosure content in terms of the context of situation. We trained and evaluated our model on a human-annotated ground truth dataset containing a total of 5,400 tweets. The results show that, by jointly learning two separate tasks, the proposed model identified privacy disclosure in tweets with an accuracy of 77.4% and classified the information type of those tweets with an accuracy of 99%.
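
The abstract does not give the architecture in detail, so the following is only a minimal sketch of what a multi-input, multi-output model with joint learning could look like. Everything specific here is an assumption made for illustration: the class name `TwoHeadModel`, the feature dimensions, fusing the three inputs by concatenation, the single shared hidden layer, and the weighted sum of two cross-entropy losses. The `text_vec` argument stands in for a sentence embedding that a pretrained transformer might produce; no real transformer is used.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoHeadModel:
    """Three inputs, one shared layer, two output heads (assumed layout)."""

    def __init__(self, text_dim, ling_dim, meta_dim, hidden_dim,
                 n_info_types, seed=0):
        rng = random.Random(seed)
        fused_dim = text_dim + ling_dim + meta_dim
        self.shared = init_layer(rng, hidden_dim, fused_dim)
        # Head 1: disclosure vs. non-disclosure (binary).
        self.disclosure_head = init_layer(rng, 2, hidden_dim)
        # Head 2: information type (multi-class).
        self.info_type_head = init_layer(rng, n_info_types, hidden_dim)

    def forward(self, text_vec, ling_vec, meta_vec):
        # Assumption: the inputs are fused by simple concatenation.
        fused = list(text_vec) + list(ling_vec) + list(meta_vec)
        hidden = [max(0.0, h) for h in linear(fused, *self.shared)]  # ReLU
        p_disclosure = softmax(linear(hidden, *self.disclosure_head))
        p_info_type = softmax(linear(hidden, *self.info_type_head))
        return p_disclosure, p_info_type


def joint_loss(p_disclosure, p_info_type, y_disclosure, y_info_type,
               type_weight=1.0):
    """Joint objective: weighted sum of the two per-task losses."""
    return (cross_entropy(p_disclosure, y_disclosure)
            + type_weight * cross_entropy(p_info_type, y_info_type))


# Usage with made-up feature vectors and labels for a single tweet.
model = TwoHeadModel(text_dim=8, ling_dim=3, meta_dim=2,
                     hidden_dim=4, n_info_types=5)
p_disc, p_type = model.forward([0.1] * 8, [1.0, 0.0, 2.0], [0.5, 0.0])
loss = joint_loss(p_disc, p_type, y_disclosure=1, y_info_type=3)
```

Because both heads read the same shared hidden representation, minimising the summed loss would update that representation using signals from both tasks, which is the general idea behind joint learning. The `type_weight` parameter is a hypothetical knob for balancing the two tasks.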
