It's All in the Embedding! Fake News Detection Using Document Embeddings

As the mass media landscape shifts from traditional journalism to social media, personalized social media feeds are becoming the new norm. Although the digitalization of the media brings many advantages, it also increases the risk of spreading disinformation, misinformation, and malinformation through fake news. This harmful phenomenon has polarized society and manipulated public opinion on particular topics, e.g., elections and vaccinations. Such information, propagated on social media without the rigor of traditional journalism, can distort public perceptions and generate social unrest. Natural Language Processing and Machine Learning techniques are essential for developing efficient tools that detect fake news. Models that use the context of textual data are particularly well suited to the fake news detection problem, as they encode linguistic features within the vector representation of words. In this paper, we propose a new approach that uses document embeddings to build multiple models that accurately label news articles as reliable or fake. We also present a benchmark of different architectures that detect fake news using binary or multi-label classification. We evaluate the models on five large news corpora using accuracy, precision, and recall, and obtain better results than more complex state-of-the-art Deep Neural Network models. We observe that the most important factor for obtaining high accuracy is the document encoding, not the classification model's complexity.
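The pipeline the abstract describes (encode each article as a document embedding, feed it to a simple classifier, and score the result with accuracy, precision, and recall) can be illustrated with a minimal sketch. This is not the authors' implementation. The two-dimensional word vectors, the toy texts, and the helper names (`embed`, `fit`, `predict`, `evaluate`) are all made up for illustration, and a nearest-centroid rule stands in for whichever classifiers the paper benchmarks. A real system would presumably use pretrained or trained embeddings instead of a hand-written lookup table.

```python
from math import sqrt

# Hypothetical 2-d word vectors. The values are invented for this sketch;
# a real system would load trained embeddings instead.
WORD_VECTORS = {
    "shocking": (0.9, 0.1),
    "miracle": (0.8, 0.2),
    "secret": (0.7, 0.3),
    "cure": (0.6, 0.4),
    "officials": (0.1, 0.9),
    "report": (0.2, 0.8),
    "study": (0.3, 0.7),
    "confirmed": (0.2, 0.8),
}


def embed(text):
    """Document embedding: the mean of the vectors of all known words."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return (0.0, 0.0)  # no known word: fall back to the zero vector
    return tuple(sum(column) / len(vecs) for column in zip(*vecs))


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit(texts, labels):
    """Nearest-centroid 'model': one mean document embedding per label."""
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(embed(text))
    return {
        label: tuple(sum(column) / len(vecs) for column in zip(*vecs))
        for label, vecs in by_label.items()
    }


def predict(model, text):
    """Return the label whose centroid is closest to the document embedding."""
    vector = embed(text)
    return min(model, key=lambda label: distance(model[label], vector))


def evaluate(y_true, y_pred, positive="fake"):
    """Accuracy, precision, and recall, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy data, invented for illustration only.
train_texts = ["shocking miracle cure", "secret cure",
               "officials report study", "study confirmed"]
train_labels = ["fake", "fake", "reliable", "reliable"]
test_texts = ["shocking secret", "officials confirmed", "miracle study"]
test_labels = ["fake", "reliable", "reliable"]

model = fit(train_texts, train_labels)
predictions = [predict(model, text) for text in test_texts]
accuracy, precision, recall = evaluate(test_labels, predictions)
print(predictions, accuracy, precision, recall)
```

The classifier here is deliberately trivial: all the separating power comes from where `embed` places each document. That mirrors, on a toy scale, the abstract's observation that the document encoding matters more than the classifier's complexity.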
