In early January 2020, after China reported the first cases of the novel coronavirus (SARS-CoV-2) in the city of Wuhan, unreliable and inaccurate information started spreading faster than the virus itself. Alongside the pandemic, people have experienced a parallel infodemic, i.e., an overabundance of information, some of it misleading or even harmful, that has spread widely around the globe. Although Social Media are increasingly used as an information source, Web Search Engines, such as Google or Yahoo!, still represent a powerful and trustworthy resource for finding information on the Web, owing to their capability to capture a vast amount of information and to help users quickly identify the most relevant and useful, although not always the most reliable, results for their search queries. This study aims to detect potentially misleading and fake content by capturing and analysing the textual information that flows through Search Engines. Using a real-world dataset associated with the recent CoViD-19 pandemic, we first apply re-sampling techniques to address class imbalance, and then use existing Machine Learning algorithms to classify unreliable news. By extracting lexical and host-based features from the Uniform Resource Locators (URLs) of news articles, we show that these methods, commonly used in phishing and malicious URL detection, can improve the efficiency and performance of classifiers. Based on these findings, we suggest that combining textual and URL-based features can improve the effectiveness of fake news detection methods.
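To make the described pipeline concrete, the sketch below illustrates one plausible realisation of it: lexical and host-based URL features, re-sampling of the minority class, and a standard classifier. This is a minimal illustration assuming scikit-learn and imbalanced-learn; the feature set, the toy URLs and labels, and the choice of SMOTE and a random forest are placeholder assumptions, not the configuration used in the study.

```python
# Minimal sketch (not the study's actual code): lexical/host-based URL
# features + SMOTE re-sampling + a standard classifier. The toy URLs,
# labels, and feature choices below are purely illustrative.
from urllib.parse import urlparse

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def lexical_url_features(url: str) -> list:
    """Simple lexical/host-based features of a URL."""
    parsed = urlparse(url)
    host, path = parsed.netloc, parsed.path
    return [
        len(url),                       # overall URL length
        len(host),                      # hostname length
        host.count("."),                # subdomain separators in the host
        host.count("-"),                # hyphens in the host
        max(len(path.split("/")) - 1, 0),  # path depth
        sum(c.isdigit() for c in url),  # digit count in the whole URL
        int(parsed.scheme == "https"),  # uses HTTPS?
    ]


# Toy, illustrative examples: 1 = unreliable news URL, 0 = reliable.
urls = [
    "https://www.who.int/emergencies/diseases/novel-coronavirus-2019",
    "https://www.cdc.gov/coronavirus/2019-ncov/index.html",
    "https://www.nature.com/articles/s41586-020-2012-7",
    "https://www.reuters.com/article/health-coronavirus",
    "https://www.bbc.com/news/health-51665497",
    "https://www.nih.gov/coronavirus",
    "http://miracle-cure-covid19.example-news.com/buy-now",
    "http://secret-truth.example.info/covid/5g-towers-cause-virus",
    "http://breaking-covid-news24.example.net/vaccine-hoax-exposed",
    "http://wuhan-leak-proof.example.org/read/1234567",
    "http://covid19-cure-today.example.biz/garlic-kills-virus",
    "http://real-news-now.example.co/plandemic-full-story",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

X = [lexical_url_features(u) for u in urls]
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, stratify=labels, random_state=0
)

# Re-sample the training set to mitigate class imbalance; k_neighbors is
# kept small only because this toy training set is tiny.
X_res, y_res = SMOTE(k_neighbors=2, random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```

In the same spirit, the URL-derived features could be concatenated with textual features of the article body (e.g., a TF-IDF representation) before training, which is one way to read the abstract's suggestion that combining the two feature families improves detection.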