Detection of fake news on CoViD-19 on Web Search Engines

In early January 2020, after China reported the first cases of the new coronavirus (SARS-CoV-2) in the city of Wuhan, unreliable and not fully accurate information started spreading faster than the virus itself. Alongside the pandemic, people have experienced a parallel infodemic, i.e., an overabundance of information, some of it misleading or even harmful, that has spread widely around the globe. Although Social Media are increasingly used as an information source, Web Search Engines, like Google or Yahoo!, still represent a powerful and trustworthy resource for finding information on the Web. This is due to their capability to capture the largest amount of information, helping users quickly identify the most relevant and useful, although not always the most reliable, results for their search queries. This study aims to detect potentially misleading and fake content by capturing and analysing the textual information that flows through Search Engines. Using a real-world dataset associated with the recent CoViD-19 pandemic, we first apply re-sampling techniques to address class imbalance, then use existing Machine Learning algorithms to classify unreliable news. By extracting lexical and host-based features of the Uniform Resource Locators (URLs) associated with news articles, we show that the proposed methods, common in phishing and malicious URL detection, can improve the efficiency and performance of classifiers. Based on these findings, we suggest that using both textual and URL features can improve the effectiveness of fake news detection methods.
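To make the two preprocessing ideas in the abstract more concrete, the following is a minimal Python sketch, not the study's actual pipeline. The abstract does not list which URL features or which re-sampling technique were used, so the feature names in `lexical_url_features` and the choice of plain random oversampling in `random_oversample` are illustrative assumptions, as are both function names. Host-based features (which typically require external lookups) are left out.

```python
import random
from collections import Counter
from urllib.parse import urlparse


def lexical_url_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    The feature set is an assumption for illustration; the abstract
    does not enumerate the features actually used.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        # Number of non-empty path segments, e.g. /a/b -> 2
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        "uses_https": int(parsed.scheme == "https"),
    }


def random_oversample(rows, labels, seed=0):
    """Balance classes by duplicating randomly chosen minority-class rows.

    Random oversampling is one possible re-sampling technique; it is
    assumed here, not taken from the abstract.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

In a setup like the one described, each article's URL would be mapped to such a feature dictionary, combined with its textual features, and the training split would be re-sampled before fitting a classifier. Re-sampling only the training data, never the test data, is the usual precaution against optimistic evaluation.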
