Web spam is a major problem for search engine users on the World Wide Web. Spam websites use deceptive techniques to achieve high rankings in search results. Although many researchers have proposed approaches to web spam classification and detection, it remains an open issue in computer science. Analyzing and evaluating these websites can be an effective step toward discovering and categorizing their features. Several methods and algorithms exist for detecting such websites, including decision tree algorithms. In this paper, we present a systematic framework based on the CHAID algorithm and a modified Knuth-Morris-Pratt (KMP) string matching algorithm for extracting and analyzing the features of these websites. We evaluated our model and other methods on a dataset built from the Alexa Top 500 Global Sites and Bing search engine results for 500 queries.
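
As a rough illustration of the string matching step, the sketch below implements the standard KMP algorithm and uses it to find occurrences of a keyword in page text. It is not the modified variant used in this paper, whose details are not given in the abstract. The function names `build_failure_table` and `kmp_find_all`, and the idea of using a keyword count as a feature, are assumptions made for this example.

```python
def build_failure_table(pattern: str) -> list[int]:
    """For each prefix of pattern, length of its longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) match of pattern in text."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep scanning for overlapping matches
    return matches


# Hypothetical feature: how often a keyword is repeated in a page's text.
page_text = "cheap pills cheap pills buy cheap pills now"
keyword_count = len(kmp_find_all(page_text, "cheap pills"))
```

KMP scans the text once and never re-reads a character after a mismatch, so the cost of a search grows linearly with the combined length of the page and the keyword. That property matters when the same set of keywords is checked against many pages.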