With the rapid development of Internet technology, people have access to an ever-growing variety of web page resources. The rapid progress of deep learning, in turn, often depends on large amounts of web data. Natural language processing (NLP) is likewise an important part of data processing, for example in web page data extraction. Existing techniques for extracting the main text of a web page mostly rely on a single heuristic function or strategy, and most require a manually chosen threshold. As the number and variety of web resources grow, a single strategy still falls short when extracting text from different kinds of pages. This paper proposes a web page text extraction algorithm based on multi-feature fusion. Based on the characteristics of text in web resources, the method takes DOM nodes as the extraction unit, designs multiple statistical features for each node, and derives higher-order features from heuristic strategies. A small neural network takes these DOM node features as input and predicts whether each node contains text content; this makes full use of the different statistical signals and extraction strategies and lets the method adapt to more types of pages. Experimental results show that the method extracts web page text well and avoids the need to set a threshold manually.
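The pipeline described in the abstract (per-node statistical features, a heuristic higher-order feature, and a small neural network that scores each DOM node) can be illustrated with a minimal sketch. The abstract does not list the actual features or the network architecture, so everything concrete here is an assumption: the four features (text length, link density, punctuation density, and a link-penalised text density), the names `Node`, `DomBuilder`, `features`, and `predict`, and the hand-picked weights, which stand in for parameters that would normally be learned from labelled nodes.

```python
import math
from html.parser import HTMLParser

class Node:
    """Minimal DOM node: tag, attributes, parent, children, direct text."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, attrs or {}, parent
        self.children, self.text = [], ""

class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects from an HTML string."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self.cur)
        self.cur.children.append(node)
        if tag not in self.VOID:          # void elements never get children
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur                      # climb to the nearest matching open element
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data

def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def all_text(node):
    return node.text + "".join(all_text(c) for c in node.children)

def n_chars(s):
    return sum(1 for ch in s if not ch.isspace())

def link_chars(node):
    if node.tag == "a":
        return n_chars(all_text(node))
    return sum(link_chars(c) for c in node.children)

def features(node):
    """Assumed feature set; the paper's real features are not given in the abstract."""
    text = all_text(node)
    chars = n_chars(text)
    tags = sum(1 for _ in iter_nodes(node)) - 1          # descendant element count
    link_density = link_chars(node) / max(chars, 1)
    punct_density = sum(ch in ".,;:!?，。；：！？" for ch in text) / max(chars, 1)
    # Heuristic higher-order feature: text per tag, discounted by link density.
    fused = math.log1p(chars / (tags + 1)) * (1.0 - link_density)
    return [math.log1p(chars), link_density, punct_density, fused]

# Illustrative, hand-picked weights (NOT trained): 4 inputs -> 2 tanh units -> sigmoid.
W1 = [[0.5, -2.0, 1.0, 0.5], [-0.5, 2.0, 0.0, -0.5]]
B1 = [-1.0, 1.0]
W2 = [2.0, -2.0]
B2 = 0.0

def predict(node):
    """Returns a score in (0, 1) that the node holds main text."""
    x = features(node)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, B1)]
    z = sum(w * h for w, h in zip(W2, hidden)) + B2
    return 1.0 / (1.0 + math.exp(-z))

HTML = ("<html><body>"
        "<div id='nav'><a href='/a'>Home</a><a href='/b'>News</a></div>"
        "<div id='main'><p>Deep learning needs web data, and lots of it.</p></div>"
        "</body></html>")

for n in iter_nodes(parse(HTML)):
    if n.tag == "div":
        print(n.attrs.get("id"), round(predict(n), 3))
```

Because the network outputs a probability per node, the decision rule is a fixed 0.5 cut on a learned score, not a per-site hand-tuned threshold on a single statistic, which is the property the abstract emphasises. In a real system the weights would be fitted on annotated pages, and the feature set would likely be richer than the four assumed here.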