Web archive data usually contains high-quality documents that are very useful for creating specialized collections, e.g., scientific digital libraries and repositories of technical reports. Consequently, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection from the huge number of documents collected by web archiving institutions. In this paper, we explore different learning models and feature representations to determine the best-performing ones for identifying the documents of interest in web archive data. Specifically, we study both machine learning and deep learning models, "bag of words" (BoW) features extracted from the entire document or from specific portions of it, and structural features that capture the structure of documents. We focus our evaluation on three datasets that we created from three different Web archives. Our experimental results show that BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
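To make the idea concrete, the sketch below shows a minimal bag-of-words classifier that separates documents of interest from other web-archived pages, with an option to build features from only the first portion of each document. This is an illustrative pure-Python stand-in, not the paper's actual pipeline: the classifier (nearest centroid over BoW vectors), the toy data, and the `n_tokens` truncation are all assumptions made for the example.

```python
# Hedged sketch: BoW features plus a nearest-centroid classifier as a
# simple stand-in for the learned models compared in the paper.
from collections import Counter
import math

def bow(text, n_tokens=None):
    """Bag-of-words counts, optionally from only the first n_tokens
    (emulating features extracted from a specific document portion)."""
    tokens = text.lower().split()
    if n_tokens is not None:
        tokens = tokens[:n_tokens]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CentroidBoWClassifier:
    """Assign a document to the class whose summed BoW vector it is
    most similar to (an illustrative, hypothetical model)."""
    def __init__(self, n_tokens=None):
        self.n_tokens = n_tokens
        self.centroids = {}

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            c = self.centroids.setdefault(label, Counter())
            c.update(bow(text, self.n_tokens))
        return self

    def predict(self, text):
        vec = bow(text, self.n_tokens)
        return max(self.centroids,
                   key=lambda lbl: cosine(vec, self.centroids[lbl]))

# Toy data: 1 = document of interest (technical report), 0 = other page.
train = [
    ("abstract introduction method experiments results references", 1),
    ("we evaluate our approach on three benchmark datasets", 1),
    ("home login contact news subscribe newsletter", 0),
    ("buy now free shipping customer reviews cart", 0),
]
clf = CentroidBoWClassifier(n_tokens=50).fit(*zip(*train))
print(clf.predict("abstract and experiments on benchmark datasets"))
```

Truncating to the first `n_tokens` tokens is a crude proxy for the paper's portion-based features; in practice one would extract specific regions (title page, headers) rather than a fixed token prefix.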