The use of domain knowledge is generally found to improve query efficiency in content filtering applications. In particular, tangible benefits have been achieved when using knowledge-based approaches within more specialized fields, such as medical free texts or legal documents. However, such sources of domain knowledge are time-consuming to build and equally costly to maintain. As a potential remedy, recent studies on Wikipedia suggest that this large body of socially constructed knowledge can be effectively harnessed to provide not only facts but also accurate information about semantic concept similarities. This paper describes a framework for document filtering, in which Wikipedia's concept-relatedness information is combined with a domain ontology to produce semantic content classifiers. The approach is evaluated using the Reuters RCV1 corpus and the TREC-11 filtering task definitions. In a comparative study, the approach shows robust performance and appears to outperform content classifiers based on Support Vector Machines (SVM) and the C4.5 algorithm.
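As a minimal illustrative sketch (not the paper's actual method), the general idea of combining concept-relatedness information with a set of topic concepts from a domain ontology can be expressed as a simple threshold classifier. All concept names, relatedness values, and function names below are invented toy data standing in for Wikipedia-derived relatedness scores and ontology concepts.

```python
# Toy sketch of semantic document filtering via concept relatedness.
# The concepts and scores are invented illustrative data, not values
# taken from Wikipedia or from the paper's domain ontology.

# Hypothetical symmetric pairwise relatedness between concepts (0..1),
# standing in for Wikipedia's concept-relatedness information.
RELATEDNESS = {
    ("diabetes", "insulin"): 0.85,
    ("diabetes", "glucose"): 0.80,
    ("insulin", "glucose"): 0.75,
    ("diabetes", "football"): 0.05,
}

def relatedness(a, b):
    """Look up symmetric relatedness; a concept is fully related to itself."""
    if a == b:
        return 1.0
    return RELATEDNESS.get((a, b), RELATEDNESS.get((b, a), 0.0))

def score_document(doc_concepts, topic_concepts):
    """Average, over the concepts found in a document, of each concept's
    best relatedness to any concept of the target topic."""
    if not doc_concepts:
        return 0.0
    best = [max(relatedness(c, t) for t in topic_concepts) for c in doc_concepts]
    return sum(best) / len(best)

def filter_document(doc_concepts, topic_concepts, threshold=0.5):
    """Accept the document if its semantic score clears the threshold."""
    return score_document(doc_concepts, topic_concepts) >= threshold

topic = ["diabetes", "insulin"]
print(filter_document(["glucose", "insulin"], topic))  # relevant -> True
print(filter_document(["football"], topic))            # off-topic -> False
```

In the actual framework, the relatedness table would come from Wikipedia's link or category structure and the topic concepts from the domain ontology; the sketch only shows how the two sources combine into a single filtering decision.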