As the Internet grows, so does the amount of text-based information it contains. For many applications it is essential to isolate and identify texts that relate to a particular topic. While one-class classification would be ideal for such analysis, there is relatively little research on efficient approaches with high predictive power. By noting that the range of documents we wish to identify can be represented as positive linear combinations of the vectors in the Vector Space Model representation of our text, we propose Conical classification, an approach that determines whether a document belongs to a particular topic in a computationally efficient manner. We also propose Normal Exclusion, a modification of Bi-Normal Separation that is better suited to the one-class classification context. Our analysis shows that our approach not only has higher predictive power on our datasets, but is also faster to compute.
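To make the geometric idea concrete, the following is a minimal sketch of testing whether a document vector can be written as a non-negative combination of known on-topic vectors. It is not the Conical classification algorithm itself, whose details are not given here. The names `cone_residual` and `in_cone`, the tolerance, the iteration count, and the toy term-weight vectors are all hypothetical, and projected gradient descent is used only because it needs no external libraries.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cone_residual(basis, x, iters=5000):
    """Distance from x to the cone {sum_i a_i * basis[i] : a_i >= 0}.

    Illustrative only: minimises 0.5 * ||sum_i a_i * basis[i] - x||^2
    over a >= 0 by projected gradient descent.
    """
    n = len(basis)
    gram = [[dot(u, v) for v in basis] for u in basis]
    bx = [dot(u, x) for u in basis]
    trace = sum(gram[i][i] for i in range(n))
    if trace == 0.0:
        # All basis vectors are zero, so the cone is just the origin.
        return math.sqrt(dot(x, x))
    # The trace of the Gram matrix bounds its largest eigenvalue,
    # so 1 / trace is a safe (conservative) step size.
    step = 1.0 / trace
    a = [0.0] * n
    for _ in range(iters):
        grad = [dot(gram[i], a) - bx[i] for i in range(n)]
        # Gradient step, then projection onto the non-negative orthant.
        a = [max(0.0, a[i] - step * grad[i]) for i in range(n)]
    approx = [sum(a[i] * basis[i][k] for i in range(n)) for k in range(len(x))]
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(approx, x)))


def in_cone(basis, x, tol=1e-6):
    """True if x is (numerically) a non-negative combination of basis."""
    return cone_residual(basis, x) <= tol


# Hypothetical term-weight vectors for two on-topic documents.
topic_docs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

print(in_cone(topic_docs, [2.0, 1.0, 0.0]))  # sum of the two topic vectors
print(in_cone(topic_docs, [0.0, 0.0, 1.0]))  # uses only an unseen term
```

Real documents will rarely lie exactly inside such a cone, so a practical classifier would presumably threshold the residual (or a normalised version of it) rather than require an exact fit. With SciPy available, `scipy.optimize.nnls` could replace the hand-written loop.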