We present a method for the classification of multi-labelled text documents, explicitly designed for data stream applications that must process a virtually infinite sequence of data using constant memory and constant processing time. Our method is composed of an online procedure that efficiently maps text into a low-dimensional feature space, and a partition of this space into a set of regions for which the system extracts and keeps the statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labelled instances colliding in the same region. This approach is referred to as clashing. We illustrate the method on real-world text data, comparing the results with those obtained using other text classifiers. In addition, we provide an analysis of the effect of the representation space dimensionality on the predictive performance of the system. Our results show that the online embedding indeed approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains competitive F measures with respect to the most accurate methods, while using significantly fewer computational resources. In addition, the method achieves a higher macro-averaged F measure than methods with similar running time. Furthermore, the system is able to learn faster than the other methods from partially labelled streams.
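The pipeline described above (embed online, locate a region, annotate from the statistics of colliding instances) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the embedding or the partition, so the hashed term-frequency vector, the sign-of-projection region code, the constants `DIM` and `N_PLANES`, the class name `ClashingSketch`, and the frequency threshold are all assumptions made for the example.

```python
import hashlib
from collections import Counter, defaultdict

DIM = 16       # assumed size of the low-dimensional feature space
N_PLANES = 8   # assumed number of sign bits that identify a region


def _hash(token, salt):
    """Deterministic 64-bit hash of a token (stable across runs)."""
    digest = hashlib.md5(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def embed(words):
    """Hashed term-frequency vector, built one word at a time (assumed embedding)."""
    vec = [0.0] * DIM
    for w in words:
        vec[_hash(w, "idx") % DIM] += 1.0
    return vec


def region(vec):
    """Region id: signs of N_PLANES pseudo-random +/-1 projections (assumed partition)."""
    bits = []
    for p in range(N_PLANES):
        s = 0.0
        for i, v in enumerate(vec):
            if v:
                sign = 1.0 if _hash(f"{p}:{i}", "plane") & 1 else -1.0
                s += sign * v
        bits.append(s >= 0)
    return tuple(bits)


class ClashingSketch:
    """Keeps per-region label counts; documents themselves are never stored."""

    def __init__(self):
        self.doc_count = Counter()               # labelled documents seen per region
        self.label_count = defaultdict(Counter)  # label occurrences per region

    def learn(self, words, labels):
        r = region(embed(words))
        self.doc_count[r] += 1
        self.label_count[r].update(set(labels))

    def predict(self, words, threshold=0.5):
        """Return labels carried by at least `threshold` of the colliding documents."""
        r = region(embed(words))
        n = self.doc_count[r]
        if n == 0:
            return set()
        return {l for l, c in self.label_count[r].items() if c / n >= threshold}


model = ClashingSketch()
model.learn(["stream", "mining", "text"], {"ml", "streams"})
model.learn(["stream", "mining", "text"], {"ml"})
print(model.predict(["stream", "mining", "text"], threshold=0.75))
```

In this sketch the state is bounded by the number of regions (at most 2^`N_PLANES`) times the number of distinct labels, and each document costs time proportional to its length, which is the kind of constant-resource behaviour the abstract targets. Unlabelled documents in a partially labelled stream would simply skip `learn`.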