The research community has devoted significant effort to methods for organizing documents with the help of information retrieval techniques. In this paper, we present several experiments that apply stream analysis methods to streams of text documents. We use only dynamic algorithms to explore, analyze, and organize the document stream. We present a case study of architectures developed for text document stream organization, built on incremental algorithms such as Incremental TextRank and IS-TFIDF. Both algorithms rest on the assumption that mapping text documents and their document-term matrix onto lower-dimensional evolving networks yields faster processing than batch algorithms. With this architecture, and using FastText embeddings to measure similarity between documents, we compare methods on large text datasets and evaluate their clustering capacity against ground truth. The datasets used were Reuters and COVID-19 emotions. The results offer a new view of how similarity is contextualized in document stream organization tasks, based on the similarity between documents in the stream and on the algorithms mentioned above.
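To make the incremental idea concrete, the following is a minimal sketch of TF-IDF weighting that is updated one document at a time, followed by a similarity lookup against earlier documents in the stream. It is not the IS-TFIDF algorithm or the architecture described above: the class `IncrementalTfidf`, the helpers `cosine` and `most_similar`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration, and cosine similarity over TF-IDF vectors stands in for the FastText embedding similarity used in the paper.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Illustrative incremental TF-IDF (hypothetical, not the paper's IS-TFIDF).

    Document frequencies are updated as each document arrives, so no
    batch pass over the whole collection is needed.
    """

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.term_counts = []      # raw term counts, one Counter per document

    def add(self, text):
        """Ingest one document from the stream and return its id."""
        counts = Counter(text.lower().split())  # assumption: whitespace tokens
        self.n_docs += 1
        self.doc_freq.update(counts.keys())
        self.term_counts.append(counts)
        return len(self.term_counts) - 1

    def vector(self, doc_id):
        """Sparse TF-IDF vector computed from the statistics seen so far."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        return {
            term: (c / total)
            * (math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0)
            for term, c in counts.items()
        }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(model, doc_id):
    """Return (other_id, similarity) for the closest earlier document, or None."""
    target = model.vector(doc_id)
    best = None
    for other in range(doc_id):
        score = cosine(target, model.vector(other))
        if best is None or score > best[1]:
            best = (other, score)
    return best


# Toy stream: each arriving document is matched against what came before.
stream = [
    "oil prices rise on supply fears",
    "crude oil prices climb again",
    "vaccine rollout lifts public mood",
]
model = IncrementalTfidf()
for text in stream:
    doc_id = model.add(text)
    print(doc_id, most_similar(model, doc_id))
```

Because vectors are recomputed from the current document frequencies, the weight of a term in an old document changes as new documents arrive. A production system would likely cache or approximate these updates, which is the kind of cost that the incremental algorithms in the paper aim to reduce.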