NLP pipelines with limited or no labeled data rely on unsupervised methods for document processing. Unsupervised approaches typically depend on clustering of terms or documents. In this paper, we introduce Vec2GC (Vector to Graph Communities), a novel clustering algorithm and end-to-end pipeline for clustering the terms or documents of any given text corpus. Our method applies community detection to a weighted graph of the terms or documents, constructed from learned text representations. Vec2GC is a density-based approach that also supports hierarchical clustering.
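To make the pipeline concrete, the following is a minimal sketch of the three stages the abstract names: vector representations, a weighted similarity graph, and community detection. The abstract does not specify the embedding model, the similarity measure, the threshold, the edge weighting, or the community detection algorithm, so everything here is an assumption made for illustration: hand-written toy vectors stand in for learned embeddings, cosine similarity with a cutoff of 0.8 defines the edges, and a simple weighted label propagation stands in for whichever community detection method Vec2GC uses. The names `TOY_VECTORS`, `build_graph`, and `label_propagation` are hypothetical. The hierarchical variant is not covered.

```python
import math
from collections import defaultdict

# Toy 3-d vectors standing in for learned term embeddings (assumed values).
TOY_VECTORS = {
    "cat": (1.0, 0.1, 0.0),
    "dog": (0.9, 0.2, 0.0),
    "kitten": (1.0, 0.0, 0.1),
    "stock": (0.0, 0.1, 1.0),
    "bond": (0.1, 0.0, 0.9),
    "equity": (0.0, 0.2, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, threshold=0.8):
    """Link items whose cosine similarity is at least `threshold`.

    The similarity itself is used as the edge weight (an assumption).
    """
    graph = defaultdict(dict)
    names = sorted(vectors)
    for i, a in enumerate(names):
        graph[a]  # touch the key so isolated nodes still appear
        for b in names[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


def label_propagation(graph, max_iters=20):
    """Stand-in community detection: weighted label propagation.

    Each node repeatedly adopts the label with the largest total edge
    weight among its neighbours; ties go to the alphabetically first label.
    """
    labels = {node: node for node in graph}
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated node keeps its own label
            scores = defaultdict(float)
            for nbr, weight in graph[node].items():
                scores[labels[nbr]] += weight
            best = max(sorted(scores), key=scores.get)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return sorted(sorted(members) for members in communities.values())


if __name__ == "__main__":
    graph = build_graph(TOY_VECTORS, threshold=0.8)
    print(label_propagation(graph))
    # [['bond', 'equity', 'stock'], ['cat', 'dog', 'kitten']]
```

Because only pairs above the threshold are linked, dense regions of the vector space form connected groups while dissimilar items share no edge. In this sketch, that is how a graph-based formulation behaves like a density-based clustering. A real corpus would replace the toy vectors with embeddings from a representation learning model and would likely use a modularity-based community detection method rather than this simple stand-in.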