Semantic text classification has advanced significantly in recent years due to the rise of large language models (LLMs) and their high-dimensional embeddings. While LLM embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster, the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data, using scientific abstracts as a case study. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally, we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
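The nesting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the density criterion, so the sketch assumes a simple one (a point is "core" if at least `min_neighbors` other points lie within a Euclidean `radius`, and clusters are connected components of core points). The function names `density_clusters` and `nested_density_tree`, the parameters, and the toy 2-D points standing in for LLM embeddings are all hypothetical.

```python
from itertools import combinations
from math import dist


def density_clusters(points, radius, min_neighbors):
    """Connected components of 'core' points at one density level.

    Assumed criterion (not taken from the paper): a point is core if at
    least `min_neighbors` other points lie within `radius` of it, and two
    core points are linked if they lie within `radius` of each other.
    """
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= radius:
            neighbors[i].add(j)
            neighbors[j].add(i)
    core = {i for i in range(n) if len(neighbors[i]) >= min_neighbors}

    clusters, seen = [], set()
    for start in sorted(core):
        if start in seen:
            continue
        # Depth-first search restricted to core points.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend((neighbors[node] & core) - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


def nested_density_tree(points, radii, min_neighbors=2):
    """Cluster at increasingly relaxed radii and link each cluster to its parent.

    Returns (levels, edges): `levels` lists the clusters per radius, densest
    first, with the whole dataset appended as the root; `edges` holds
    (child, parent) pairs between consecutive levels.
    """
    levels = [density_clusters(points, r, min_neighbors) for r in sorted(radii)]
    levels.append([frozenset(range(len(points)))])  # root: the whole dataset
    edges = []
    for finer, coarser in zip(levels, levels[1:]):
        for child in finer:
            # Relaxing the radius only adds core points and links, so every
            # finer cluster is contained in exactly one coarser cluster.
            parent = next(c for c in coarser if child <= c)
            edges.append((child, parent))
    return levels, edges


# Toy data: two tight groups and one outlier (stand-ins for text embeddings).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
levels, edges = nested_density_tree(points, radii=[0.2, 10.0], min_neighbors=2)
for depth, clusters in enumerate(levels):
    print(depth, [sorted(c) for c in clusters])
```

In this toy run the two tight groups appear as separate clusters at the small radius, merge at the larger radius, and the outlier only joins at the root. The pairwise distance loop is quadratic in the number of texts, so a real corpus would need a nearest-neighbor index or an established density clustering library; a cluster that does not change between two radii also shows up as its own parent, which a fuller implementation would collapse.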