Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advances in large language models (LLMs) offer contextualized word embeddings, which capture deeper meanings of, and relationships between, words. We aim to leverage such embeddings to improve topic modeling. We use a pre-trained LLM to convert each document into a sequence of word embeddings. This sequence is then modeled as a Poisson point process, with its intensity measure expressed as a convex combination of $K$ base measures, each corresponding to a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods, enhanced by net rounding applied beforehand and kernel smoothing applied afterward. One advantage of this framework is that it treats the LLM as a black box, requiring no fine-tuning of its parameters. Another advantage is its ability to seamlessly integrate any traditional topic modeling approach as a plug-in module, without the need for modification. Assuming each topic is a $\beta$-Hölder smooth intensity measure on the embedding space, we establish the rate of convergence of our method. We also provide a minimax lower bound and show that the rate of our method matches the lower bound when $\beta \leq 1$. Additionally, we apply our method to several datasets, providing evidence that it offers an advantage over traditional topic modeling approaches.
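To make the pipeline concrete, the following is a minimal, self-contained Python sketch of the three-stage procedure described above, not the paper's actual implementation: synthetic Gaussian vectors stand in for the LLM's contextualized word embeddings, k-means centers serve as the net for the rounding step, NMF is used as one illustrative plug-in topic model, and the bandwidth `h` and net size `n_net` are hypothetical values chosen only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# --- Stand-in for the pre-trained LLM (treated as a black box) ---
# Each "document" is a sequence of d-dimensional word embeddings.
# Here we draw synthetic vectors; in practice these would come from an LLM encoder.
d, n_docs = 16, 200
docs = [rng.normal(size=(rng.integers(50, 150), d)) for _ in range(n_docs)]

# --- Step 1: net rounding ---
# Round every embedding to its nearest point of a finite net (built here with
# k-means), turning each document into a vector of counts over the net points.
all_words = np.vstack(docs)
n_net = 100                                   # hypothetical net size
net = KMeans(n_clusters=n_net, n_init=10, random_state=0).fit(all_words)
counts = np.zeros((n_docs, n_net))
for i, doc in enumerate(docs):
    labels = net.predict(doc)
    counts[i] = np.bincount(labels, minlength=n_net)

# --- Step 2: plug-in topic model on the rounded counts ---
# Any traditional topic-modeling method can be used here; NMF is one simple choice.
K = 5                                         # number of topics
nmf = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
doc_topic = nmf.fit_transform(counts)         # document-topic weights
topic_net = nmf.components_                   # topic weights over net points

# --- Step 3: kernel smoothing ---
# Smooth each topic's weights over the net points into an intensity estimate
# on the embedding space, using a Gaussian kernel with bandwidth h.
def topic_intensity(x, k, h=0.5):
    """Kernel-smoothed intensity of topic k at embedding point(s) x."""
    x = np.atleast_2d(x)
    sq_dists = ((x[:, None, :] - net.cluster_centers_[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-sq_dists / (2 * h ** 2))
    return kernel @ topic_net[k]

print(topic_intensity(rng.normal(size=(3, d)), k=0))
```

Because the topic-model step only sees the count matrix over net points, any traditional estimator (e.g., LDA instead of NMF) could be substituted in Step 2 without altering the rounding or smoothing stages.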