Topic modeling is the problem of discovering the main topics that occur in a corpus of text, and its solutions have important applications in many fields. In this work, inspired by recent advances in Natural Language Processing, we introduce FAME, an open-source framework that provides an efficient mechanism for extracting and incorporating textual features and using them to discover topics and to cluster semantically similar documents in a corpus. These features range from traditional frequency-based representations to recent auto-encoding embeddings from transformer-based language models such as the BERT model family. To demonstrate the effectiveness of the library, we conducted experiments on the well-known News-Group dataset. The library is available online.
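To make the feature-then-cluster idea concrete, the following is a minimal sketch that builds frequency-based (TF-IDF) features and groups documents by cosine similarity with a small k-means loop. It is not FAME's API: every function name, the stopword list, the toy corpus, and the seed-based initialization are assumptions made for illustration, and it uses only the Python standard library. A transformer-based variant would replace `tfidf_vectors` with sentence embeddings from a BERT-family model while keeping the clustering step unchanged.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption; real pipelines use larger lists).
STOPWORDS = {"the", "a", "an", "with", "will", "of", "and", "in", "to"}

def tokenize(text):
    # Lowercase, replace non-letters with spaces, split, and drop stopwords.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, seeds, iters=10):
    """Cluster unit vectors by cosine similarity.

    `seeds` lists the indices of the documents used as initial centroids,
    which keeps this sketch deterministic.
    """
    centroids = [vectors[i][:] for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: pick the most similar centroid for each document.
        labels = [
            max(range(len(centroids)), key=lambda k: dot(v, centroids[k]))
            for v in vectors
        ]
        # Update step: re-normalised mean of each cluster's members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = [sum(col) / len(members) for col in zip(*members)]
            norm = math.sqrt(sum(x * x for x in mean)) or 1.0
            centroids[k] = [x / norm for x in mean]
    return labels, centroids

def top_terms(vocab, centroid, n=2):
    # Highest-weighted terms of a centroid serve as a rough topic description.
    ranked = sorted(zip(centroid, vocab), key=lambda p: (-p[0], p[1]))
    return [term for _, term in ranked[:n]]

docs = [
    "The team won the hockey game last night",
    "The hockey season opens with a home game",
    "NASA launched a new space probe",
    "The space probe will orbit the moon",
]
vocab, vectors = tfidf_vectors(docs)
labels, centroids = kmeans_cosine(vectors, seeds=[0, 2])
for k, centroid in enumerate(centroids):
    print(k, top_terms(vocab, centroid))
```

On this toy corpus the two hockey documents and the two space documents should end up in separate clusters, and the highest-weighted centroid terms act as a crude topic label for each cluster.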