Though word embeddings and topics are complementary representations, several past works have used pretrained word embeddings only to address data sparsity in (neural) topic modeling of short texts or small collections of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic-embeddings, and (2) pretrained word-embeddings (context-insensitive from GloVe and context-sensitive from BERT models), used jointly from one or many sources to improve topic quality and better handle polysemy. To do so, we first build respective pools of pretrained topic embeddings (TopicPool) and word embeddings (WordPool). We then identify one or more relevant source domains and transfer knowledge from them to guide meaningful learning in the sparse target domain. Within neural topic modeling, we quantify the quality of topics and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR), using short-text, long-text, small and large document collections from the news and medical domains. Using these multi-source, multi-view embedding spaces, we demonstrate state-of-the-art neural topic modeling on 6 source (high-resource) and 5 target (low-resource) corpora.
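To make the idea of multi-view knowledge transfer concrete, the following is a minimal sketch (not the authors' released code) of a VAE-style neural topic model in which a pretrained word-embedding matrix stands in for the WordPool view and a pretrained topic-embedding matrix stands in for the TopicPool view. The class name MultiViewNTM, the similarity-based transfer penalty, its 0.1 weight, and the random placeholder tensors are all illustrative assumptions; in practice the embeddings would come from GloVe/BERT and from topic models trained on the source corpora.

```python
# Minimal sketch of multi-view knowledge transfer in a neural topic model.
# WordPool view: pretrained word embeddings initialize the factorized decoder.
# TopicPool view: pretrained topic embeddings regularize the learned topics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewNTM(nn.Module):
    def __init__(self, vocab_size, n_topics, emb_dim, word_emb, topic_emb):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, 256), nn.ReLU())
        self.mu = nn.Linear(256, n_topics)
        self.logvar = nn.Linear(256, n_topics)
        # Decoder factorized through an embedding space so pretrained word
        # embeddings (WordPool view) transfer directly into the target model.
        self.word_emb = nn.Parameter(word_emb.clone())            # |V| x emb_dim
        self.topic_emb = nn.Parameter(torch.randn(n_topics, emb_dim) * 0.01)
        self.register_buffer("prior_topic_emb", topic_emb)        # TopicPool view

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterize
        theta = F.softmax(z, dim=-1)                               # doc-topic mixture
        beta = F.softmax(self.topic_emb @ self.word_emb.t(), dim=-1)  # topic-word dist.
        recon = torch.log(theta @ beta + 1e-10)
        nll = -(bow * recon).sum(1).mean()
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
        # Knowledge transfer: pull each learned topic vector toward its closest
        # pretrained topic embedding from the (hypothetical) TopicPool.
        sim = F.normalize(self.topic_emb, dim=1) @ F.normalize(self.prior_topic_emb, dim=1).t()
        transfer = (1.0 - sim.max(dim=1).values).mean()
        return nll + kld + 0.1 * transfer

# Random placeholders for pretrained embeddings and a toy bag-of-words batch.
vocab, k, d = 2000, 20, 300
model = MultiViewNTM(vocab, k, d, torch.randn(vocab, d), torch.randn(50, d))
loss = model(torch.rand(8, vocab))
loss.backward()
```

The design choice illustrated here is that both views enter the model differently: word embeddings shape the decoder's topic-word distributions, while topic embeddings act as a soft prior on the topic vectors themselves, which is one plausible way to combine the two complementary representations described above.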