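Distributed representation means representing language as dense, low-dimensional, continuous vectors. Researchers noticed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Word embedding methods can be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows generally yield embeddings that reflect topical information, while smaller context windows yield embeddings that reflect a word's function and contextual semantics.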
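The analogy relation can be illustrated with a minimal sketch. The vectors below are hand-built toy values, not learned embeddings, and the names `TOY_EMBEDDINGS`, `cosine`, and `analogy` are hypothetical helpers introduced only for this example. The idea is to form the offset vector b − a + c and pick the remaining word whose vector has the highest cosine similarity to it.

```python
import math

# Hand-built toy vectors (NOT learned embeddings). The four dimensions are
# [royalty, gender, object-kind, plural], chosen purely for illustration.
TOY_EMBEDDINGS = {
    "man":    [0.0,  1.0,  0.0, 0.0],
    "woman":  [0.0, -1.0,  0.0, 0.0],
    "king":   [1.0,  1.0,  0.0, 0.0],
    "queen":  [1.0, -1.0,  0.0, 0.0],
    "apple":  [0.0,  0.0,  1.0, 0.0],
    "apples": [0.0,  0.0,  1.0, 1.0],
    "car":    [0.0,  0.0, -1.0, 0.0],
    "cars":   [0.0,  0.0, -1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=TOY_EMBEDDINGS):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Offset arithmetic: d should lie close to b - a + c.
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words, as is common in analogy evaluation.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("man", "woman", "king"))    # queen
print(analogy("apple", "apples", "car"))  # cars
```

With real embeddings trained on a corpus the match is only approximate, so the nearest neighbour of the offset vector is not guaranteed to be the expected word.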

VIP content

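Title: Research on Topic Models Based on Deep Learning

Abstract: Topic modeling, a research problem with more than twenty years of development, has long been an important tool for document-level semantic understanding of text. Topic models are good at extracting groups of keywords from a set of documents to express the core ideas of that collection, and therefore provide important support for other text analysis tasks such as text classification, information retrieval, automatic summarization, text generation, and sentiment analysis. Although traditional probabilistic topic models based on three-layer Bayesian networks have been studied extensively over the past decade or so, the widespread application of deep learning in natural language processing has given new life to topic models that incorporate deep learning ideas and methods. How to integrate advanced deep learning techniques to build more accurate and efficient text generation models has become the main task of deep-learning-based topic modeling. This paper first outlines and compares four classic probabilistic topic models and two sparsity-constrained topic models among traditional topic models. It then surveys recent progress on deep-learning-based topic models, analyzes their connections to, differences from, and advantages over traditional models, and summarizes, analyzes, and compares the main research directions and advances. In addition, the paper introduces commonly used public datasets and evaluation metrics for topic models. Finally, it summarizes the characteristics of existing topic modeling techniques and discusses future trends in deep-learning-based topic models.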


Latest content

We compare two orthogonal semi-supervised learning techniques, namely tri-training and pretrained word embeddings, in the task of dependency parsing. We explore language-specific FastText and ELMo embeddings and multilingual BERT embeddings. We focus on a low resource scenario as semi-supervised learning can be expected to have the most impact here. Based on treebank size and available ELMo models, we select Hungarian, Uyghur (a zero-shot language for mBERT) and Vietnamese. Furthermore, we include English in a simulated low-resource setting. We find that pretrained word embeddings make more effective use of unlabelled data than tri-training but that the two approaches can be successfully combined.

