Word clouds have become a standard tool for presenting the results of natural language processing methods such as topic modelling. They display the most important words, with word size often chosen proportional to the relevance of each word within a topic. In the latent Dirichlet allocation (LDA) model, a word cloud is a graphical presentation of a vector of weights for the words within a topic. These vectors are the result of a statistical procedure applied to a specific corpus. They are therefore subject to uncertainty from several sources, such as sample selection, random components in the optimization algorithm, or parameter settings. A novel approach for presenting word clouds that includes information on these types of uncertainty is introduced and illustrated with an application of the LDA model to conference abstracts.
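To make the idea of uncertain word weights concrete, the following is a minimal sketch and not the approach proposed in the paper. It assumes that an LDA fit has been repeated several times (for example with different random seeds) and that the topic of interest has already been matched across runs, which is itself a nontrivial step. The function names `font_sizes` and `summarize_runs`, the `max_pt` parameter, and the example weights are all hypothetical.

```python
from statistics import mean, stdev


def font_sizes(weights, max_pt=48.0):
    """Scale font sizes proportionally to word weights.

    The word with the largest weight gets max_pt points (an assumed default).
    """
    top = max(weights.values())
    return {word: max_pt * w / top for word, w in weights.items()}


def summarize_runs(runs):
    """Return {word: (mean weight, sample standard deviation)} across runs.

    A word missing from a run is treated as having weight 0.0 in that run.
    """
    words = set().union(*runs)
    summary = {}
    for word in words:
        values = [run.get(word, 0.0) for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[word] = (mean(values), spread)
    return summary


# Hypothetical weights for one topic from three repeated fits.
runs = [
    {"model": 0.30, "data": 0.20, "topic": 0.10},
    {"model": 0.28, "data": 0.24, "topic": 0.06},
    {"model": 0.32, "data": 0.16, "topic": 0.14},
]

summary = summarize_runs(runs)
sizes = font_sizes({word: m for word, (m, _) in summary.items()})
for word in sorted(sizes, key=sizes.get, reverse=True):
    m, s = summary[word]
    print(f"{word}: size={sizes[word]:.1f}pt mean={m:.3f} sd={s:.3f}")
```

In a plot, the mean could drive the word size while the standard deviation is mapped to a second visual channel such as colour or transparency. That mapping is a design choice of this sketch, not something stated in the abstract.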