Traditionally, Latent Dirichlet Allocation (LDA) ingests the words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries, such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The chi-squared, t, and WPE tokenizers are trained on Wikipedia text to identify words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that models trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
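For concreteness, the following is a minimal Python sketch of the statistical merging idea described above: scoring adjacent token pairs over corpus counts with Pearson's chi-squared statistic and the t-statistic (as in standard collocation detection), then joining high-scoring pairs into single tokens before they are passed to LDA. The function names, the cutoff values, and the greedy left-to-right merge pass are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def bigram_statistics(tokens):
    """Score each adjacent token pair with Pearson's chi-squared statistic
    and the t-statistic over corpus counts. Returns {bigram: (chi2, t)}."""
    n = len(tokens) - 1                          # number of bigram positions
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        c1, c2 = unigrams[w1], unigrams[w2]
        o12 = c1 - o11                           # w1 followed by another word
        o21 = c2 - o11                           # another word followed by w2
        o22 = n - o11 - o12 - o21                # neither w1 nor w2
        # Pearson's chi-squared on the 2x2 contingency table
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        # t-statistic: observed bigram probability vs. independence baseline
        p_obs = o11 / n
        p_exp = (c1 / n) * (c2 / n)
        t = (p_obs - p_exp) / (p_obs / n) ** 0.5
        scores[(w1, w2)] = (chi2, t)
    return scores

def merge_tokens(tokens, scores, chi2_cutoff=3.84, t_cutoff=2.576):
    """Greedily join adjacent pairs whose scores pass both cutoffs
    (illustrative thresholds, not the paper's tuned values)."""
    out, i = [], 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair and pair in scores:
            chi2, t = scores[pair]
            if chi2 > chi2_cutoff and t > t_cutoff:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out
```

Under this sketch, a strongly associated pair such as a compound noun would exceed both cutoffs and emerge as a single merged token in the LDA vocabulary, while incidental adjacencies would be left unmerged.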