It has been reported that clustering-based topic models, which cluster high-quality sentence embeddings and apply an appropriate word selection method, can generate better topics than generative probabilistic topic models. However, these approaches suffer from two shortcomings: difficulty in selecting appropriate parameters, and incomplete models that overlook the quantitative relations between words and topics and between topics and documents. To address these issues, we propose graph to topic (G2T), a simple but effective framework for topic modelling. The framework comprises four modules. First, document representations are acquired with pretrained language models. Second, a semantic graph is constructed according to the similarity between document representations. Third, communities in the document semantic graph are identified, and the relationship between topics and documents is quantified accordingly. Fourth, the word-topic distribution is computed based on a variant of TF-IDF. Automatic evaluation suggests that G2T achieves state-of-the-art performance on both English and Chinese documents of varying lengths. Human judgements show that G2T produces topics with better interpretability and coverage than baselines. In addition, G2T not only determines the number of topics automatically but also gives the probabilistic distributions of words within topics and of topics within documents. Finally, G2T is publicly available, and the distillation experiments provide insight into how it works.
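The four-module pipeline above can be sketched in miniature. The snippet below is a hedged illustration, not the authors' implementation: it assumes documents are already embedded as vectors (in practice a pretrained language model would supply these), thresholds cosine similarity to build the semantic graph, uses connected components as a simple stand-in for community detection, and scores topic words with one possible class-based TF-IDF variant; the threshold value and the exact TF-IDF formula used by G2T are assumptions here.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def build_graph(embeddings, threshold=0.8):
    """Module 2 (sketch): connect documents whose similarity exceeds a threshold."""
    n = len(embeddings)
    adj = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def communities(adj, n):
    """Module 3 (sketch): connected components as a stand-in for
    the community-detection algorithm; each component is one topic."""
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node] - seen)
        comps.append(sorted(comp))
    return comps


def topic_words(docs, comps, top_k=3):
    """Module 4 (sketch): a class-based TF-IDF variant (assumed form):
    term frequency within a topic, weighted by inverse topic frequency."""
    topic_tf = [Counter(w for d in comp for w in docs[d].lower().split())
                for comp in comps]
    n_topics = len(comps)
    df = Counter()
    for tf in topic_tf:
        df.update(tf.keys())
    out = []
    for tf in topic_tf:
        scored = {w: c * math.log(1 + n_topics / df[w]) for w, c in tf.items()}
        out.append([w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:top_k]])
    return out
```

With four toy documents embedded as two well-separated 2-D clusters, the graph splits into two components, so two topics emerge; normalising each topic's word scores would give the word-in-topic distribution the abstract refers to.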