The biggest challenge in building chatbots is training data: the data must be realistic and large enough to train a chatbot. We build a tool to collect actual training data from the Facebook Messenger of a Facebook page. After text preprocessing steps, the newly obtained dataset yields the FVnC and Sample datasets. We use the retrained BERT model for Vietnamese (PhoBERT) to extract features from our text data. The K-Means and DBSCAN clustering algorithms are then applied to the output embeddings of PhoBERT$_{base}$. We use the V-measure score and Silhouette score to evaluate the performance of the clustering algorithms. We also demonstrate the efficiency of PhoBERT compared with other models for feature extraction on the Sample dataset and a wiki dataset. A GridSearch algorithm that combines both clustering evaluation metrics is also proposed to find the optimal parameters. By clustering such a large number of conversations, we save considerable time and effort in building the data and storylines needed to train a chatbot.
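To make the pipeline concrete, the following is a minimal sketch (not the authors' released code) of the described steps: extracting sentence embeddings with PhoBERT$_{base}$ via the Hugging Face hub checkpoint "vinai/phobert-base", clustering them with K-Means and DBSCAN, and scoring the result with the Silhouette coefficient. The input sentences, cluster counts, and DBSCAN parameters here are placeholders; the real pipeline uses the word-segmented FVnC corpus, and the V-measure additionally requires gold labels.

```python
# Sketch only: PhoBERT feature extraction + K-Means / DBSCAN clustering,
# evaluated with the Silhouette score. Placeholder data and parameters.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModel.from_pretrained("vinai/phobert-base")
model.eval()

# Placeholder Vietnamese messages; PhoBERT expects word-segmented input
# (e.g. produced by VnCoreNLP/RDRSegmenter) in real use.
sentences = ["Tôi muốn đặt hàng .", "Shop còn hàng không ?", "Giá bao nhiêu vậy ?"]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # Use the first-token ([CLS]-position) hidden state as the sentence embedding.
    embeddings = out.last_hidden_state[:, 0, :].numpy()

# Cluster the embeddings; n_clusters, eps and min_samples are illustrative values.
kmeans_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
dbscan_labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(embeddings)

# Internal evaluation without gold labels; v_measure_score would need true intents.
print("K-Means silhouette:", silhouette_score(embeddings, kmeans_labels))
```

In the same spirit, a grid search over n_clusters (for K-Means) or eps/min_samples (for DBSCAN) can select the parameters that jointly maximise the Silhouette and V-measure scores, which is the role of the proposed GridSearch procedure.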