The biggest challenge in building chatbots is training data: the data must be realistic and large enough to train a chatbot. We build a tool to collect actual training data from the Facebook Messenger of a Facebook page. After text preprocessing steps, the newly obtained dataset yields the FVnC and Sample datasets. We use the retrained BERT model for Vietnamese (PhoBERT) to extract features from our text data. The K-Means and DBSCAN clustering algorithms are then applied to the output embeddings of PhoBERT$_{base}$. We use the V-measure score and Silhouette score to evaluate the performance of the clustering algorithms. We also demonstrate the efficiency of PhoBERT compared with other models for feature extraction on the Sample dataset and a wiki dataset. A GridSearch algorithm that combines both clustering evaluation metrics is also proposed to find the optimal parameters. By clustering such a large number of conversations, we save considerable time and effort in building the data and storylines needed to train a chatbot.
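To make the pipeline concrete, the following is a minimal sketch (not the authors' released code) of the described steps: extracting sentence embeddings with PhoBERT$_{base}$ via the Hugging Face hub checkpoint "vinai/phobert-base", clustering them with K-Means and DBSCAN, and scoring the result with the Silhouette coefficient. The input sentences, cluster counts, and DBSCAN parameters here are placeholders; the real pipeline uses the word-segmented FVnC corpus, and the V-measure additionally requires gold labels.

```python
# Sketch only: PhoBERT feature extraction + K-Means / DBSCAN clustering,
# evaluated with the Silhouette score. Placeholder data and parameters.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModel.from_pretrained("vinai/phobert-base")
model.eval()

# Placeholder Vietnamese messages; PhoBERT expects word-segmented input
# (e.g. produced by VnCoreNLP/RDRSegmenter) in real use.
sentences = ["Tôi muốn đặt hàng .", "Shop còn hàng không ?", "Giá bao nhiêu vậy ?"]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # Use the first-token ([CLS]-position) hidden state as the sentence embedding.
    embeddings = out.last_hidden_state[:, 0, :].numpy()

# Cluster the embeddings; n_clusters, eps and min_samples are illustrative values.
kmeans_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
dbscan_labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(embeddings)

# Internal evaluation without gold labels; v_measure_score would need true intents.
print("K-Means silhouette:", silhouette_score(embeddings, kmeans_labels))
```

In the same spirit, a grid search over n_clusters (for K-Means) or eps/min_samples (for DBSCAN) can select the parameters that jointly maximise the Silhouette and V-measure scores, which is the role of the proposed GridSearch procedure.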