A Comparative Study on Transfer Learning and Distance Metrics in Semantic Clustering over the COVID-19 Tweets

This paper presents a comparative study of topic detection on COVID-19 data. Of the various approaches to topic detection, this paper adopts clustering. Clustering requires a distance measure, and computing distances requires an embedding. The aim of this research is therefore to study three factors jointly (embedding methods, distance metrics and clustering methods) together with their interactions. The dataset consists of one month of tweets collected using COVID-19-related hashtags. Five embedding methods, ranging from earlier to more recent ones, are selected: Word2Vec, fastText, GloVe, BERT and T5. Five clustering methods are investigated: k-means, DBSCAN, OPTICS, spectral clustering and Jarvis-Patrick. Euclidean distance and cosine distance, the most important distance metrics in this field, are also examined. First, more than 7,500 tests are performed to tune the parameters. Then, every combination of embedding method, distance metric and clustering method is evaluated with the silhouette metric, giving 50 combinations in total (5 embeddings × 2 distances × 5 clustering methods). The results of these 50 tests are examined first. Next, the rank of each method across all the tests involving that method is taken into account. Finally, the three major variables of the research (embedding methods, distance metrics and clustering methods) are studied separately, with averaging performed over the control variables to neutralize their effect. The experimental results show that T5 strongly outperforms the other embedding methods in terms of the silhouette metric. Among the distance metrics, cosine distance is slightly better. Among the clustering methods, DBSCAN is superior to the others.
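The evaluation design described above can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`euclidean`, `cosine_distance`, `silhouette`) and the four toy 2-D points are assumptions made for illustration, and the points stand in for real tweet embeddings. The sketch enumerates the 5 × 2 × 5 grid named in the abstract and computes a mean silhouette score for one labelling under each of the two distance metrics.

```python
import math
from itertools import product

# Factor levels as listed in the abstract.
EMBEDDINGS = ["Word2Vec", "fastText", "GloVe", "BERT", "T5"]
DISTANCES = ["euclidean", "cosine"]
CLUSTERERS = ["k-means", "DBSCAN", "OPTICS", "spectral", "Jarvis-Patrick"]

# Every (embedding, distance, clustering) combination in the study grid.
n_combinations = len(list(product(EMBEDDINGS, DISTANCES, CLUSTERERS)))

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def silhouette(points, labels, dist):
    """Mean silhouette coefficient of a labelling under a given distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy stand-in for embedded tweets (hypothetical data, not from the paper).
points = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # splits each natural group

good_eu = silhouette(points, good_labels, euclidean)
bad_eu = silhouette(points, bad_labels, euclidean)
good_cos = silhouette(points, good_labels, cosine_distance)

print("combinations:", n_combinations)
print("euclidean silhouette (good, bad):", good_eu, bad_eu)
print("cosine silhouette (good):", good_cos)
```

In a full pipeline, each grid cell would supply its own embedding vectors and cluster labels, and the per-cell scores could then be averaged over the two remaining factors to compare the levels of one factor at a time.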
