The unprecedented outbreak of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, continues to be a significant worldwide problem. A surge of new COVID-19-related research has followed. The growing number of publications requires document organization methods to identify relevant information. In this paper, we expand upon our previous work on clustering the CORD-19 dataset by applying multi-dimensional analysis methods. Tensor factorization is a powerful unsupervised learning method capable of discovering hidden patterns in a document corpus. We show that a higher-order representation of the corpus allows for the simultaneous grouping of similar articles, relevant journals, authors with similar research interests, and topic keywords. These groupings are identified within and among the latent components extracted via tensor decomposition. We further demonstrate the application of this method with a publicly available interactive visualization of the dataset.
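As a rough illustration of how a tensor decomposition can group several kinds of entities at once, the following minimal sketch factorizes a toy document × term × journal count tensor with a rank-2 CP (CANDECOMP/PARAFAC) model fitted by alternating least squares. The abstract does not specify the tensor modes, the decomposition algorithm, or the rank used, so the choice of CP-ALS, the function name `cp_als`, the toy data, and all parameter values here are assumptions made purely for illustration.

```python
import numpy as np


def cp_als(X, rank, n_iter=200, seed=0):
    """Fit a rank-`rank` CP model to a 3-way array by alternating least squares.

    Illustrative helper (not from the paper): returns one factor matrix per mode.
    """
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((dim, rank)) for dim in X.shape)
    for _ in range(n_iter):
        # Each update is a linear least-squares solve with the other two factors fixed.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C


# Toy corpus (assumed data): two hidden topics, each tying together a set of
# documents, a set of terms, and a set of journals.
docs = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
terms = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
journals = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)
X = np.einsum("ir,jr,kr->ijk", docs, terms, journals)

A, B, C = cp_als(X, rank=2)

# Relative reconstruction error of the fitted model.
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Assign every document, term, and journal to its dominant latent component.
doc_groups = np.abs(A).argmax(axis=1)
term_groups = np.abs(B).argmax(axis=1)
journal_groups = np.abs(C).argmax(axis=1)
```

Each latent component yields one column in every factor matrix, so the same component index links a group of documents to its terms and journals. A fourth mode for authors would follow the same pattern with one more factor matrix.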