Research on COVID-19 has grown rapidly, both in biology and in other fields. This study conducts a text analysis using the LDA (latent Dirichlet allocation) topic model. We first scraped a total of 1127 articles and 5563 comments covering COVID-19 that were published on SCMP from Jan 20 to May 19. We then trained the LDA model and tuned its parameters, using Cv coherence as the model evaluation metric. With the optimal model, we analyze the dominant topics, the representative documents of each topic, and the inconsistency between articles and comments. Finally, three possible improvements are discussed.