English research articles (RAs) are an essential genre in academia, and attempts to use NLP to support the development of academic writing ability have received considerable attention over the last two decades. However, no study has used feature engineering techniques to investigate the linguistic features of RAs with different academic impact (i.e., papers with high or moderate citation counts published in journals with high or moderate impact factors). This study extracts micro-level linguistic features from high- and moderate-impact journal RAs using feature engineering methods. We extracted 25 highly relevant features from the Corpus of English Journal Articles through feature selection methods. All papers in the corpus report empirical medical studies on COVID-19. The classification performance of the selected features was then validated, in terms of consistency and accuracy, through supervised machine learning methods. Results showed that 24 linguistic features, such as the overlap of content words between adjacent sentences and the use of third-person pronouns, auxiliary verbs, tense, and emotional words, provide consistent and accurate predictions for journal articles with different academic impact. Finally, the random forest model proved to be the best model for fitting the relationship between these 24 features and journal articles with high and moderate impact. These findings can inform academic writing courses and lay the foundation for developing automatic evaluation systems for L2 graduate students.
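To make the notion of a micro-level feature concrete, the sketch below shows one way two of the features named in the abstract (content-word overlap between adjacent sentences and third-person pronoun use) could be computed. It is an illustration only, not the study's tooling: the function-word list, the regex tokenizer, the Jaccard formulation of overlap, and all function names are assumptions introduced here.

```python
import re

# Deliberately small, illustrative function-word list (an assumption, not the
# study's); any token not in this set is treated as a content word.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to", "for",
    "with", "by", "from", "as", "is", "are", "was", "were", "be", "been",
    "has", "have", "had", "do", "does", "did", "will", "would", "can", "could",
    "may", "might", "should", "that", "this", "these", "those", "it", "its",
    "he", "she", "they", "them", "their", "his", "her", "we", "our", "i", "you",
    "not", "no",
}

THIRD_PERSON = {
    "he", "she", "it", "they", "him", "her", "them",
    "his", "hers", "its", "their", "theirs",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase alphabetic tokens, keeping internal apostrophes and hyphens."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", sentence.lower())


def content_words(sentence):
    """Set of tokens in the sentence that are not in FUNCTION_WORDS."""
    return {t for t in tokenize(sentence) if t not in FUNCTION_WORDS}


def adjacent_content_overlap(text):
    """Mean Jaccard overlap of content words across adjacent sentence pairs."""
    sents = [content_words(s) for s in split_sentences(text)]
    pairs = list(zip(sents, sents[1:]))
    if not pairs:
        return 0.0  # fewer than two sentences: no adjacent pair to compare
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


def third_person_rate(text):
    """Share of tokens that are third-person pronouns."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in THIRD_PERSON for t in tokens) / len(tokens)
```

Computed per article, values like these would form the columns of a feature matrix that feature selection and a supervised classifier could then consume. A real pipeline would more plausibly rely on a part-of-speech tagger than on a hand-written word list.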