Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews

Context: Mobile app reviews written by users on app stores or social media are significant resources for app developers. Analyzing app reviews has proved to be useful for many areas of software engineering (e.g., requirement engineering, testing). Automatic classification of app reviews requires extensive effort to manually curate a labeled dataset. When the classification purpose changes (e.g., identifying bugs versus usability issues or sentiment), new datasets must be labeled, which limits how far the developed models can be extended to new classes or tasks in practice. Recent pre-trained neural language models (PTMs) are trained on large corpora in an unsupervised manner and have found success in solving similar Natural Language Processing problems. However, the applicability of PTMs has not been explored for app review classification. Objective: We investigate the benefits of PTMs for app review classification compared to the existing models, as well as the transferability of PTMs in multiple settings. Method: We empirically study the accuracy and time efficiency of PTMs compared to prior approaches using six datasets from the literature. In addition, we investigate the performance of PTMs trained on app reviews (i.e., domain-specific PTMs). We set up different studies to evaluate PTMs in multiple settings: binary vs. multi-class classification, zero-shot classification (when new labels are introduced to the model), multi-task setting, and classification of reviews from different resources. The datasets are manually labeled app review datasets from the Google Play Store, the Apple App Store, and Twitter. In all cases, Micro and Macro Precision, Recall, and F1-scores will be used, and we will report the time required for training and prediction with the models.
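To make the difference between the Micro and Macro averages mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `micro_macro_scores`, the labels `bug`, `feature`, and `other`, and the toy predictions are all hypothetical. Macro F1 is taken here as the unweighted mean of the per-class F1 scores, which is one common convention.

```python
from collections import Counter


def micro_macro_scores(y_true, y_pred):
    """Return micro- and macro-averaged (precision, recall, F1) tuples."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def prf(tp_n, fp_n, fn_n):
        # Undefined ratios (zero denominator) are reported as 0.0.
        precision = tp_n / (tp_n + fp_n) if tp_n + fp_n else 0.0
        recall = tp_n / (tp_n + fn_n) if tp_n + fn_n else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    # Micro: pool the counts over all classes, then compute each metric once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute each metric per class, then take the unweighted mean.
    per_class = [prf(tp[c], fp[c], fn[c]) for c in labels]
    macro = tuple(sum(m[i] for m in per_class) / len(labels) for i in range(3))
    return {"micro": micro, "macro": macro}


# Hypothetical toy data: six reviews, three classes, one rare class ("other").
y_true = ["bug", "bug", "bug", "bug", "feature", "other"]
y_pred = ["bug", "bug", "bug", "feature", "feature", "bug"]
scores = micro_macro_scores(y_true, y_pred)
print(scores)
```

In this toy case the rare `other` class is never predicted correctly, so the macro scores fall well below the micro scores. That gap is why reporting both averages is informative when review classes are imbalanced.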
