Users use Issue Tracking Systems to keep track of and manage issue reports in their repositories. An issue is a rich source of software information: it may report a problem, request a new feature, or merely ask a question about the software product. As the number of issues grows, managing them manually becomes harder, so automatic approaches have been proposed to facilitate the management of issue reports. This paper describes CatIss, an automatic CATegorizer of ISSue reports built upon the Transformer-based pre-trained RoBERTa model. CatIss classifies issue reports into three main categories: Bug reports, Enhancement/feature requests, and Questions. First, the datasets provided for the NLBSE tool competition are cleaned and preprocessed. Then, the pre-trained RoBERTa model is fine-tuned on the preprocessed dataset. Evaluated on about 80 thousand issue reports from GitHub, CatIss surpasses the competition baseline, TicketTagger, and achieves an 87.2% F1-score (micro average). Additionally, because CatIss is trained on a wide set of repositories, it is a generic prediction model and hence applicable to unseen software projects or projects with little historical data. Scripts for cleaning the datasets, training CatIss, and evaluating the model are publicly available.
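For readers unfamiliar with the reported metric, the following is a minimal sketch of how a micro-averaged F1-score can be computed for the three categories. It is not the authors' evaluation script: the function name `micro_f1`, the label strings, and the toy inputs are assumptions made for illustration only, and the example data is not taken from the paper.

```python
# Hypothetical label names for the three categories described above.
LABELS = ("bug", "enhancement", "question")


def micro_f1(y_true, y_pred, labels=LABELS):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correctly predicted this class
            elif pred == label and truth != label:
                fp += 1  # predicted this class, but it was another one
            elif truth == label and pred != label:
                fn += 1  # missed an instance of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up data): three of four predictions are correct.
y_true = ["bug", "bug", "enhancement", "question"]
y_pred = ["bug", "enhancement", "enhancement", "question"]
score = micro_f1(y_true, y_pred)
```

When every issue carries exactly one label, each misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with plain accuracy.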