Software artifacts often interact with each other throughout the software development cycle. Associating related artifacts is a common practice for effective documentation and maintenance of software projects. Conventionally, to record the link between an issue report and its associated commit, developers manually include the issue identifier in the message of the relevant commit. Research has shown that developers often forget to connect these artifacts manually, so links are lost. Hence, several link recovery techniques have been proposed to discover and restore such links automatically. However, the literature mainly focuses on improving prediction accuracy on a randomly split test set, while neglecting other important aspects of the problem, including the effect of time and the generalizability of the predictive models. In this paper, we propose LinkFormer to address this problem from three aspects: 1) Accuracy: To better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple pre-trained models on the textual data and metadata of issues and commits. 2) Data leakage: To empirically assess the impact of time through the splitting policy, we train and test our proposed model, along with several existing approaches, on both randomly and temporally split data. 3) Generalizability: To provide a generic model that performs well across different projects, we further fine-tune LinkFormer in two transfer learning settings. We empirically show that researchers should preserve the temporal flow of data when training learning-based models, so that evaluation resembles the real-world setting. In addition, LinkFormer significantly outperforms the state of the art by large margins. LinkFormer can also extend the knowledge it has learned to unseen projects with little to no historical data.
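To make the data leakage point concrete, the following minimal Python sketch contrasts a random split with a temporal split of issue-commit links. It is an illustration under stated assumptions, not LinkFormer's actual pipeline: the record layout (issue identifier, commit hash, commit timestamp), the 20% test ratio, and the function names `random_split`, `temporal_split`, and `leaks_future` are all hypothetical.

```python
import random
from datetime import datetime, timedelta

# Hypothetical toy data: (issue_id, commit_hash, commit_timestamp).
# Real datasets would carry issue text, commit messages, and metadata.
links = [
    (f"PROJ-{i}", f"c{i:02d}", datetime(2021, 1, 1) + timedelta(days=i))
    for i in range(10)
]

def _cut_index(n, test_ratio):
    """Index that separates the training part from the test part."""
    n_test = max(1, round(n * test_ratio))
    return n - n_test

def random_split(records, test_ratio=0.2, seed=0):
    """Shuffle records and split them, ignoring their timestamps."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = _cut_index(len(shuffled), test_ratio)
    return shuffled[:cut], shuffled[cut:]

def temporal_split(records, test_ratio=0.2):
    """Sort records by timestamp so the test set is strictly the newest part."""
    ordered = sorted(records, key=lambda r: r[2])
    cut = _cut_index(len(ordered), test_ratio)
    return ordered[:cut], ordered[cut:]

def leaks_future(train, test):
    """True if any training record is newer than the earliest test record."""
    earliest_test = min(r[2] for r in test)
    return any(r[2] > earliest_test for r in train)

if __name__ == "__main__":
    for name, split in (("random", random_split), ("temporal", temporal_split)):
        train, test = split(links)
        print(name, "split leaks future data:", leaks_future(train, test))
```

A temporal split never trains on records newer than the ones it is tested on, which is the property the abstract argues for. A random split gives no such guarantee, so a model evaluated that way may benefit from information that would not be available in practice.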