Artificial intelligence (AI) for software engineering (SE) tasks has recently achieved promising performance. In this paper, we investigate to what extent pre-trained language models truly understand SE tasks such as code search and code summarization. We conduct a comprehensive empirical study on a broad set of AI for SE (AI4SE) tasks by feeding the models variant inputs: 1) inputs with various masking rates and 2) inputs reduced by a sufficient input subset method. The trained models are then evaluated on different SE tasks, including code search, code summarization, and duplicate bug report detection. Our experimental results show that pre-trained language models are insensitive to the given input, achieving similar performance on all three SE tasks. We refer to this phenomenon as overinterpretation, where a model confidently makes a decision without salient features, or where a model finds spurious relationships between the final decision and the dataset. Our study investigates two approaches to mitigate the overinterpretation phenomenon: the whole-word masking strategy and ensembling. To the best of our knowledge, we are the first to reveal this overinterpretation phenomenon to the AI4SE community. It serves as an important reminder for researchers to design model inputs carefully and calls for necessary future work on understanding and implementing AI4SE tasks.
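To make the "various masking rates" probe concrete, the following is a minimal sketch (not the paper's actual pipeline) of how a fixed fraction of input tokens can be replaced with a mask symbol before being fed to a model; the tokenizer, mask token, and example snippet are illustrative assumptions:

```python
import random

def mask_tokens(tokens, mask_rate, mask_token="<mask>", seed=0):
    """Randomly replace a fraction of the tokens with a mask token.

    Illustrates the masking-rate probe: as mask_rate grows, fewer
    salient features remain in the input, yet (per the paper's
    finding) model performance changes surprisingly little.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_rate)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_idx else t
            for i, t in enumerate(tokens)]

# Hypothetical whitespace-tokenized code snippet for illustration.
code = "def add ( a , b ) : return a + b".split()
for rate in (0.2, 0.5, 0.8):
    print(rate, " ".join(mask_tokens(code, rate)))
```

A whole-word masking variant, one of the paper's mitigations, would instead group subword tokens by their source word and mask whole words together rather than independent tokens.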