There has been a growing interest in developing machine learning (ML) models for code learning tasks, e.g., comment generation and method naming. Despite the substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and testing sets, have not been well designed. Specifically, no prior work on the aforementioned topics considered the timestamps of code and comments during evaluation (e.g., examples in the testing set might be from 2010, while examples in the training set might be from 2020). This may lead to evaluations that are inconsistent with the intended use cases of the ML models. In this paper, we formalize a novel time-segmented evaluation methodology, as well as the two methodologies commonly used in the literature: mixed-project and cross-project. We argue that the time-segmented methodology is the most realistic. We also describe various use cases of ML models and provide a guideline for choosing the appropriate methodology for each use case. To assess the impact of methodologies, we collect a dataset of code-comment pairs with timestamps to train and evaluate several recent code learning ML models for the comment generation and method naming tasks. Our results show that different methodologies can lead to conflicting and inconsistent results. We invite the community to adopt the time-segmented evaluation methodology.
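To make the core idea concrete, here is a minimal sketch of a time-segmented split, assuming a dataset of code-comment pairs carrying commit timestamps; the `Example` record, the `time_segmented_split` helper, and the cutoff dates are illustrative assumptions, not the paper's released artifact.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Example:
    code: str
    comment: str
    project: str
    timestamp: datetime  # when the code-comment pair was committed

def time_segmented_split(examples, train_end, valid_end):
    """Split by timestamp rather than at random: train on data created
    before train_end, validate on [train_end, valid_end), and test on
    data from valid_end onward, so the model never sees the future."""
    train = [e for e in examples if e.timestamp < train_end]
    valid = [e for e in examples if train_end <= e.timestamp < valid_end]
    test = [e for e in examples if e.timestamp >= valid_end]
    return train, valid, test

# Usage (hypothetical cutoffs): simulate a model trained at the start of
# 2019 and deployed on code written after mid-2019.
# train, valid, test = time_segmented_split(
#     dataset, datetime(2019, 1, 1), datetime(2019, 7, 1))
```

In contrast, a mixed-project split assigns examples to sets at random, so a test example may predate training examples from the same project, which an actually deployed model could never have observed.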