Well-trained machine-learning models, which leverage large amounts of open-source software data, have become an attractive approach to automating many software engineering tasks. Several SE tasks have been addressed with this approach, with performance gradually improving over the past several years thanks to better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data therefore have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; for others (e.g., JavaScript) the available data may be more focused on certain application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages, performing the same function, is rather similar, and in particular preserves identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to show that available multilingual training data (across different languages) can be used to amplify performance. We study this for three different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.