When constructing supervised learning models, labelled examples are required to build a corpus and train a machine learning model. However, the majority of studies have built their labelled datasets manually, which is often a daunting task. To mitigate this problem, we have built an online tool called CodeLabeller. CodeLabeller is a web-based tool that aims to provide an efficient approach to labelling Java source files for supervised learning methods at scale by improving the data collection process throughout. CodeLabeller is tested by constructing a corpus of over a thousand source files obtained from a large collection of open-source Java projects and labelling each Java source file with its respective design patterns and a summary. Ten experts in the field of software engineering participated in a usability evaluation of the tool using the UEQ-S questionnaire. The survey demonstrates that the tool is easy to use and meets the needs of labelling the corpus for supervised classifiers. Apart from assisting researchers in crowdsourcing a labelled dataset, the tool has practical applicability in software engineering education and assists in building expert ratings for software artefacts.