Advances in machine learning have encouraged researchers to apply these techniques to a myriad of software engineering tasks that rely on source code analysis, such as testing and vulnerability detection. The sheer number of resulting studies makes it difficult for the community to understand the current research landscape. This paper aims to summarize the current knowledge in applied machine learning for source code analysis. We review studies belonging to twelve categories of software engineering tasks, along with the corresponding machine learning techniques, tools, and datasets that have been applied to solve them. To do so, we conducted an extensive literature search and identified 479 primary studies published between 2011 and 2021. We summarize our observations and findings based on the identified studies. Our findings suggest that the use of machine learning techniques for source code analysis tasks is consistently increasing. We synthesize the commonly used steps and the overall workflow for each task and summarize the machine learning techniques employed. We identify a comprehensive list of available datasets and tools usable in this context. Finally, the paper discusses perceived challenges in this area, including the availability of standard datasets, reproducibility and replicability, and hardware resources.