Context: Advances in machine learning techniques have encouraged researchers to apply them to a myriad of software engineering tasks that rely on source code analysis, such as testing and vulnerability detection. The large number of such studies makes it difficult for the community to understand the current landscape. Objective: We aim to summarize the current knowledge in the area of applied machine learning for source code analysis. Method: We investigate studies belonging to twelve categories of software engineering tasks, along with the corresponding machine learning techniques, tools, and datasets that have been applied to solve them. To do so, we carried out an extensive literature search and identified 364 primary studies published between 2002 and 2021. We summarize our observations and findings based on the identified studies. Results: Our findings suggest that the use of machine learning techniques for source code analysis tasks is consistently increasing. We synthesize the commonly used steps and the overall workflow for each task, and summarize the machine learning techniques employed. Additionally, we collate a comprehensive list of available datasets and tools usable in this context. Finally, we summarize the perceived challenges in this area, which include the availability of standard datasets, reproducibility and replicability, and hardware resources.