Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this paper, we present GraphGen4Code, a toolkit for building code knowledge graphs that can similarly power applications such as program search, code understanding, bug detection, and code automation. GraphGen4Code uses generic techniques to capture code semantics, with the key nodes in the graph representing classes, functions, and methods. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code) and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). Our toolkit uses named graphs in RDF to model graphs per program, or can output graphs as JSON. We show the scalability of the toolkit by applying it to 1.3 million Python files drawn from GitHub, 2,300 Python modules, and 47 million forum posts. This results in an integrated code graph with over 2 billion triples. We make the toolkit for building such graphs, as well as the sample extraction of the 2-billion-triple graph, publicly available to the community.
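
To make the graph model concrete, the following is a minimal sketch of how one program's data-flow and documentation edges could be stored in a named graph and emitted either as N-Quads-style RDF or as JSON. It is an illustration only, not the toolkit's implementation: the `http://example.org/code/` namespace, the `flowsTo` and `hasDocumentation` predicates, the JSON layout, and all function names are assumptions made for this example and are not GraphGen4Code's actual vocabulary or API.

```python
import json
from collections import defaultdict

# Hypothetical namespace and predicates, chosen for illustration only;
# they are NOT GraphGen4Code's actual schema.
EX = "http://example.org/code/"
FLOWS_TO = EX + "flowsTo"
HAS_DOC = EX + "hasDocumentation"


def build_quads(program_uri, calls, docs):
    """Return (subject, predicate, object, graph) quads for one program.

    calls: list of (source_function, target_function) data-flow pairs.
    docs:  dict mapping a function name to a documentation string.
    The program URI is used as the named graph for every quad.
    """
    quads = []
    for src, dst in calls:
        quads.append((EX + src, FLOWS_TO, EX + dst, program_uri))
    for func, text in docs.items():
        quads.append((EX + func, HAS_DOC, text, program_uri))
    return quads


def to_nquads(quads):
    """Serialize quads as N-Quads-style lines, one statement per line."""
    lines = []
    for s, p, o, g in quads:
        # Objects that look like URIs become IRIs; anything else becomes
        # a quoted literal (json.dumps handles quoting and escaping).
        obj = f"<{o}>" if o.startswith("http://") else json.dumps(o)
        lines.append(f"<{s}> <{p}> {obj} <{g}> .")
    return "\n".join(lines)


def to_json(quads):
    """Group the same statements by named graph and emit them as JSON."""
    by_graph = defaultdict(list)
    for s, p, o, g in quads:
        by_graph[g].append({"subject": s, "predicate": p, "object": o})
    return json.dumps(by_graph, indent=2, sort_keys=True)


# One hypothetical program whose output of read_csv flows into SVC.fit.
quads = build_quads(
    EX + "program/train.py",
    calls=[("pandas.read_csv", "sklearn.svm.SVC.fit")],
    docs={"pandas.read_csv": "Read a CSV file into a DataFrame."},
)
print(to_nquads(quads))
print(to_json(quads))
```

Keeping the program URI as the fourth element of each statement is what lets graphs from many programs share one store while remaining separable per program, which is the role named graphs play in the description above.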