We present CV4Code, a compact and effective computer vision method for source code understanding. Our method exploits both the contextual and the structural information in a code snippet by treating the snippet as a two-dimensional image, which naturally encodes context and retains the underlying structure through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that allows source code images to be generated quickly and eliminates the redundancy that an RGB pixel representation would introduce. Furthermore, because source code is treated as an image, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to the programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code, which is not possible with methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by training convolutional and Transformer networks to predict the functional task of a source code snippet, i.e. the problem it solves, directly from its two-dimensional representation, and by using an embedding from the network's latent space to derive a similarity score between two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time, we show the benefits of treating source code understanding as a form of image processing task.
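As a rough illustration of the ASCII codepoint-based image idea, the sketch below maps a snippet to a fixed-size two-dimensional grid in which each cell holds the codepoint of one character. The function name `code_to_codepoint_image`, the grid size, the zero padding value, the tab expansion, and the handling of non-ASCII characters are all assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List


def code_to_codepoint_image(
    snippet: str, height: int = 8, width: int = 16, pad: int = 0
) -> List[List[int]]:
    """Map a code snippet to a height x width grid of ASCII codepoints.

    Hypothetical helper: grid size, padding value and cropping policy
    are illustrative assumptions, not taken from the paper.
    """
    rows: List[List[int]] = []
    # Expand tabs so indentation occupies a consistent number of cells,
    # then keep at most `height` lines (extra lines are cropped).
    for line in snippet.expandtabs(4).splitlines()[:height]:
        # One cell per character; non-ASCII characters fall back to `pad`
        # (an assumption), and lines longer than `width` are cropped.
        codes = [ord(ch) if ord(ch) < 128 else pad for ch in line[:width]]
        # Right-pad short lines so every row has the same width.
        codes += [pad] * (width - len(codes))
        rows.append(codes)
    # Bottom-pad short snippets so every image has the same height.
    rows += [[pad] * width for _ in range(height - len(rows))]
    return rows


if __name__ == "__main__":
    # No tokeniser or parser is involved, so a syntactically invalid
    # snippet is encoded exactly like a valid one.
    broken = "def f(x:\n    return x +"
    for row in code_to_codepoint_image(broken, height=3, width=12):
        print(row)
```

Because each character occupies a single integer cell, the grid is a one-channel image, and the row and column position of each cell preserves the snippet's layout, such as indentation depth. A grid like this could then be passed to a convolutional or Transformer model.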