On-the-Fly Syntax Highlighting using Neural Networks

From arXiv. Accepted for publication in the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022).

With the availability of online collaborative tools for software developers, source code is shared and consulted frequently, in code viewers, merge requests, and code snippets. Typically, code highlighting quality in such scenarios is sacrificed in favor of system responsiveness. In these on-the-fly settings, performing a formal grammatical analysis of the source code is not only expensive, but also intractable in the many cases where the input is an invalid derivation of the language. Indeed, current popular highlighters rely heavily on a system of regular expressions, typically far from the specification of the language's lexer. Due to their complexity, these regular expressions need to be updated periodically as more feedback is collected from users, and their design discourages the detection of more complex language constructs. This paper delivers a deep learning-based approach suitable for on-the-fly grammatical code highlighting of correct and incorrect language derivations, in settings such as code viewers and snippets. It focuses on alleviating the burden on developers, who can reuse the language's parsing strategy to produce the desired highlighting specification. Moreover, this approach is compared to current online syntax highlighting tools and formal methods in terms of accuracy and execution time, across different levels of grammatical coverage, for three mainstream programming languages. The results show that the proposed approach consistently achieves near-perfect accuracy in its predictions, thereby outperforming regular expression-based strategies.
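To make the contrast in the abstract concrete, here is a minimal sketch of the regular-expression style of highlighter it describes. This is not code from the paper: the rule table, the class names, and the `highlight` function are hypothetical and chosen only for illustration. The sketch shows the two properties the abstract mentions: such a highlighter keeps working on invalid input, but it only sees lexical shapes, so it cannot tell grammatical roles apart.

```python
import re

# Hypothetical rule table (an assumption, not taken from any real highlighter):
# each entry is (highlighting class, pattern). Order matters, because the
# first alternative that matches at a position wins.
RULES = [
    ("comment", r"//[^\n]*"),
    ("string", r'"(?:\\.|[^"\\\n])*"'),
    ("keyword", r"\b(?:class|int|return|void)\b"),
    ("number", r"\b\d+\b"),
    ("identifier", r"[A-Za-z_]\w*"),
]

# Combine all rules into one pattern with a named group per class.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))


def highlight(source):
    """Return (text, class) pairs for every span that some rule matches."""
    return [(m.group(), m.lastgroup) for m in MASTER.finditer(source)]


if __name__ == "__main__":
    # A declared variable and a called function get the same class,
    # because the rules carry no grammatical context.
    print(highlight("int total = count(3); // sum"))
    # Input that is not a valid derivation is still highlighted.
    print(highlight("int = return ("))
```

In this sketch `total` (a declared variable) and `count` (a called function) both come out as `identifier`. Telling them apart requires the grammatical information that the paper's approach aims to recover without running a full parser on every request.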
