Regular expressions are a classical concept in formal language theory. Regular expressions in programming languages (RegEx), such as those of JavaScript, feature non-standard operator semantics (e.g., greedy/lazy Kleene star) as well as additional features such as capturing groups and references. While symbolic execution of programs containing RegExes calls for string solvers that natively support these important RegEx features, such a string solver has hitherto been missing. In this paper, we propose the first string theory and string solver that natively provide such support. The key idea of our string solver is to introduce a new automata model, called prioritized streaming string transducers (PSSTs), to formalize the semantics of RegEx-dependent string functions. PSSTs combine priorities, previously introduced in prioritized finite-state automata to capture greedy/lazy semantics, with string variables, as in streaming string transducers, to model capturing groups. We validate the consistency of the formal semantics with the actual JavaScript semantics through extensive experiments. Furthermore, to solve the string constraints, we show that PSSTs enjoy good closure and algorithmic properties, in particular the regularity-preserving property (i.e., pre-images of regular constraints under PSSTs are regular), and we introduce a sound sequent calculus that exploits these properties and propagates regular constraints by taking post-images or pre-images. Although satisfiability of the string constraint language is undecidable, we show that our approach is complete for the so-called straight-line fragment. We evaluate the performance of our string solver on over 195,000 string constraints generated from an open-source RegEx library. The experimental results show the efficacy of our approach, which drastically improves on existing methods in both precision and efficiency.
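
To make the RegEx features named above concrete, the following is a minimal TypeScript sketch of how greedy and lazy Kleene star assign different substrings to capturing groups, and how a replacement string can reference those groups. The patterns, inputs, and function names (`greedySplit`, `lazySplit`, `swapPair`) are hypothetical illustrations and are not taken from the paper or its benchmark set.

```typescript
// Illustrative only: patterns, inputs, and function names are assumptions
// made for this sketch, not part of the paper.

// Greedy star: the first a* consumes as many 'a's as possible,
// so the second group is left with the empty string.
function greedySplit(input: string): [string, string] | null {
  const m = /^(a*)(a*)$/.exec(input);
  return m ? [m[1] ?? "", m[2] ?? ""] : null;
}

// Lazy star: the first a*? consumes as few 'a's as possible,
// so the second (greedy) group receives the whole input.
function lazySplit(input: string): [string, string] | null {
  const m = /^(a*?)(a*)$/.exec(input);
  return m ? [m[1] ?? "", m[2] ?? ""] : null;
}

// References to capturing groups in a replacement string:
// "$1" and "$2" stand for the text matched by groups 1 and 2.
function swapPair(input: string): string {
  return input.replace(/^(\w+)=(\w+)$/, "$2=$1");
}

console.log(greedySplit("aaa"));  // [ 'aaa', '' ]
console.log(lazySplit("aaa"));    // [ '', 'aaa' ]
console.log(swapPair("lang=js")); // js=lang
```

Both patterns accept exactly the same set of strings, yet they bind the groups differently. This is why a solver that treats a RegEx only as a classical regular language cannot predict the result of a function such as `replace`, and why the semantics must track match priorities and group contents together.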