Because regular expressions (regexes, for short) are of great practical importance, much research has addressed automatically generating regexes from positive and negative string examples. We tackle the problem of learning regexes faster from positive and negative strings by relying on a novel approach called `neural example splitting'. Our approach splits each example string into multiple parts using a neural network trained to group similar substrings from positive strings. This helps to learn a regex faster and, thus, more accurately, since we now learn from several shorter strings. We propose an effective regex synthesis framework called `SplitRegex' that synthesizes subregexes from the `split' positive substrings and produces the final regex by concatenating the synthesized subregexes. For the negative samples, we exploit pre-generated subregexes during the subregex synthesis process and perform matching against the negative strings, so that the final regex is consistent with (i.e., rejects) all negative strings. SplitRegex is a divide-and-conquer framework for learning target regexes: we split (= divide) positive strings and infer partial regexes for the parts, which is much more accurate than inferring from whole strings, and then concatenate (= conquer) the inferred regexes while satisfying the negative strings. We empirically demonstrate that the proposed SplitRegex framework substantially improves on previous regex synthesis approaches over four benchmark datasets.
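The divide-and-conquer idea can be illustrated with a minimal sketch. It is not the SplitRegex implementation: the splits are supplied by hand in place of the neural splitter, subregexes are chosen from a small fixed list of character classes rather than synthesized, and negative strings are only checked once at the end rather than during subregex synthesis. All names (`CANDIDATE_CLASSES`, `synthesize_subregex`, `split_regex`) are hypothetical.

```python
import re

# Assumption: a tiny candidate list, ordered from most specific to most general.
CANDIDATE_CLASSES = [r"\d", r"[a-z]", r"[A-Z]", r"[a-zA-Z]", r"\w", r"."]


def synthesize_subregex(parts):
    """Return a small regex that matches every substring in `parts`."""
    if len(set(parts)) == 1:
        # All positive examples share the same literal at this position.
        return re.escape(parts[0])
    lengths = {len(p) for p in parts}
    for cls in CANDIDATE_CLASSES:
        if all(re.fullmatch(f"(?:{cls})+", p) for p in parts):
            if len(lengths) == 1:
                # Fixed length: use an exact repetition count.
                return f"{cls}{{{lengths.pop()}}}"
            return f"{cls}+"
    raise ValueError(f"no candidate class covers {parts!r}")


def split_regex(positive_splits, negatives):
    """Divide: one subregex per part. Conquer: concatenate and check negatives."""
    if len({len(s) for s in positive_splits}) != 1:
        raise ValueError("every positive string must have the same number of parts")
    # Group the i-th part of every positive string together.
    columns = zip(*positive_splits)
    regex = "".join(synthesize_subregex(list(col)) for col in columns)
    # The result is rejected if it accepts any negative string.
    if any(re.fullmatch(regex, n) for n in negatives):
        return None
    return regex


# Hand-made splits standing in for the output of a learned splitter.
positive_splits = [["abc", "-", "123"], ["xy", "-", "4567"]]
negatives = ["abc123", "ABC-123", "abc-"]
regex = split_regex(positive_splits, negatives)
print(regex)
```

Each subproblem here sees only short substrings, which is the reason the abstract gives for splitting: inferring a pattern for `abc` and `xy` is easier than inferring one for the whole strings at once.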