Regular expressions (regexes) are widely used across computer science, including programming languages, string processing, and databases. However, existing tools for synthesizing or repairing regexes assume that the input examples are free of errors. In real industrial scenarios, this assumption does not always hold. This paper therefore presents a simple but effective template-based approach for generating regular expressions from noisy examples. Specifically, we present a data model, MetaParam, that extracts string features used to cluster the examples. We then propose a practical dynamic thresholding scheme that filters out anomalous examples by detecting knee points on cumulative distribution function (CDF) graphs. Finally, we design a template-based algorithm that translates a finite set of positive examples into a regular expression; the algorithm is efficient, interpretable, and extensible. We performed an experimental evaluation on four different extraction tasks applied to real-world datasets and obtained promising results in terms of F-measure. Moreover, gMeta achieves excellent results in real industrial scenarios.
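The abstract does not spell out the filtering or template steps, so the following is only a minimal Python sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes each example already carries a numeric anomaly score (for instance, a distance to its cluster centre), locates the knee of the empirical CDF with a simple maximum-gap heuristic, and then generalizes the surviving examples with a coarse character-class template. The names `knee_threshold`, `filter_examples`, `char_class`, and `synthesize` are hypothetical and do not come from the paper.

```python
import re
from collections import defaultdict


def knee_threshold(scores):
    """Return the score at the knee of the empirical CDF.

    Assumption: the knee is the point where the CDF rises furthest above
    the diagonal once scores are rescaled to [0, 1]. This is a simple
    heuristic, not necessarily the scheme used in the paper.
    """
    xs = sorted(scores)
    n = len(xs)
    if n == 0:
        return float("inf")
    span = xs[-1] - xs[0]
    if n < 3 or span == 0:
        return xs[-1]  # too few or identical scores: keep everything
    best_i, best_gap = 0, float("-inf")
    for i, x in enumerate(xs):
        gap = (i + 1) / n - (x - xs[0]) / span  # CDF height minus rescaled score
        if gap > best_gap:
            best_i, best_gap = i, gap
    return xs[best_i]


def filter_examples(examples, scores):
    """Keep only the examples whose score does not exceed the knee threshold."""
    t = knee_threshold(scores)
    return [e for e, s in zip(examples, scores) if s <= t]


def char_class(ch):
    """Map a character to a coarse regex class (hypothetical template alphabet)."""
    if ch in "0123456789":
        return r"\d"
    if "a" <= ch <= "z":
        return "[a-z]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    return re.escape(ch)  # anything else is kept as an escaped literal


def runs(s):
    """Collapse a string into [class, run_length] pairs."""
    out = []
    for ch in s:
        c = char_class(ch)
        if out and out[-1][0] == c:
            out[-1][1] += 1
        else:
            out.append([c, 1])
    return out


def quantifier(lo, hi):
    """Render a repetition range as a regex quantifier."""
    if lo == hi:
        return "" if lo == 1 else "{%d}" % lo
    return "{%d,%d}" % (lo, hi)


def synthesize(examples):
    """Build one alternative per class sequence, with min/max run lengths."""
    groups = defaultdict(list)
    for s in examples:
        r = runs(s)
        groups[tuple(c for c, _ in r)].append([n for _, n in r])
    alternatives = []
    for sig, lengths in groups.items():
        parts = []
        for i, c in enumerate(sig):
            ns = [row[i] for row in lengths]
            parts.append(c + quantifier(min(ns), max(ns)))
        alternatives.append("(?:%s)" % "".join(parts))
    return "|".join(alternatives)
```

A caller would pass the noisy examples and their scores to `filter_examples`, then hand the result to `synthesize` and check candidate strings with `re.fullmatch`. Grouping by class sequence keeps each alternative readable, at the cost of producing one alternative per distinct string shape.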