Regular expressions are widely used in software development for extracting textual data, validating the structure of textual documents, and formatting data. Existing work on regular expression generation assumes that the input examples are free of errors; in real industrial scenarios, however, this assumption does not fully hold. In this paper, we present a simple but effective template-based approach to generating regular expressions from noisy examples. Specifically, we design an abstract data form (namely, MetaParam) to approximately describe and cluster the input examples. Then, we propose a practical dynamic thresholding scheme to filter out anomalous examples. Finally, we design a template-based regular expression generation algorithm that is efficient, interpretable, and extensible. We performed an experimental evaluation on two different extraction tasks applied to real-world datasets and obtained promising results in terms of precision. Moreover, our approach, gMeta, achieves excellent results in real industrial scenarios.
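To make the three stages above concrete (abstract and cluster, filter by a data-dependent threshold, fill a template), the following is a minimal illustrative sketch and not the paper's implementation. The abstract does not specify how MetaParam is defined or how the threshold is computed, so everything here is an assumption: the character-class `signature` merely stands in for MetaParam, the relative cutoff `ratio` stands in for the dynamic thresholding scheme, and the names `char_class`, `signature`, `run_lengths`, and `generate_regex` are hypothetical.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse regex token (assumes ASCII input)."""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # punctuation and whitespace are kept as literals


def signature(example):
    """Hypothetical stand-in for MetaParam: the sequence of class runs, lengths ignored."""
    return tuple(cls for cls, _ in groupby(char_class(ch) for ch in example))


def run_lengths(example):
    """Length of each run of same-class characters, aligned with signature()."""
    return [len(list(run)) for _, run in groupby(example, key=char_class)]


def generate_regex(examples, ratio=0.5):
    """Cluster by signature, drop small clusters, and fill a per-cluster template."""
    if not examples:
        raise ValueError("at least one example is required")

    # Stage 1: cluster examples that share the same abstract signature.
    clusters = {}
    for ex in examples:
        clusters.setdefault(signature(ex), []).append(ex)

    # Stage 2 (assumed rule): the cutoff scales with the largest cluster,
    # so it adapts to the data instead of being a fixed count.
    cutoff = ratio * max(len(members) for members in clusters.values())
    kept = {sig: m for sig, m in clusters.items() if len(m) >= cutoff}

    # Stage 3: one template per surviving cluster, with quantifiers taken
    # from the observed minimum and maximum run lengths.
    alternatives = []
    for sig, members in kept.items():
        lengths = [run_lengths(m) for m in members]
        parts = []
        for i, cls in enumerate(sig):
            lo = min(row[i] for row in lengths)
            hi = max(row[i] for row in lengths)
            if lo == hi == 1:
                quant = ""
            elif lo == hi:
                quant = "{%d}" % lo
            else:
                quant = "{%d,%d}" % (lo, hi)
            parts.append(cls + quant)
        alternatives.append("".join(parts))
    return "^(?:" + "|".join(alternatives) + ")$"


# Three well-formed dates and one noisy entry.
pattern = generate_regex(["2021-03-15", "1999-12-31", "2020-01-01", "N/A"])
print(pattern)
```

In this toy run the noisy entry forms a cluster of size one, which falls below the cutoff and is discarded, so the generated expression describes only the date-like examples. A fixed relative cutoff is the simplest possible choice; any realistic thresholding scheme would likely need to account for cluster count and size distribution.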