Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, significant time is required to learn these expressions, and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. Manually writing regular expressions becomes unattractive when a large number of values must be matched. Instead, we propose learning from a large corpus of manually authored but uncurated regular expressions mined from a public repository. The advantage of this approach is that we are able to extract salient features from a set of strings with limited feature-engineering overhead. Since the set of regular expressions covers a wide range of application domains, we expect them to be widely applicable. To demonstrate the potential effectiveness of our approach, we train a model using the extracted corpus of regular expressions for the task of semantic type classification. While our approach generally yields results that are inferior to the state of the art, our training data is much smaller and simpler, and a closer analysis of the performance results suggests this approach holds significant promise. We also demonstrate the possibility of using uncurated regular expressions for unsupervised learning.
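As an illustration of how a corpus of regular expressions might turn a set of strings into features, the following is a minimal sketch and not the authors' implementation. The `PATTERNS` list is a hypothetical stand-in for the mined corpus, and the choice of "fraction of values fully matched" as the feature is an assumption made for this example; the abstract does not specify how features are computed.

```python
import re

# Hypothetical stand-in for a mined, uncurated corpus of regular expressions.
# The actual corpus described in the abstract is not reproduced here.
PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",           # ISO-like date
    r"[\w.+-]+@[\w-]+\.[\w.-]+",    # rough e-mail shape
    r"\d{5}(-\d{4})?",              # US ZIP code shape
    r"[A-Z][a-z]+",                 # capitalized word
    r"(unbalanced",                 # invalid pattern, as uncurated data may contain
]


def compile_corpus(patterns):
    """Compile each pattern, silently skipping any that fail to compile."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            continue  # uncurated corpora can include broken expressions
    return compiled


def column_features(values, compiled):
    """Return, for each regex, the fraction of values it fully matches."""
    if not values:
        return [0.0] * len(compiled)
    return [
        sum(1 for value in values if regex.fullmatch(value)) / len(values)
        for regex in compiled
    ]


if __name__ == "__main__":
    corpus = compile_corpus(PATTERNS)
    print(column_features(["2021-01-05", "1999-12-31"], corpus))
```

A feature vector of this kind could then be passed to any standard classifier to predict a semantic type for the set of strings. Using match fractions rather than raw counts keeps the features comparable across sets of different sizes, which is one plausible design choice among several.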