We examine the problem of learning a single occurrence regular expression with interleaving (SOIRE) from a set of noisy text strings. SOIREs support interleaving without restriction and cover most regular expressions used in practice. Learning SOIREs is challenging because it requires heavy computation and because text strings usually contain noise in practice. Most previous work learns only restricted SOIREs and is not robust to noisy data. To tackle these issues, we propose SOIREDL, a noise-tolerant differentiable learning approach for SOIREs. We design a neural network that simulates SOIRE matching of given text strings, and we theoretically prove that a class of parameter sets learnt by the neural network, called faithful encodings, is in one-to-one correspondence with SOIREs of bounded size. Based on this correspondence, we interpret the target SOIRE from the parameters of the neural network by exploring the nearest faithful encodings. Experimental results show that SOIREDL outperforms state-of-the-art approaches, especially on noisy data.
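As a small illustration of the interleaving operator that SOIREs build on, the sketch below checks whether a string is an interleaving of two fixed strings, that is, a merge that preserves the symbol order of each. It also checks the single-occurrence condition, under which each alphabet symbol appears at most once in an expression. This is not the SOIREDL method; the function names `is_interleaving` and `is_single_occurrence` are hypothetical and chosen only for this example.

```python
from functools import lru_cache
from typing import Sequence


def is_interleaving(u: str, v: str, w: str) -> bool:
    """Return True if w merges u and v while keeping each one's symbol order."""
    if len(u) + len(v) != len(w):
        return False

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        # i symbols of u and j symbols of v have been consumed so far.
        k = i + j
        if k == len(w):
            return True
        take_u = i < len(u) and u[i] == w[k] and match(i + 1, j)
        take_v = j < len(v) and v[j] == w[k] and match(i, j + 1)
        return take_u or take_v

    return match(0, 0)


def is_single_occurrence(symbols: Sequence[str]) -> bool:
    """Return True if no symbol is repeated in the given symbol sequence."""
    return len(symbols) == len(set(symbols))
```

For example, `is_interleaving("ab", "cd", "acbd")` holds because `a` still precedes `b` and `c` still precedes `d`, whereas `"bacd"` is rejected because it reverses the order of `a` and `b`.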