We study the problem of learning a single occurrence regular expression with interleaving (SOIRE) from a set of text strings that may contain noise. SOIREs fully support interleaving and cover a large portion of the regular expressions used in practice. Learning SOIREs is challenging because it is computationally expensive and because real-world text strings are often noisy. Most previous studies learn only restricted SOIREs and are not robust to noisy data. To tackle these issues, we propose SOIREDL, a noise-tolerant differentiable learning approach for SOIREs. We design a neural network that simulates SOIRE matching and prove that certain assignments of the network's learned parameters, called faithful encodings, correspond one-to-one to SOIREs of bounded size. Based on this correspondence, we interpret the target SOIRE from a learned parameter assignment by exploring the nearest faithful encodings. Experimental results show that SOIREDL outperforms state-of-the-art approaches, especially on noisy data.
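To make the notion of SOIRE matching concrete, the following is a minimal sketch of a symbolic membership test for expressions in which every symbol occurs at most once. It is not SOIREDL and not the neural matcher described above; the class names (`Sym`, `Cat`, `Alt`, `Shuffle`, `Star`) and the functions `alphabet` and `matches` are hypothetical names chosen for illustration. The sketch assumes single-character symbols and relies on the single-occurrence property: the two operands of any operator have disjoint alphabets, so interleaving can be checked by projecting the string onto each operand's alphabet.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Sym:          # a single alphabet symbol (assumed to be one character)
    ch: str


@dataclass(frozen=True)
class Cat:          # concatenation: left followed by right
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Alt:          # union: left or right
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Shuffle:      # interleaving: left & right
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Star:         # Kleene star: zero or more repetitions
    inner: "Node"


Node = Union[Sym, Cat, Alt, Shuffle, Star]


def alphabet(e: Node) -> frozenset:
    """Set of symbols occurring in e."""
    if isinstance(e, Sym):
        return frozenset({e.ch})
    if isinstance(e, Star):
        return alphabet(e.inner)
    return alphabet(e.left) | alphabet(e.right)


def matches(e: Node, w: str) -> bool:
    """Check whether w is in the language of the single-occurrence expression e."""
    if isinstance(e, Sym):
        return w == e.ch
    if isinstance(e, Alt):
        return matches(e.left, w) or matches(e.right, w)
    if isinstance(e, Star):
        # Naive splitting: peel off a non-empty prefix matched by the inner expression.
        return w == "" or any(
            matches(e.inner, w[:i]) and matches(e, w[i:])
            for i in range(1, len(w) + 1)
        )
    left_syms = alphabet(e.left)
    if isinstance(e, Cat):
        # Disjoint alphabets: the split point is the first symbol outside the left alphabet.
        i = next((k for k, c in enumerate(w) if c not in left_syms), len(w))
        return matches(e.left, w[:i]) and matches(e.right, w[i:])
    if isinstance(e, Shuffle):
        right_syms = alphabet(e.right)
        if any(c not in left_syms and c not in right_syms for c in w):
            return False
        # Project w onto each operand's alphabet and match the projections separately.
        u = "".join(c for c in w if c in left_syms)
        v = "".join(c for c in w if c in right_syms)
        return matches(e.left, u) and matches(e.right, v)
    raise TypeError(f"unknown node type: {type(e).__name__}")


# Example expression: a & (b c*)
example = Shuffle(Sym("a"), Cat(Sym("b"), Star(Sym("c"))))
```

With this sketch, `matches(example, "bac")` holds because the projections `"a"` and `"bc"` match the two operands, while `matches(example, "cab")` fails because `c` precedes `b`. The star case uses naive splitting and is meant only for short strings.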