Evaluating machine learning (ML) systems on their ability to learn known classifiers allows fine-grained examination of the patterns they can learn, which builds confidence when they are applied to the learning of unknown classifiers. This article presents MLRegTest, a new benchmark for ML systems on sequence classification, which contains training, development, and test sets drawn from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known obstacle to successful generalization by ML systems. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals they use (string, tier-string, subsequence, or combinations thereof). The logical complexity and the choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacity of different ML systems to learn such dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.