Data augmentation is an important component both in evaluating the robustness of natural language processing (NLP) models and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards, and robustness analysis results are publicly available in the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).
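To make the distinction between transformations and filters concrete, the following minimal sketch shows one of each. It is not the NL-Augmenter API: the class names `SwapAdjacentWords` and `MinLengthFilter`, the method names `generate` and `keep`, and the helper `build_perturbed_subset` are all hypothetical and chosen only for illustration. The transformation modifies a sentence, while the filter decides whether a sentence belongs to a particular data split.

```python
import random
from typing import List, Tuple


class SwapAdjacentWords:
    """Hypothetical transformation: swaps one random pair of adjacent words."""

    def __init__(self, seed: int = 0) -> None:
        # A private, seeded generator keeps the perturbation reproducible.
        self.rng = random.Random(seed)

    def generate(self, sentence: str) -> List[str]:
        words = sentence.split()
        if len(words) < 2:
            # Nothing to swap; return the input unchanged.
            return [sentence]
        i = self.rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
        return [" ".join(words)]


class MinLengthFilter:
    """Hypothetical filter: keeps sentences with at least `min_words` words."""

    def __init__(self, min_words: int = 5) -> None:
        self.min_words = min_words

    def keep(self, sentence: str) -> bool:
        return len(sentence.split()) >= self.min_words


def build_perturbed_subset(
    sentences: List[str],
    transformation: SwapAdjacentWords,
    data_filter: MinLengthFilter,
) -> List[Tuple[str, List[str]]]:
    """Select a data split with the filter, then perturb each selected sentence."""
    selected = [s for s in sentences if data_filter.keep(s)]
    return [(s, transformation.generate(s)) for s in selected]
```

In a robustness analysis of the kind the abstract describes, one would plausibly compare a model's score on the original sentences of the selected split against its score on the perturbed versions; the scoring step is omitted here because it depends on the model and task.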