For many real-world applications, user-generated inputs often contain various noises, such as speech recognition errors caused by linguistic variations or typographical errors (typos). It is therefore crucial to test model performance on data with realistic input noises to ensure robustness and fairness. However, little work has been done to construct such benchmarks for Chinese, where many language-specific input noises occur in practice. To fill this important gap, we construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises. READIN contains four diverse tasks and asks annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input. We designed the annotation pipeline to maximize diversity, for example by instructing annotators to use diverse input method editors (IMEs) for keyboard noises and by recruiting speakers from diverse dialectal groups for speech noises. We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN even with robustness techniques such as data augmentation. As the first large-scale attempt at creating a benchmark with noises geared towards user-generated inputs, we believe that READIN serves as an important complement to existing Chinese NLP benchmarks. The source code and dataset can be obtained from https://github.com/thunlp/READIN.
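To give a concrete sense of the Pinyin-style keyboard noise and the data-augmentation baseline mentioned above, the following is a minimal, hypothetical Python sketch that perturbs training sentences by substituting homophonous characters. It is an illustration only: the homophone table and the function name are assumptions, and the actual READIN noise comes from human annotators re-entering text via real IMEs and speech input, not from an automatic table like this.

```python
import random

# Hypothetical, minimal homophone table for illustration only.
# Real Pinyin-IME typos arise from annotators choosing wrong candidates
# for a given pinyin sequence; this table merely mimics that effect.
HOMOPHONES = {
    "他": ["她", "它"],   # ta
    "在": ["再"],         # zai
    "的": ["地", "得"],   # de
    "是": ["事", "市"],   # shi
}

def augment_with_homophone_noise(sentence: str, prob: float = 0.1,
                                 seed: int = 0) -> str:
    """Randomly replace characters with homophones to simulate Pinyin typos."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch in HOMOPHONES and rng.random() < prob:
            out.append(rng.choice(HOMOPHONES[ch]))
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    original = "他在看书的时候睡着了"
    print(augment_with_homophone_noise(original, prob=0.5))
```

Such synthetic substitutions can be added to the training data as one form of data augmentation, but, as the abstract notes, models trained this way may still drop substantially on the realistic, human-generated noise in READIN.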