The sensitivity of deep neural models to input noise is a known challenge. In NLP, model performance often deteriorates in the presence of naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise have so far been determined arbitrarily. We therefore propose to model errors statistically, estimating them from grammatical-error-correction corpora. We present a thorough evaluation of the robustness of several state-of-the-art NLP systems in multiple languages, on tasks including morpho-syntactic analysis, named entity recognition, neural machine translation, a subset of the GLUE benchmark, and reading comprehension. We also compare two approaches to addressing the performance drop: a) training the NLP models on noised data generated by our framework; and b) reducing the input noise with an external system for natural language correction. The code is released at https://github.com/ufal/kazitext.