Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal, day-to-day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data. One way to resolve this issue is lexical normalization: the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level sequence-to-sequence model based on mBART, framing lexical normalization as a machine translation task. Because noisy text is a pervasive problem across languages, not just English, we leverage the multilingual pre-training of mBART and fine-tune it on our data. While current approaches mainly operate at the word or subword level, we argue that our sentence-level approach is straightforward from a technical standpoint and builds upon existing pre-trained transformer networks. Our results show that, although our model lags behind other methods in intrinsic, word-level evaluation, normalization with our model improves performance on extrinsic, downstream tasks compared to models operating on raw, unprocessed social media text.
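To make the "normalization as translation" framing concrete, here is a minimal sketch of how such a setup could look with the Hugging Face transformers library and the facebook/mbart-large-50 checkpoint. The example sentence pair, language codes, and single training step are illustrative assumptions, not the authors' actual data or configuration.

```python
# A hedged sketch: fine-tuning mBART to map a noisy sentence to its
# normalized form, treating normalization as monolingual "translation".
# The sentence pair and hyperparameters are hypothetical placeholders.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# One (noisy, normalized) training pair; real fine-tuning iterates
# over an annotated lexical-normalization corpus.
noisy = "new pix comming tomoroe"
normalized = "new pictures coming tomorrow"

batch = tokenizer(noisy, text_target=normalized, return_tensors="pt")
loss = model(**batch).loss  # standard seq2seq cross-entropy objective
loss.backward()             # one gradient step (optimizer loop omitted)

# Inference: decode the normalized sentence directly, as in translation.
generated = model.generate(
    **tokenizer(noisy, return_tensors="pt"),
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    max_length=40,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Operating on whole sentences this way avoids per-token alignment machinery, which is the technical simplicity the abstract refers to; the trade-off is that intrinsic word-level metrics are harder to optimize directly.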