Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While some annotated tweet datasets are publicly available, each is purpose-built for a single task. As yet there is no complete training corpus for both syntactic analysis (e.g., part-of-speech tagging, dependency parsing) and NER of tweets. In this study, we create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art NLP models. We first annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We then train a Stanza NER model on the new benchmark, achieving competitive performance against other non-transformer NER systems. Finally, we train other Stanza-based Twitter NLP models (a tokenizer, lemmatizer, part-of-speech tagger, and dependency parser) on TB2, and achieve state-of-the-art or competitive performance on these tasks. We release the dataset and make the models available for "off-the-shelf" use in future Tweet NLP research. Our source code, data, and pre-trained models are available at: https://github.com/social-machines/TweebankNLP.