Recent progress in language model pre-training has led to significant improvements in Named Entity Recognition (NER). Nonetheless, this progress has mainly been tested on well-formatted documents such as news articles, Wikipedia, or scientific papers. The landscape is different in social media, whose noisy and dynamic nature adds another layer of complexity. In this paper, we focus on NER on Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and analyze language model performance on the task, paying special attention to the impact of different time periods. In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to the lack of recently-labeled data. TweetNER7 is released publicly (https://huggingface.co/datasets/tner/tweetner7) along with the models fine-tuned on it.