We present TwHIN-BERT, a multilingual language model trained on in-domain data from the popular social network Twitter. TwHIN-BERT differs from prior pre-trained language models in that it is trained not only with text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation for modeling short, noisy, user-generated text. We evaluate our model on a variety of multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvements over established pre-trained language models. We will freely open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.