Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve performance on time-related tasks. Whereas BERT uses synchronic document collections (BooksCorpus and English Wikipedia) as its training corpora, we build word representations from a long-span temporal collection of news articles. We introduce TimeBERT, a novel language representation model trained on this temporal news collection via two new pre-training tasks that harness two distinct temporal signals to construct time-aware language representations. The experimental results show that TimeBERT consistently outperforms BERT and other existing pre-trained models, with substantial gains on a range of downstream NLP tasks and applications for which time is of high importance.