We explore the problem of predicting the publication period of a text document, such as a news article, from the text of that document alone. To do so, we created our own labeled dataset of over 350,000 news articles published by The New York Times over six decades. We then provide an implementation of a simple Naive Bayes baseline model, which achieves surprisingly decent accuracy. Finally, for our main approach, we fine-tune a pretrained BERT model for text classification. This model exceeded our expectations, classifying news articles into their respective publication decades with high accuracy. Its results surpass those of the few models previously applied to this relatively unexplored task of predicting time from text.
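To make the baseline concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier that assigns a document to a publication decade. It is an illustration under stated assumptions, not the implementation used in this work: the tokenizer, the Laplace smoothing parameter `alpha`, the helper names `decade_label` and `NaiveBayesDecadeClassifier`, and the four toy training snippets with their years are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def decade_label(year):
    """Map a publication year to a decade label, e.g. 1987 -> '1980s'."""
    return f"{year // 10 * 10}s"


def tokenize(text):
    """Lowercase and split on non-letters (a deliberately crude tokenizer)."""
    return "".join(c if c.isalpha() else " " for c in text.lower()).split()


class NaiveBayesDecadeClassifier:
    """Multinomial Naive Bayes with Laplace smoothing (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed)
        self.word_counts = defaultdict(Counter)  # decade -> token counts
        self.doc_counts = Counter()              # decade -> number of documents
        self.vocab = set()

    def fit(self, texts, years):
        for text, year in zip(texts, years):
            label = decade_label(year)
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for illustration; not drawn from the actual dataset.
train_texts = [
    "the senate debated the telegraph and typewriter budget",
    "astronauts landed on the moon as the space race continued",
    "the internet startup launched a website and email service",
    "smartphone apps and social media reshape the news",
]
train_years = [1955, 1969, 1998, 2012]

clf = NaiveBayesDecadeClassifier().fit(train_texts, train_years)
prediction = clf.predict("a new website offers email")
```

Working in log space avoids numerical underflow on long articles, and the smoothing term keeps a single word that is absent from one decade's training text from driving that decade's probability to zero.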