Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying news corpus. Questions are taken from forecasting tournaments, ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future). Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g. global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration. We test language models on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and incorporation of relevant information from the news corpus. In sum, Autocast poses a novel challenge for large language models and improved performance could bring large practical benefits.
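The leakage-avoidance setup described above (only showing a model news published before the forecast date) can be sketched minimally as follows; the corpus structure and article contents here are hypothetical, not the actual Autocast data format:

```python
from datetime import date

# Hypothetical minimal corpus: (publication_date, text) pairs,
# standing in for the date-organized news corpus.
corpus = [
    (date(2021, 3, 1), "Article A"),
    (date(2021, 9, 15), "Article B"),
    (date(2022, 1, 10), "Article C"),
]

def articles_before(corpus, cutoff):
    """Return only articles published strictly before the cutoff date,
    so a model forecasting as of `cutoff` never sees future information."""
    return [text for pub_date, text in corpus if pub_date < cutoff]

# Retrieval for a question whose forecast date is 2021-12-31
# excludes Article C, which was published afterward.
print(articles_before(corpus, date(2021, 12, 31)))
```

Restricting retrieval this way is what lets past human forecasting conditions be simulated faithfully: the model sees exactly the information that was publicly available at the time.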