Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go, StarCraft, or DOTA have relied on both simulated environments and large-scale datasets. However, progress in this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost of working with them. Here we present the NetHack Learning Dataset (NLD), a large and highly scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic-bot winner of the NetHack Challenge 2021; and accompanying code for users to record, load, and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms, including online and offline RL as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.
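As a minimal sketch of how the accompanying code can be used to load and stream trajectories, the snippet below follows the nle.dataset interface distributed with the NetHack Learning Environment; the dataset names and local paths are hypothetical placeholders and may differ from a given installation.

```python
# Sketch: registering NLD data and streaming minibatches of ttyrec frames.
# Assumes the nle.dataset module from the NetHack Learning Environment;
# paths below are placeholders for locally downloaded NLD archives.
import nle.dataset as nld

if not nld.db.exists():
    nld.db.create()
    # NLE-generated data (NLD-AA) and NAO server data (NLD-NAO) are
    # registered with different helpers.
    nld.add_nledata_directory("/path/to/nld-aa", "nld-aa-v0")
    nld.add_altorg_directory("/path/to/nld-nao", "nld-nao-v0")

# Stream compressed trajectories as minibatches of shape
# (batch_size, seq_length) per key (terminal characters, colors, etc.).
dataset = nld.TtyrecDataset("nld-aa-v0", batch_size=128, seq_length=32)
for minibatch in dataset:
    pass  # feed minibatch to an offline RL or imitation learning pipeline
```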