Game engines are powerful tools in computer graphics, but their power comes at the immense cost of their development. In this work, we present a framework to train game-engine-like neural models solely from annotated monocular videos. The result, a Learnable Game Engine (LGE), maintains the states of the scene and of the objects and agents in it, and enables rendering the environment from a controllable viewpoint. Like a game engine, it models the logic of the game and the underlying rules of physics, so that a user can play the game by specifying both high- and low-level action sequences. Most captivatingly, our LGE unlocks the director's mode, where the game is played by plotting behind the scenes, specifying high-level actions and goals for the agents in the form of language and desired states. This requires learning "game AI", encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. The key to learning such game AI is the exploitation of a large and diverse text corpus, collected in this work, that describes detailed actions in a game and is used to train our animation model. To render the resulting state of the environment and its agents, our synthesis model uses a compositional NeRF representation. To foster future research, we present newly collected, annotated and calibrated large-scale Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality. Moreover, our LGEs unlock applications beyond the capabilities of the current state of the art. Our framework, data, and models are available at https://learnable-game-engines.github.io/lge-website.