Self-supervised learning (SSL) techniques have been widely used to learn compact and informative representations from high-dimensional, complex data. In many computer vision tasks, such as image classification, these methods achieve state-of-the-art results that surpass supervised learning approaches. In this paper, we investigate whether SSL methods can be leveraged to learn accurate state representations of games and, if so, to what extent. For this purpose, we collect game footage frames and the corresponding sequences of each game's internal state from three different 3D games: VizDoom, the CARLA racing simulator and the Google Research Football Environment. We train an image encoder with three widely used SSL algorithms using solely the raw frames, and then attempt to recover the internal state variables from the learned representations. Our results across all three games show significantly higher correlation between SSL representations and the game's internal state than is obtained with baseline models pre-trained on datasets such as ImageNet. These findings suggest that SSL-based visual encoders can yield general (not tailored to a specific task) yet informative game representations solely from game pixel information. These representations can, in turn, form the basis for boosting the performance of downstream learning tasks in games, including gameplaying, content generation and player modeling.
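
To make the evaluation idea concrete, the following is a minimal sketch of one simple way to measure how strongly frozen encoder outputs correlate with a single internal state variable. It is an illustration under stated assumptions, not the protocol used in the paper: the scoring rule (largest absolute Pearson correlation over embedding dimensions), the function names `pearson` and `best_abs_correlation`, the state variable `health`, and all data are hypothetical and synthetic.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant signal carries no linear information
    return cov / math.sqrt(var_x * var_y)


def best_abs_correlation(embeddings, state_values):
    """Largest |Pearson r| between any embedding dimension and one state variable.

    embeddings: list of per-frame feature vectors (frozen encoder outputs).
    state_values: one scalar state value per frame, aligned with embeddings.
    """
    n_dims = len(embeddings[0])
    return max(
        abs(pearson([frame[d] for frame in embeddings], state_values))
        for d in range(n_dims)
    )


# Synthetic stand-in data (hypothetical): 200 frames, 4-dimensional embeddings.
rng = random.Random(0)
health = [rng.uniform(0.0, 100.0) for _ in range(200)]  # hypothetical state variable

# "Informative" embeddings: dimension 1 is a noisy linear function of the state.
informative = [
    [rng.gauss(0, 1), 0.05 * h + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)]
    for h in health
]
# "Uninformative" embeddings: pure noise, unrelated to the state.
uninformative = [[rng.gauss(0, 1) for _ in range(4)] for _ in health]

score_informative = best_abs_correlation(informative, health)
score_uninformative = best_abs_correlation(uninformative, health)
```

In practice the embeddings would come from the trained image encoder applied to game frames, and a fitted probe (for example, a linear regressor per state variable) may be a more faithful measure than a per-dimension correlation; the per-dimension score is used here only because it needs no external libraries.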