To generalize across tasks, an agent should acquire knowledge from past tasks that facilitates adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent relies only on context, i.e., a history of states, actions, and/or rewards, rather than gradient-based updates. Posterior sampling (an extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap for dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.
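For concreteness, the posterior-sampling loop referred to above proceeds by sampling a candidate model from the current posterior, solving it with dynamic programming, acting greedily for an episode, and then updating the posterior from the observed data. The sketch below is a minimal illustration of that loop, assuming a finite hypothesis space of small tabular MDPs and a generic `env` interface; all names are hypothetical, and in the proposed method the explicit Bayesian update shown here is replaced by an inference process learned by a transformer from training tasks.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, n_iters=200):
    """Solve a small tabular MDP; P: [S, A, S'] transitions, R: [S, A] rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q                           # greedy policy: argmax_a Q[s, a]

def posterior_sample_episode(env, hypotheses, log_post, gamma=0.95):
    """One episode of posterior sampling over a finite set of candidate (partial) MDPs.

    hypotheses: list of (P, R) pairs, each a small tabular MDP.
    log_post:   log posterior weights over hypotheses, accumulated from past data.
    """
    # Sample one hypothesis in proportion to its posterior probability.
    probs = np.exp(log_post - log_post.max())
    probs /= probs.sum()
    k = np.random.choice(len(hypotheses), p=probs)
    P, R = hypotheses[k]

    # Plan in the sampled model (cheap, since the MDP is small), then act greedily.
    Q = value_iteration(P, R, gamma)
    s, done = env.reset(), False
    transitions = []
    while not done:
        a = int(Q[s].argmax())
        s_next, r, done = env.step(a)
        transitions.append((s, a, s_next))
        s = s_next

    # Bayesian update: re-weight each hypothesis by the likelihood of the observed
    # transitions (reward likelihoods omitted here for brevity).
    for i, (P_i, _) in enumerate(hypotheses):
        for (s, a, s_next) in transitions:
            log_post[i] += np.log(P_i[s, a, s_next] + 1e-12)
    return log_post
```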