Exploration methods based on pseudo-counts of transitions or curiosity about dynamics have achieved promising results in solving reinforcement learning tasks with sparse rewards. However, such methods are usually sensitive to information that is irrelevant to the environment dynamics, e.g., white noise. To filter out such dynamics-irrelevant information, we propose a Dynamic Bottleneck (DB) model, which learns a dynamics-relevant representation based on the information-bottleneck principle. Building on the DB model, we further propose the DB-bonus, which encourages the agent to explore state-action pairs with high information gain. We establish theoretical connections between the proposed DB-bonus, the upper confidence bound (UCB) in the linear case, and the visiting count in the tabular case. We evaluate the proposed method on the Atari suite with dynamics-irrelevant noise injected. Our experiments show that exploration with the DB-bonus outperforms several state-of-the-art exploration methods in noisy environments.
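To make the idea of an information-gain exploration bonus concrete, the sketch below illustrates one common way such a bonus can be computed: a stochastic encoder maps a state-action pair to a latent code, and the KL divergence between the encoder's posterior and a standard Gaussian prior serves as a proxy for how much dynamics-relevant information the pair carries. This is a minimal, hypothetical sketch, not the authors' implementation; the `Encoder` architecture and the Gaussian-KL bonus are illustrative assumptions.

```python
# Hypothetical sketch of an information-bottleneck-style exploration
# bonus; NOT the paper's exact DB-bonus.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stochastic encoder q(z | s, a): outputs mean and log-variance
    of a diagonal Gaussian over the latent code z."""
    def __init__(self, obs_dim, act_dim, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * z_dim),
        )

    def forward(self, s, a):
        mu, logvar = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, logvar

def info_gain_bonus(mu, logvar):
    """KL( q(z|s,a) || N(0, I) ): a larger value means the encoder
    stores more information about (s, a) in the latent code, so the
    pair receives a larger exploration bonus."""
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1)

# Usage on a dummy batch (obs_dim=8, act_dim=2 are arbitrary choices).
s = torch.randn(4, 8)
a = torch.randn(4, 2)
enc = Encoder(obs_dim=8, act_dim=2)
bonus = info_gain_bonus(*enc(s, a))  # one scalar bonus per (s, a) pair
```

In practice such a bonus would be added to the environment reward during training; because the encoder is trained to retain only dynamics-relevant information, dynamics-irrelevant noise in the observation contributes little to the bonus.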