Contextual bandits are widely used to study learning-based control policies for finite action spaces. While the problem is well studied for bandits with perfectly observed context vectors, little is known about the case of imperfectly observed contexts. For this setting, existing approaches are inapplicable and new conceptual and technical frameworks are required. We present an implementable posterior sampling algorithm for bandits with imperfect context observations and study its performance in learning optimal decisions. The provided numerical results relate the algorithm's performance to different quantities of interest, including the number of arms, dimensions, observation matrices, posterior rescaling factors, and signal-to-noise ratios. Overall, the proposed algorithm learns efficiently from noisy, imperfect observations and takes effective actions accordingly. We also discuss the insights the analyses provide, as well as the interesting future directions they point to.
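To make the setting concrete, below is a minimal simulation sketch of posterior (Thompson) sampling with imperfectly observed contexts. It assumes a Gaussian linear model in which the learner never sees the true context x_t, only a noisy observation y_t = A x_t + v_t, and acts on the conditional estimate of the context. All dimensions, noise levels, and names (A, sigma_obs, alpha, etc.) are illustrative assumptions, not the paper's exact algorithm or experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical problem instance (illustrative values, not the paper's) ---
N, d_x, d_y, T = 5, 3, 3, 2000               # arms, context dim, observation dim, horizon
A = rng.standard_normal((d_y, d_x))          # observation matrix (known to the learner)
mu = rng.standard_normal((N, d_x))           # unknown per-arm reward parameters
sigma_obs, sigma_rwd = 0.5, 0.5              # observation / reward noise std (sets the SNR)
alpha = 1.0                                  # posterior rescaling factor

# Gaussian conditioning: with x ~ N(0, I) and y = A x + v, v ~ N(0, sigma_obs^2 I),
# the context estimate is E[x | y] = A' (A A' + sigma_obs^2 I)^{-1} y.
K = A.T @ np.linalg.inv(A @ A.T + sigma_obs**2 * np.eye(d_y))

# Per-arm conjugate Gaussian (Bayesian linear regression) statistics
B = np.stack([np.eye(d_x)] * N)              # posterior precision matrices
f = np.zeros((N, d_x))                       # reward-weighted feature sums

for t in range(T):
    x = rng.standard_normal(d_x)             # true context (never revealed to the learner)
    y = A @ x + sigma_obs * rng.standard_normal(d_y)
    x_hat = K @ y                            # estimated context from the noisy observation

    # Thompson sampling: draw a parameter from each arm's (rescaled) posterior
    samples = np.empty(N)
    for a in range(N):
        mean = np.linalg.solve(B[a], f[a])
        cov = alpha**2 * np.linalg.inv(B[a])
        samples[a] = rng.multivariate_normal(mean, cov) @ x_hat
    a = int(np.argmax(samples))              # act greedily on the sampled parameters

    r = mu[a] @ x + sigma_rwd * rng.standard_normal()  # reward depends on the true context
    B[a] += np.outer(x_hat, x_hat)           # posterior update uses the context estimate
    f[a] += r * x_hat
```

Under this sketch, the quantities the abstract varies map directly onto N, d_x, d_y, A, alpha, and the noise ratio sigma_obs/sigma_rwd, so sweeping them reproduces the kind of sensitivity study the numerical results describe.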