We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments -- a fundamental challenge in applications such as inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm, carried over to this MDP setting, achieves performance comparable to that obtained with the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.
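For concreteness, one rough way to formalize the selection objective described above is the following sketch; the notation here ($\mathcal{M}$ for the environment family, $\mathcal{A}$ for the shared action space, $V_M(a)$ for the value of committing to action $a$ in environment $M$, $k$ for the subset budget, and $\epsilon$ for the suboptimality tolerance) is illustrative and not drawn from the paper itself:
\[
\text{find } \mathcal{S} \subseteq \mathcal{A},\ |\mathcal{S}| \le k,
\quad \text{such that} \quad
\sup_{M \in \mathcal{M}} \Big( \sup_{a \in \mathcal{A}} V_M(a) \;-\; \max_{a \in \mathcal{S}} V_M(a) \Big) \;\le\; \epsilon .
\]
Under this reading, the guarantee stated in the abstract says that learning restricted to $\mathcal{S}$ loses at most $\epsilon$ relative to learning over all of $\mathcal{A}$, uniformly over the environment family.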