This paper studies reinforcement learning (RL) in doubly inhomogeneous environments, that is, under both temporal non-stationarity and subject heterogeneity. In many applications, datasets are generated by system dynamics that change over time and across the population, which challenges high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, and yield sub-optimal policies when both assumptions are violated. To address both challenges simultaneously, we propose an original algorithm that determines the ``best data chunks'', which display similar dynamics over time and across individuals, for policy learning; it alternates between most recent change point detection and cluster identification. Our method is general and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and requires only one of them to be consistent. Moreover, by borrowing information over time and across the population, it can detect weaker signals and enjoys better convergence properties than applying the clustering algorithm at each time point or the change point detection algorithm to each subject separately. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.
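The alternating scheme described above can be illustrated with a minimal sketch. This is not the paper's estimator: the mean-shift model, the least-squares change point scan, the one-dimensional k-means step, and all helper names (`detect_change_point`, `cluster_subjects`, `alternate`) are illustrative assumptions, standing in for whatever change point detection and clustering algorithms are plugged in.

```python
# Illustrative sketch (assumed model, not the paper's method): alternate between
# (i) detecting a change point within each current cluster and (ii) re-clustering
# subjects by their post-change dynamics, starting from a crude initial clustering.
import numpy as np

def detect_change_point(Y):
    """Least-squares scan for a single mean shift in pooled series Y (n x T)."""
    T = Y.shape[1]
    best_t, best_sse = 1, np.inf
    for t in range(1, T):
        left, right = Y[:, :t], Y[:, t:]
        sse = left.size * left.var() + right.size * right.var()
        if sse < best_sse:
            best_sse, best_t = sse, t
    return best_t

def cluster_subjects(Y, tau, K=2, n_iter=20, seed=0):
    """1-D k-means on each subject's post-change mean (a stand-in clusterer)."""
    rng = np.random.default_rng(seed)
    feats = Y[:, tau:].mean(axis=1)
    centers = rng.choice(feats, size=K, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(feats[:, None] - centers[None, :]), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean()
    return labels

def alternate(Y, K=2, n_rounds=5):
    labels = np.arange(Y.shape[0]) % K  # crude (mixed) initial clustering
    for _ in range(n_rounds):
        # (i) change point per cluster; (ii) re-cluster on the stable segment
        taus = [detect_change_point(Y[labels == k]) for k in np.unique(labels)]
        tau = max(taus)  # segment after the latest estimated change is stable
        labels = cluster_subjects(Y, tau, K=K)
    return tau, labels

# toy data: 20 subjects, shared change at t=30, two post-change clusters
rng = np.random.default_rng(1)
Y = rng.normal(0.0, 0.5, size=(20, 60))
Y[:10, 30:] += 2.0   # cluster A shifts up by 2 after the change point
Y[10:, 30:] += 4.0   # cluster B shifts up by 4 after the change point
tau, labels = alternate(Y)
```

In this toy example the alternation recovers the change point near t = 30 and separates the two post-change clusters, even though the initial clustering mixes the groups; in practice the scan and the clusterer would be replaced by the user's preferred algorithms.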