We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, in which users can be divided into two groups according to their tolerance for exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies, $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with the more risk-tolerant users in the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") focuses exclusively on exploitation for the risk-averse users in the second tier, using the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that choosing Pessimistic Value Iteration as the exploitation algorithm that produces $\pi^{\text{E}}$ achieves constant regret for risk-averse users, independent of the number of episodes $K$. This is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting. Meanwhile, the regret of $\pi^{\text{O}}$ (almost) maintains its online optimality, so $\pi^{\text{O}}$ does not need to compromise for the success of $\pi^{\text{E}}$.
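To make the two-policy interaction concrete, the following is a minimal sketch in a deliberately simplified setting: a Bernoulli multi-armed bandit rather than an episodic MDP. It is not the algorithm analyzed in this work. Here a UCB rule stands in for $\pi^{\text{O}}$ and a lower-confidence-bound (pessimistic) rule stands in for $\pi^{\text{E}}$, as a bandit analogue of Pessimistic Value Iteration. The function name `run_tiered_bandit`, the bonus $\sqrt{2\log t / n}$, and the choice to record only the data gathered by $\pi^{\text{O}}$ are all assumptions made for illustration.

```python
import math
import random


def run_tiered_bandit(means, num_rounds, seed=0):
    """Illustrative tiered bandit (hypothetical helper, not from the paper).

    pi^O: optimistic (UCB) policy serving risk-tolerant users.
    pi^E: pessimistic (LCB) policy serving risk-averse users, acting only
          on data gathered by pi^O (an assumption of this sketch).
    Returns (pseudo-regret of pi^O, pseudo-regret of pi^E, pull counts).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # pulls made by pi^O per arm
    sums = [0.0] * n_arms   # total reward observed by pi^O per arm
    best = max(means)
    regret_online = 0.0
    regret_exploit = 0.0

    def bonus(arm, t):
        # Hoeffding-style confidence width; the constant is an assumption.
        return math.sqrt(2.0 * math.log(max(t, 2)) / counts[arm])

    def ucb(arm, t):
        if counts[arm] == 0:
            return math.inf   # optimism: unseen arms look as good as possible
        return sums[arm] / counts[arm] + bonus(arm, t)

    def lcb(arm, t):
        if counts[arm] == 0:
            return -math.inf  # pessimism: unseen arms look as bad as possible
        return sums[arm] / counts[arm] - bonus(arm, t)

    for t in range(1, num_rounds + 1):
        # pi^E exploits the data collected so far, without exploring.
        arm_e = max(range(n_arms), key=lambda a: lcb(a, t))
        regret_exploit += best - means[arm_e]

        # pi^O balances exploration and exploitation as usual.
        arm_o = max(range(n_arms), key=lambda a: ucb(a, t))
        regret_online += best - means[arm_o]

        # Only pi^O's interaction is added to the shared dataset.
        reward = 1.0 if rng.random() < means[arm_o] else 0.0
        counts[arm_o] += 1
        sums[arm_o] += reward

    return regret_online, regret_exploit, counts


if __name__ == "__main__":
    r_o, r_e, pulls = run_tiered_bandit([0.9, 0.5], num_rounds=2000)
    print(f"pi^O regret: {r_o:.2f}, pi^E regret: {r_e:.2f}, pulls: {pulls}")
```

Running the script prints the cumulative pseudo-regret of both policies on a two-armed instance with a fixed gap. The sketch only illustrates how the two tiers share data; it makes no claim to reproduce the regret guarantees stated above.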