Online Action Learning in High Dimensions: A Conservative Perspective

Sequential learning problems are common in several fields of research and in practical applications. Examples include dynamic pricing and assortment and the design of auctions and incentives, and such problems permeate a large number of sequential treatment experiments. In this paper, we extend one of the most popular learning solutions, the $\epsilon_t$-greedy heuristic, to high-dimensional contexts under a conservative directive. We do this by reallocating part of the time the original rule spends adopting completely new actions to a more focused search within a restricted set of promising actions. The resulting rule may be useful for practical applications that still value surprises, although at a decreasing rate, while also restricting the adoption of unusual actions. We establish reasonable bounds, holding with high probability, on the cumulative regret of a conservative high-dimensional decaying $\epsilon_t$-greedy rule. We also provide a lower bound on the cardinality of the set of viable actions that implies an improved regret bound for the conservative version compared with its non-conservative counterpart. Additionally, we show that end users have sufficient flexibility in choosing how much safety they want, since it can be tuned without affecting the theoretical properties. We illustrate our proposal both in a simulation exercise and on a real dataset.
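The split between exploitation, focused search, and fully random exploration can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's algorithm: the function name `conservative_eps_greedy`, the schedule $\epsilon_t = \min(1, \epsilon_0 / t)$, the `safety` fraction, and the definition of the promising set as the `top_k` actions by estimated reward are all hypothetical choices. The high-dimensional reward estimation step is omitted, and the estimates are taken as given.

```python
import random


def conservative_eps_greedy(estimates, t, rng, eps0=1.0, safety=0.5, top_k=3):
    """Pick an action index at round t >= 1.

    estimates : estimated reward per action (assumed given; the
                high-dimensional estimation step is not shown here).
    eps0, safety, top_k : hypothetical tuning knobs for illustration.
    """
    n = len(estimates)
    # Assumed decaying schedule: exploration probability shrinks as 1/t.
    eps_t = min(1.0, eps0 / t)

    u = rng.random()
    if u >= eps_t:
        # Exploit: play the action with the highest estimated reward.
        return max(range(n), key=lambda a: estimates[a])

    # Explore. A fraction `safety` of the exploration budget goes to a
    # focused search among the top_k actions (the "promising" set).
    if u < eps_t * safety:
        ranked = sorted(range(n), key=lambda a: estimates[a], reverse=True)
        return rng.choice(ranked[:top_k])

    # The remaining exploration budget is spent uniformly over all actions.
    return rng.randrange(n)
```

With `safety=0.0` the sketch reduces to a plain decaying $\epsilon_t$-greedy rule, and with `safety=1.0` every exploratory draw stays inside the promising set, which mirrors the idea that the degree of safety is a tunable quantity.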
