Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
翻译:基于对在线决策的担忧,在每一步的每个步骤都产生过多的风险,本文件中我们提出了很可能在任何时间都安全时时会分心的组合组合半波形问题。在这个问题上,代理商可以选择从一组美元地面项目中选择一个规模的子集,最多为K美元。每个项目都与某种中值奖励有关,并且存在代表风险的差异。为了减轻该代理商所冒的风险,我们要求在整个时间范围内,在可能情况下至少需要1美元(delta)美元,该代理商作出的每一个选择应该包含差异总和不超过某种差异预算的物品。在这种限制下,我们要求该代理商选择从一个安全时段的限制。在这种限制下,我们设计和分析算法(sc PASCombUB})将时间前景上的遗憾降到最低。通过开发信息-理论下限,我们要求在整个时间范围内,在问题依赖和问题依赖的范式下,我们显示该代理商作出的每一项风险几乎都是在时间跨轨道上,我们所要选择的系统,我们所选择的系统是一个可应用的跨时间段,我们所选择的系统。