In sequential decision-making scenarios such as mobile health, recommendation systems, and revenue management, contextual multi-armed bandit algorithms have garnered attention for their strong performance. However, most existing algorithms assume a strictly parametric, typically linear, reward model. In this work we propose a new algorithm for a semi-parametric reward model whose regret upper bound matches the best known among existing semi-parametric algorithms. Our work extends the scope of a representative state-of-the-art algorithm for a similar reward model: building on the same action-filtering procedure, our algorithm provides an explicit action-selection distribution for time steps involving more than two arms while requiring fewer computations. We derive the stated upper bound on regret and present simulation results affirming that our method outperforms all prevalent semi-parametric bandit algorithms in cases involving more than two arms.
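For concreteness, a minimal sketch of the semi-parametric reward model commonly studied in this literature, written in the notation typical of such work; this is an assumption for illustration, and the paper's exact formulation may differ in details:

\[
    r_t(a) \;=\; \nu_t \;+\; x_t(a)^{\top} \mu^{*} \;+\; \eta_t(a),
\]
% Here \nu_t is an arbitrary, possibly adversarial, time-varying intercept
% shared across all arms (the non-parametric component), x_t(a) is the
% observed context of arm a at time t, \mu^* is the unknown linear
% parameter, and \eta_t(a) is zero-mean noise. Cumulative regret is then
% measured against the oracle that plays the best linear arm each round:
\[
    R(T) \;=\; \sum_{t=1}^{T} \Big( \max_{a}\, x_t(a)^{\top} \mu^{*} \;-\; x_t(a_t)^{\top} \mu^{*} \Big),
\]
% which is well defined because \nu_t cancels between the oracle's arm and
% the played arm a_t.

Because \(\nu_t\) cancels in the regret, a semi-parametric algorithm only needs to estimate \(\mu^*\) well despite the shared confounding term, which is what the action-filtering and randomized action-selection steps referenced above are designed to achieve.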