The problem of detecting an odd arm from a set of K arms of a multi-armed bandit, with fixed confidence, is studied in a sequential decision-making scenario. Each arm's signal follows a distribution from a vector exponential family. All arms have the same parameters except the odd arm. The actual parameters of the odd and non-odd arms are unknown to the decision maker. Further, the decision maker incurs a cost for switching from one arm to another. The decision maker gets only a limited view of the true state of nature at each stage, but can control this view by choosing which arm to observe at each stage. Of interest are policies that satisfy a given constraint on the probability of false detection. An information-theoretic lower bound on the total cost (expected time for a reliable decision plus total switching cost) is first identified, and a variation on a sequential policy based on the generalised likelihood ratio statistic is then studied. Thanks to the vector exponential family assumption, the signal processing in this policy at each stage turns out to be very simple, in that the associated conjugate prior enables easy updates of the posterior distribution of the model parameters. The policy, with a suitable threshold, is shown to satisfy the given constraint on the probability of false detection. Further, the proposed policy is asymptotically optimal in terms of the total cost among all policies that satisfy the constraint on the probability of false detection.
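The ingredients named above (conjugate-prior updates, a generalised likelihood ratio statistic, and a total cost that includes switching) can be illustrated with a minimal Python sketch. It is not the policy studied in the paper. It assumes unit-variance Gaussian arms with unknown means, an improper flat prior (zero pseudo-counts), and a plain maximum-likelihood GLR; all function names are hypothetical, and no stopping threshold is specified because the abstract does not give one.

```python
# Illustrative sketch only; all names and modelling choices are assumptions.

def sufficient_stat(x):
    """Sufficient statistic T(x) = (x, x^2) of a Gaussian observation."""
    return (x, x * x)


def update(hyper, x):
    """Conjugate update: add one to the count and T(x) to the running sum."""
    n, tau = hyper
    t = sufficient_stat(x)
    return (n + 1, tuple(a + b for a, b in zip(tau, t)))


def residual_ss(n, s, q):
    """Sum of squared deviations from the sample mean, from (n, sum x, sum x^2)."""
    return q - s * s / n if n > 0 else 0.0


def max_loglik_if_odd(hypers, i):
    """Maximised log-likelihood (up to a constant) under 'arm i is odd'.

    Assumes unit variance: arm i gets its own mean, and all other arms
    share one pooled mean.
    """
    n_i, (s_i, q_i) = hypers[i]
    rest = [h for k, h in enumerate(hypers) if k != i]
    n_r = sum(n for n, _ in rest)
    s_r = sum(tau[0] for _, tau in rest)
    q_r = sum(tau[1] for _, tau in rest)
    return -0.5 * (residual_ss(n_i, s_i, q_i) + residual_ss(n_r, s_r, q_r))


def glr_statistic(hypers):
    """Return (most likely odd arm, log-likelihood gap to the runner-up)."""
    scores = [max_loglik_if_odd(hypers, i) for i in range(len(hypers))]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    best, runner_up = order[0], order[1]
    return best, scores[best] - scores[runner_up]


def total_cost(pulls, switch_cost):
    """Number of samples plus switch_cost for every change of arm."""
    switches = sum(1 for a, b in zip(pulls, pulls[1:]) if a != b)
    return len(pulls) + switch_cost * switches


# Usage: three arms, with arm 2 drawn around a different mean.
observations = {0: [0.0, 0.1, -0.1], 1: [0.05, -0.05, 0.0], 2: [2.0, 2.1, 1.9]}
hypers = [(0, (0.0, 0.0)) for _ in range(3)]  # flat prior: zero pseudo-counts
for arm, xs in observations.items():
    for x in xs:
        hypers[arm] = update(hypers[arm], x)
best_arm, gap = glr_statistic(hypers)
```

A sequential policy would compare `gap` against a threshold chosen from the false-detection constraint and keep sampling until it is exceeded, while `total_cost` shows how staying on the same arm for consecutive pulls reduces the switching component of the cost.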