We extend the adversarial (non-stochastic) multi-play multi-armed bandit (MPMAB) to the case where the number of arms played per round is variable. The work is motivated by the observation that the resources allocated to scanning different critical locations in an interconnected transportation system change dynamically over time and with the environment. Modeling the malicious hacker as the attacker and the intrusion monitoring system as the defender, we formulate the problem as a sequential pursuit-evasion game between the two players. We derive the condition under which a Nash equilibrium of the strategic game exists. For the defender, we provide an exponential-weighting-based algorithm with sublinear pseudo-regret. We further extend our model to heterogeneous rewards for both players and obtain lower and upper bounds on the attacker's average reward. We provide numerical experiments to demonstrate the effectiveness of variable-arm play.
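To make the defender's setting concrete, the following is a minimal sketch of a generic exponential-weights learner that plays a variable number of arms per round. It is not the algorithm from this work: the function names (`capped_marginals`, `systematic_sample`, `run_variable_play_exp3`), the loss-based formulation, the learning rate `eta`, and the use of systematic sampling to pick exactly `k` arms are all assumptions chosen for illustration.

```python
import math
import random


def capped_marginals(weights, k):
    """Inclusion probabilities proportional to weights, capped at 1, summing to k."""
    n = len(weights)
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= number of arms")
    p = [0.0] * n
    capped = set()
    while True:
        free = [i for i in range(n) if i not in capped]
        remaining = k - len(capped)
        total = sum(weights[i] for i in free)
        # Arms whose proportional share would exceed 1 are played with certainty.
        newly = [i for i in free if remaining * weights[i] / total > 1.0]
        if not newly:
            for i in free:
                p[i] = remaining * weights[i] / total
            for i in capped:
                p[i] = 1.0
            return p
        capped.update(newly)


def systematic_sample(p, rng):
    """Pick arms so that arm i is included with probability p[i] (sum(p) arms in total)."""
    u = rng.random()
    chosen, cum = [], 0.0
    for i, pi in enumerate(p):
        # Arm i is chosen if one of the points u, u+1, u+2, ... lands in (cum, cum + pi].
        if math.floor(cum + pi - u) > math.floor(cum - u):
            chosen.append(i)
        cum += pi
    return chosen


def run_variable_play_exp3(n_arms, plays_per_round, loss_fn, eta, seed=0):
    """Hypothetical helper: exponential weights with k_t plays in round t.

    loss_fn(t) returns a list of losses in [0, 1]; only the chosen entries are used,
    which mimics bandit feedback.
    """
    rng = random.Random(seed)
    log_w = [0.0] * n_arms
    total_loss, history = 0.0, []
    for t, k in enumerate(plays_per_round):
        m = max(log_w)
        # Clamp the exponent so that no weight underflows to exactly zero.
        weights = [math.exp(max(lw - m, -700.0)) for lw in log_w]
        p = capped_marginals(weights, k)
        chosen = systematic_sample(p, rng)
        losses = loss_fn(t)
        for i in chosen:
            total_loss += losses[i]
            # Importance-weighted loss estimate keeps the update unbiased.
            log_w[i] -= eta * losses[i] / p[i]
        history.append(chosen)
    return total_loss, history


if __name__ == "__main__":
    ks = [1 + (t % 3) for t in range(300)]  # number of plays varies between 1 and 3
    total, hist = run_variable_play_exp3(
        5, ks, lambda t: [0.1, 0.9, 0.9, 0.9, 0.5], eta=0.05
    )
    print(total, hist[-1])
```

Capping the marginals at 1 is what lets the number of plays change from round to round: the same weight vector is renormalized to sum to the current `k`, and any arm whose share would exceed 1 is simply played with certainty.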