In this paper, we study the MNL-Bandit problem in a non-stationary environment and present an algorithm with worst-case dynamic regret of $\tilde{O}\left( \min \left\{ \sqrt{NTL}\;,\; N^{\frac{1}{3}}(\Delta_{\infty}^{K})^{\frac{1}{3}} T^{\frac{2}{3}} + \sqrt{NT}\right\}\right)$. Here $N$ is the number of arms, $L$ is the number of switches, and $\Delta_{\infty}^{K}$ is a variation measure of the unknown parameters. We also show that our algorithm is near-optimal (up to logarithmic factors). Our algorithm builds upon the epoch-based algorithm for the stationary MNL-Bandit problem in Agrawal et al. (2016). However, non-stationarity poses several challenges, and we introduce new techniques and ideas to address them. In particular, we give a tight characterization of the bias that non-stationarity introduces in the estimators and derive new concentration bounds.
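To make the two regimes of this bound concrete (the following comparison is only a direct reading of the expression above, with logarithmic factors suppressed):
\[
L = O(1) \;\Longrightarrow\; \sqrt{NTL} = O\big(\sqrt{NT}\big),
\qquad
L \;\gtrsim\; \frac{(\Delta_{\infty}^{K})^{\frac{2}{3}}\, T^{\frac{1}{3}}}{N^{\frac{1}{3}}} \;\Longrightarrow\; \sqrt{NTL} \;\gtrsim\; N^{\frac{1}{3}} (\Delta_{\infty}^{K})^{\frac{1}{3}} T^{\frac{2}{3}},
\]
so the first term recovers the stationary $\tilde{O}(\sqrt{NT})$ rate when the number of switches is constant, while the second term attains the minimum when switches are frequent but the total variation $\Delta_{\infty}^{K}$ is small.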