In this paper, we investigate the non-stationary combinatorial semi-bandit problem, in both the switching case and the dynamic case. In the general setting where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only an approximate offline oracle exists \cite{wang2017improving}, our algorithm achieves $\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ distribution-dependent regret in the switching case and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case, where $\mathcal S$ is the number of switches and $\mathcal V$ is the total amount of ``distribution change''. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter $\mathcal S$ or $\mathcal V$ in advance. We further show that, by employing another technique, our algorithm no longer needs to know $\mathcal S$ or $\mathcal V$, though the resulting regret bounds may become suboptimal. In the special case where the reward function is linear and an exact oracle is available, we design a parameter-free algorithm that achieves nearly optimal regret in both the switching case and the dynamic case without knowing the parameters in advance.