In this paper, we study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. In addition to such partial biased feedback, we face two further practical challenges: communication cost and computation complexity. To tackle these challenges, we carefully design a new \emph{distributed phase-then-batch-based elimination (\texttt{DPBE})} algorithm, which samples users in phases to collect feedback and reduce the bias, and employs \emph{maximum variance reduction} to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that \texttt{DPBE} achieves a sublinear regret of $\tilde{O}(T^{1-\alpha/2}+\sqrt{\gamma_T T})$, where $\alpha\in (0,1)$ is a tunable user-sampling parameter and $\gamma_T$ is the maximal information gain after $T$ rounds. Moreover, \texttt{DPBE} significantly reduces both communication cost and computation complexity in distributed kernelized bandits, compared to variants of state-of-the-art algorithms originally developed for standard kernelized bandits. Furthermore, by incorporating various \emph{differential privacy} models (including the central, local, and shuffle models), we generalize \texttt{DPBE} to provide privacy guarantees for users participating in the distributed learning process. Finally, we conduct extensive simulations to validate our theoretical results and evaluate the empirical performance of \texttt{DPBE}.
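To make the batch-selection and elimination steps concrete, they can be sketched with the standard kernel ridge regression (Gaussian-process) posterior; the regularizer $\lambda$, confidence width $\beta_\ell$, and active action set $\mathcal{A}_\ell$ below are illustrative placeholders, and the exact estimator used by \texttt{DPBE} may differ. Given the actions queried in phase $\ell$ with kernel matrix $K_\ell$, aggregated feedback vector $y_\ell$, and kernel vector $k_\ell(x)=(k(x,x_1),\dots,k(x,x_{n_\ell}))^\top$, the posterior mean and variance are
\[
\mu_\ell(x) = k_\ell(x)^\top \bigl(K_\ell + \lambda I\bigr)^{-1} y_\ell, \qquad
\sigma_\ell^2(x) = k(x,x) - k_\ell(x)^\top \bigl(K_\ell + \lambda I\bigr)^{-1} k_\ell(x).
\]
Within a phase, maximum variance reduction repeatedly queries a most uncertain action $x \in \arg\max_{x' \in \mathcal{A}_\ell} \sigma(x')$ (recomputing $\sigma$ after each batch), and at the end of the phase an action $x$ is retained in $\mathcal{A}_{\ell+1}$ only if
\[
\mu_\ell(x) + \beta_\ell\, \sigma_\ell(x) \;\ge\; \max_{x' \in \mathcal{A}_\ell} \bigl(\mu_\ell(x') - \beta_\ell\, \sigma_\ell(x')\bigr),
\]
so that actions whose upper confidence bound falls below the best lower confidence bound are eliminated.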