迈向主动多分布学习的基本极限 (Towards Fundamental Limits for Active Multi-distribution Learning)

Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier's performance is measured by its error under the worst distribution. This problem has attracted a lot of recent interests due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remaining unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds, in distribution-dependent and distribution-free settings. Specifically, in the near-realizable setting we prove an upper bound of $\widetilde{O}\Bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$ and $\widetilde{O}\Bigl(\theta_{\max}(d+k)\Bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\Bigr)+\frac{k\nu}{\varepsilon^2}\Bigr)$ in the realizable and agnostic settings respectively, where $\theta_{\max}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $\nu$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish instance-dependent sample complexity bound for passive multidistribution learning that smoothly interpolates between realizable and agnostic regimes~\citep{blum2017collaborative,zhang2024optimal}, which may be of independent interest.

翻译：多分布学习将不可知近似正确（PAC）学习扩展到考虑一个包含$k$个分布$\{D_i\}_{i\in[k]}$的族，并以分类器在最差分布下的错误来衡量其性能的场景。由于其在协作学习、公平性和鲁棒性方面的应用，该问题近来引起了广泛关注。尽管被动多分布学习的样本复杂度已得到较为完整的刻画，但主动多分布学习的研究仍然稀少，现有算法的优劣性尚不明确。本文为主动多分布学习开发了新算法，并在分布依赖和分布无关两种设定下，建立了改进的标签复杂度上界与下界。具体而言，在近可实现设定中，我们分别在可实现和不可知设定下证明了$\widetilde{O}\Bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$和$\widetilde{O}\Bigl(\theta_{\max}(d+k)\Bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\Bigr)+\frac{k\nu}{\varepsilon^2}\Bigr)$的上界，其中$\theta_{\max}$是$k$个分布中的最大分歧系数，$d$是假设类的VC维，$\nu$是最优假设的多分布误差，$\varepsilon$是目标超额误差。此外，我们证明了可实现设定下的界在信息论意义下是最优的，并且不可知设定中的$k\nu/\varepsilon^2$项对于适定学习器是本质性的。我们还建立了被动多分布学习的实例依赖样本复杂度界，该界平滑地插值于可实现与不可知两种机制之间~\citep{blum2017collaborative,zhang2024optimal}，这可能具有独立的研究价值。