This paper uses classical high-rate quantization theory to provide new insights into mixture-of-experts (MoE) models for regression tasks. Our MoE is defined by a segmentation of the input space into regions, each with a single-parameter expert that acts as a constant predictor with zero compute cost at inference. Motivated by the assumptions of high-rate quantization theory, we assume that the number of experts is sufficiently large to make their input-space regions very small. This lets us study the approximation error of our MoE model class: (i) for one-dimensional inputs, we derive the test error and the segmentation and experts that minimize it; (ii) for multidimensional inputs, we derive an upper bound on the test error and study its minimization. Moreover, we consider the learning of the expert parameters from a training dataset, given an input-space segmentation, and characterize their statistical learning properties. This leads us to show, theoretically and empirically, how the tradeoff between approximation and estimation errors in MoE learning depends on the number of experts.
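As a concrete illustration of the model class described above, the following is a minimal sketch (not taken from the paper) of a piecewise-constant MoE regressor on one-dimensional inputs. It assumes a uniform segmentation of [0, 1) into K regions, whereas the paper studies the segmentation that minimizes the test error; each single-parameter expert is fitted as the mean of the training targets in its region, i.e., the least-squares constant predictor. All function and variable names here are illustrative.

```python
# Minimal sketch (illustrative, not the paper's implementation): a piecewise-
# constant MoE regressor on 1-D inputs with a uniform input-space segmentation.
import numpy as np

def fit_constant_experts(x_train, y_train, num_experts, lo=0.0, hi=1.0):
    """Fit one constant expert per region of a uniform segmentation of [lo, hi)."""
    edges = np.linspace(lo, hi, num_experts + 1)
    regions = np.clip(np.digitize(x_train, edges) - 1, 0, num_experts - 1)
    experts = np.zeros(num_experts)
    for k in range(num_experts):
        mask = regions == k
        # Least-squares constant predictor = mean of targets in the region;
        # empty regions fall back to the global training mean.
        experts[k] = y_train[mask].mean() if mask.any() else y_train.mean()
    return edges, experts

def predict(x, edges, experts):
    """Route each input to its region and return that expert's constant output."""
    k = np.clip(np.digitize(x, edges) - 1, 0, len(experts) - 1)
    return experts[k]

# Usage: approximate f(x) = sin(2*pi*x) from noisy samples with 32 experts.
rng = np.random.default_rng(0)
x_tr = rng.uniform(0.0, 1.0, 2000)
y_tr = np.sin(2 * np.pi * x_tr) + 0.1 * rng.normal(size=x_tr.size)
edges, experts = fit_constant_experts(x_tr, y_tr, num_experts=32)
x_te = rng.uniform(0.0, 1.0, 500)
test_mse = np.mean((predict(x_te, edges, experts) - np.sin(2 * np.pi * x_te)) ** 2)
```

In this toy setting, increasing `num_experts` shrinks the regions and hence the approximation error, while leaving fewer training samples per region and thus increasing the estimation error of each expert's mean, which is the tradeoff the abstract refers to.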