Transfer learning is an area of statistics and machine learning research that seeks answers to the following question: how do we build successful learning algorithms when the data available for training our model is qualitatively different from the data we hope the model will perform well on? In this thesis, we focus on a specific area of transfer learning called label shift, also known as quantification. In quantification, the aforementioned discrepancy is isolated to a shift in the distribution of the response variable. In such a setting, accurately inferring the response variable's new distribution is both an important estimation task in its own right and a crucial step for ensuring that the learning algorithm can adapt to the new data. We make two contributions to this field. First, we present a new procedure called SELSE, which estimates the shift in the response variable's distribution. Second, we prove that SELSE is semiparametrically efficient among a large family of quantification algorithms, i.e., SELSE's normalized error has the smallest possible asymptotic variance matrix among all algorithms in that family. This family includes nearly all existing algorithms, including ACC/PACC quantifiers and maximum-likelihood-based quantifiers such as EMQ and MLLS. Empirical experiments reveal that SELSE is competitive with, and in many cases outperforms, existing state-of-the-art quantification methods, and that this improvement is especially large when the number of test samples is far greater than the number of training samples.
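To make the estimation task concrete, the following is a minimal sketch of an EM-based maximum likelihood quantifier in the spirit of the EMQ/MLLS baselines mentioned above. It is not SELSE, whose construction is not given in this abstract. The function names, the toy Gaussian data model, and the assumption that exact training posteriors p_train(y | x) are available are all illustrative choices made for this example.

```python
import numpy as np


def em_label_shift(posteriors, train_prior, n_iter=1000, tol=1e-10):
    """Estimate the test-set class distribution by EM under label shift.

    posteriors:  (n_test, k) array of p_train(y | x) for each test point.
    train_prior: (k,) array of class frequencies in the training data.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    train_prior = np.asarray(train_prior, dtype=float)
    prior = train_prior.copy()
    for _ in range(n_iter):
        # E-step: re-weight the training posteriors by the current prior ratio.
        weighted = posteriors * (prior / train_prior)
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average adjusted posterior.
        new_prior = weighted.mean(axis=0)
        converged = np.abs(new_prior - prior).max() < tol
        prior = new_prior
        if converged:
            break
    return prior


def simulate_posteriors(n_test, test_prior_pos, rng):
    """Toy binary problem (assumed): x | y ~ N(2y - 1, 1), balanced training classes."""
    y = rng.random(n_test) < test_prior_pos
    x = rng.normal(np.where(y, 1.0, -1.0), 1.0)
    # Exact training posterior p_train(y = 1 | x) = sigmoid(2x) for this toy model.
    p1 = 1.0 / (1.0 + np.exp(-2.0 * x))
    return np.column_stack([1.0 - p1, p1])


rng = np.random.default_rng(0)
posteriors = simulate_posteriors(20_000, 0.8, rng)
estimate = em_label_shift(posteriors, train_prior=[0.5, 0.5])
print(estimate)  # estimated test distribution over classes (0, 1)
```

In this toy setup the training classes are balanced while 80% of the test points belong to class 1, so the returned vector should place roughly 0.8 of its mass on the second class. In practice the posteriors would come from a fitted, well-calibrated classifier rather than a known formula.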