Domain adaptation is a widely used paradigm in machine learning that addresses the divergence between a labeled dataset used to train and validate a classifier (the source domain) and a potentially large unlabeled dataset on which the model is deployed (the target domain). The task is to find a common representation of the source and target datasets in which the source dataset remains informative for training and the divergence between source and target is minimized. The most popular current solutions train neural networks that combine classification and adversarial learning modules; these are data-hungry and usually difficult to train. We present Domain Adaptation Principal Component Analysis (DAPCA), a method that finds a linear reduced data representation useful for solving the domain adaptation task. DAPCA introduces positive and negative weights between pairs of data points and generalizes the supervised extension of principal component analysis. It is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. Convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the algorithm on previously proposed domain adaptation benchmarks and also show the benefit of using DAPCA in the analysis of single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a useful preprocessing step in many machine learning applications, producing reduced dataset representations that take into account possible divergence between source and target domains.
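The quadratic step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as in supervised PCA with pairwise weights, that each iteration maximizes the weighted sum of squared projected distances, sum_ij W_ij ||P(x_i - x_j)||^2, which reduces to an eigenproblem for Q = 2 X^T (D - W) X, where D is the diagonal matrix of row sums of W. The function names `weighted_pairwise_pca` and `supervised_weights`, and the parameter `alpha`, are hypothetical and not taken from the paper. The iterative re-weighting of source-target pairs that DAPCA performs is not shown.

```python
import numpy as np

def weighted_pairwise_pca(X, W, n_components):
    """Sketch of one quadratic step (an assumption, not the authors' code).

    X: (n, d) data matrix, one point per row.
    W: (n, n) pairwise weights; positive = repulsion, negative = attraction.
    Returns the top eigenvectors (d, n_components) and their eigenvalues of
    Q = sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = 2 X^T (D - W) X.
    """
    W = 0.5 * (W + W.T)                    # enforce symmetric weights
    L = np.diag(W.sum(axis=1)) - W         # Laplacian-like matrix D - W
    Q = 2.0 * X.T @ L @ X                  # (d, d) quadratic form
    Q = 0.5 * (Q + Q.T)                    # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(Q)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]

def supervised_weights(labels, alpha=1.0):
    """Hypothetical weighting: +1 between classes, -alpha within a class."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, -alpha, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = np.repeat([0, 1, 2], 20)
    X[:, 0] += 3.0 * y                     # classes separated along axis 0
    V, vals = weighted_pairwise_pca(X, supervised_weights(y), 2)
    Z = X @ V                              # reduced representation
    print(Z.shape)
```

With all weights equal to one, Q is proportional to the covariance matrix, so this sketch recovers the ordinary PCA directions; nonuniform weights bias the projection toward separating or merging the chosen pairs.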