Low-dimensional embeddings for data from disparate sources play critical roles in multi-modal machine learning, multimedia information retrieval, and bioinformatics. In this paper, we propose a supervised dimensionality reduction method that learns linear embeddings jointly for two feature vectors representing data of different modalities or data from distinct types of entities. We also propose an efficient feature selection method that complements, and can be applied prior to, our joint dimensionality reduction method. Assuming that there exist true linear embeddings for these features, our analysis of the error in the learned linear embeddings provides theoretical guarantees that the dimensionality reduction method accurately estimates the true embeddings when certain technical conditions are satisfied and the number of samples is sufficiently large. The derived sample complexity results are echoed by numerical experiments. We apply the proposed dimensionality reduction method to gene-disease association, and predict unknown associations using kernel regression on the dimension-reduced feature vectors. Our approach compares favorably against other dimensionality reduction methods, and against a state-of-the-art method of bilinear regression for predicting gene-disease associations.