In molecular and biological sciences, experiments are expensive, time-consuming, and often subject to ethical constraints. Consequently, one often faces the challenging task of predicting desirable properties from small or scarcely labeled data sets. Although transfer learning can be advantageous, it requires the existence of a related large data set. This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder, to handle scarcely labeled data sets. In addition, a consensus technique is detailed. The proposed models are validated on five benchmark data sets. We also provide a thorough comparison with competing methods, such as support vector machines, random forests, and gradient boosting decision trees, which are known for their good performance on small data sets. The performance of the various methods is analyzed using residue-similarity (R-S) scores and R-S indices. Extensive computational experiments and theoretical analysis show that the new models perform very well even when as little as 1% of the data set is used as labeled data.