In recent decades, science and engineering have been revolutionized by a momentous growth in the amount of available data. However, despite the unprecedented ease with which data are now collected and stored, labeling data by supplementing each feature with an informative tag remains challenging. Illustrative tasks where the labeling process requires expert knowledge or is tedious and time-consuming include labeling X-rays with a diagnosis, protein sequences with a protein type, texts by their topic, tweets by their sentiment, or videos by their genre. In these and numerous other examples, only a few features may be manually labeled due to cost and time constraints. How can we best propagate label information from a small number of expensive labeled features to a vast number of unlabeled ones? This is the question addressed by semi-supervised learning (SSL). This article gives an overview of recent foundational developments in graph-based Bayesian SSL, a probabilistic framework for label propagation using similarities between features. SSL is an active research area, and a thorough review of the extant literature is beyond the scope of this article. Our focus will be on topics drawn from our own research that illustrate the wide range of mathematical tools and ideas that underlie the rigorous study of the statistical accuracy and computational efficiency of graph-based Bayesian SSL.