Learning on big data has brought success to artificial intelligence (AI), but the annotation and training costs are expensive. In the future, learning on small data will be one of the ultimate goals of AI, requiring machines to recognize objects and scenarios from small data, as humans do. A series of machine learning models follows this direction, such as active learning, few-shot learning, and deep clustering. However, few theoretical guarantees exist for their generalization performance. Moreover, most of their settings are passive; that is, the label distribution is explicitly controlled by one specified sampling scenario. This survey follows agnostic active sampling under a PAC (Probably Approximately Correct) framework to analyze the generalization error and label complexity of learning on small data in both supervised and unsupervised fashions. With these theoretical analyses, we categorize the small data learning models from two geometric perspectives, the Euclidean and non-Euclidean (hyperbolic) mean representations, and present and discuss their optimization solutions. We then summarize and analyze some potential learning scenarios that may benefit from learning on small data. Finally, we survey some challenging applications, such as computer vision and natural language processing, that may also benefit from learning on small data.
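To make the label-complexity contrast between active and passive sampling concrete, here is a minimal toy sketch (not from the survey itself, and simpler than its agnostic PAC setting): learning a 1-D threshold classifier over a pool of points. Passive learning labels every point, costing n queries, while active learning binary-searches the decision boundary with O(log n) queries. All function names here are illustrative.

```python
def passive_threshold(points, oracle):
    """Passive learning: query the label of every point in the pool.

    Returns the estimated boundary (smallest positive point) and the
    number of label queries used, which is always len(points)."""
    labels = [(x, oracle(x)) for x in points]
    positives = [x for x, y in labels if y == 1]
    boundary = min(positives) if positives else 1.0
    return boundary, len(points)

def active_threshold(points, oracle):
    """Active learning: binary-search for the smallest positive point.

    Because a threshold classifier's labels are monotone along the line,
    binary search locates the boundary with O(log n) label queries."""
    pts = sorted(points)
    lo, hi = 0, len(pts)   # boundary index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pts[mid]) == 1:
            hi = mid       # boundary is at mid or to its left
        else:
            lo = mid + 1   # boundary is strictly to the right
    boundary = pts[lo] if lo < len(pts) else 1.0
    return boundary, queries

# Toy pool: a grid on [0, 1); hidden threshold t = 0.375.
pool = [i / 1000 for i in range(1000)]
oracle = lambda x: 1 if x >= 0.375 else 0

b_passive, q_passive = passive_threshold(pool, oracle)
b_active, q_active = active_threshold(pool, oracle)
```

On this pool, both learners recover the boundary 0.375, but the passive learner spends 1000 label queries where the active learner spends about ten, mirroring the exponential gap (n versus log n) that PAC-style label-complexity analyses formalize.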