Enormous waves of technological innovation over the past several years, marked by advances in AI, are profoundly reshaping industry and society. A key challenge lies ahead, however: our ability to meet rapidly growing, scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficulty stems, in essence, from a limitation of the mainstream learning paradigm: for each new scenario, we must train a new model on a large quantity of well-annotated data, commonly from scratch. To tackle this fundamental problem, we move beyond that paradigm and develop a new one named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained develops strong generalizability. We evaluate our model on 26 well-known datasets covering four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform counterparts trained on the full set of data, often by a significant margin. This is an important step towards a promising prospect in which a model with general vision capability dramatically reduces our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which together form a general vision ecosystem to support its future development in an open and inclusive manner.
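The evaluation protocol above adapts models using only 10% of each target domain's training data. As a minimal sketch of that subsampling step only (not the authors' code; the toy dataset and all variable names here are hypothetical), one might draw a random 10% subset like so:

```python
import numpy as np

# Toy stand-in for a target-domain training set: 1000 samples, 16 features,
# labels drawn from 4 classes. All of this is illustrative, not INTERN's data.
rng = np.random.default_rng(0)
num_samples = 1000
features = rng.normal(size=(num_samples, 16))
labels = rng.integers(0, 4, size=num_samples)

# Keep only 10% of the training data for adaptation, sampled without replacement.
subset_size = num_samples // 10
subset_idx = rng.choice(num_samples, size=subset_size, replace=False)
X_small, y_small = features[subset_idx], labels[subset_idx]

print(X_small.shape)  # (100, 16)
```

The adapted model would then be fine-tuned on `X_small`/`y_small` and evaluated against a counterpart trained on the full set; that comparison is domain- and model-specific and is omitted here.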