This paper sustains the position that the time has come to think of learning machines that acquire visual skills in a truly human-like context, where only a few human-like object supervisions are provided through vocal interactions and pointing aids. This likely requires new foundations for the computational processes of vision, with the ultimate purpose of engaging machines in visual description tasks as they live in their own visual environment under simple man-machine linguistic interactions. The challenge consists of developing machines that learn to see without needing to handle visual databases. This might open the doors to a truly orthogonal, competitive track with respect to deep learning technologies for vision, one that does not rely on the accumulation of huge visual databases.