This is an opinion paper. Our key message is that current visual recognition systems are far from complete, in the sense that they cannot yet recognize everything that humans can recognize, and that this gap is very unlikely to be bridged by continually adding human annotations. Based on this observation, we advocate a new type of pre-training task named learning-by-compression. A computational model (e.g., a deep network) is optimized to represent visual data with compact features, while those features preserve the ability to recover the original data. Semantic annotations, when available, play the role of weak supervision. An important yet challenging issue is the evaluation of image recovery, for which we suggest some design principles and future research directions. We hope our proposal can inspire the community to pursue the compression-recovery tradeoff rather than the accuracy-complexity tradeoff.
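To make the compression-recovery tradeoff concrete, the following is a toy sketch only. It assumes a fixed uniform quantizer as a stand-in for the learned model, a flat list of random 8-bit values as a stand-in for an image, and mean squared error as a placeholder recovery metric. None of these choices comes from the proposal itself, and the function names `compress`, `recover`, and `recovery_error` are hypothetical.

```python
import random


def compress(pixels, bits):
    """Keep only the top `bits` bits of each 8-bit value (the compact code)."""
    shift = 8 - bits
    return [p >> shift for p in pixels]


def recover(codes, bits):
    """Map each code back to the centre of its quantization bin."""
    shift = 8 - bits
    half = (1 << shift) >> 1
    return [(c << shift) + half for c in codes]


def recovery_error(pixels, bits):
    """Mean squared error between the original and the recovered values."""
    recovered = recover(compress(pixels, bits), bits)
    return sum((p - r) ** 2 for p, r in zip(pixels, recovered)) / len(pixels)


# Assumption: random values stand in for real image data.
random.seed(0)
image = [random.randrange(256) for _ in range(1024)]

# Fewer bits per value means stronger compression and a larger recovery error.
errors = {bits: recovery_error(image, bits) for bits in (1, 2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits} bits per value -> recovery MSE {err:.2f}")
```

In a learned setting, the quantizer would be replaced by an optimized encoder and decoder, and plain MSE would likely be replaced by a recovery measure that better reflects the evaluation concerns raised above.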