Self-supervised learning methods overcome a key bottleneck in building more capable AI: the limited availability of labeled data. However, a drawback of self-supervised architectures is that the representations they learn are implicit, making it hard to extract meaningful information about the encoded world state, such as the 3D structure of a visual scene encoded in a depth map. Moreover, in the visual domain such representations rarely undergo evaluations that may be critical for downstream tasks, such as vision for autonomous cars. Here, we propose a framework for evaluating visual representations for illumination invariance in the context of depth perception. We develop a novel architecture that extends the predictive coding approach, the PRedictive Lateral bottom-Up and top-Down Encoder-decoder Network (PreludeNet), together with a hybrid fully-supervised/self-supervised learning method; the network explicitly learns to infer and predict depth from video frames. In PreludeNet, the encoder's stack of predictive coding layers is trained in a self-supervised manner, while the predictive decoder is trained in a supervised manner to infer or predict depth. We evaluate the robustness of our model on a new synthetic dataset in which lighting conditions (such as overall illumination and the effect of shadows) can be parametrically adjusted while all other aspects of the world are held constant. PreludeNet achieves both competitive depth inference performance and next-frame prediction accuracy. We further show how this architecture, coupled with the hybrid fully-supervised/self-supervised learning method, balances that performance against invariance to changes in lighting. The proposed framework for evaluating visual representations can be extended to diverse task domains and invariance tests.
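The hybrid objective described above combines a self-supervised term (next-frame prediction error, requiring no labels) with a supervised term (depth error against ground truth). The following is a minimal illustrative sketch of such a combined loss; the tensor shapes, the `mse` helper, and the weighting hyperparameter `lam` are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy arrays standing in for model outputs (hypothetical 8x8 shapes;
# the real model operates on video frames and dense depth maps).
pred_next_frame = rng.random((8, 8))   # encoder's next-frame prediction
true_next_frame = rng.random((8, 8))   # actual next frame
pred_depth      = rng.random((8, 8))   # decoder's inferred depth map
true_depth      = rng.random((8, 8))   # ground-truth depth (supervision)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

# Self-supervised term: next-frame prediction error (no labels needed).
l_selfsup = mse(pred_next_frame, true_next_frame)

# Supervised term: depth inference error against ground-truth depth.
l_sup = mse(pred_depth, true_depth)

# Hybrid objective: weighted sum of the two terms
# (lam is an assumed hyperparameter trading off the two goals).
lam = 0.5
l_total = l_selfsup + lam * l_sup
```

In a real training loop, the self-supervised term could be applied to all video frames while the supervised term is applied only where depth labels exist, which is one way such a hybrid scheme can exploit unlabeled data.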