We study active object tracking, where a tracker takes visual observations (i.e., a frame sequence) as input and produces camera control signals (e.g., move forward, turn left) as output. Conventional methods tackle tracking and camera control separately, which makes joint tuning challenging; they also incur substantial human effort for labeling and expensive trial-and-error in the real world. To address these issues, we propose an end-to-end solution via deep reinforcement learning, in which a ConvNet-LSTM function approximator is adopted for direct frame-to-action prediction. We further propose an environment augmentation technique and a customized reward function, both of which are crucial for successful training. The tracker, trained in simulators (ViZDoom, Unreal Engine), generalizes well to unseen object moving paths, unseen object appearances, unseen backgrounds, and distracting objects. It can also recover tracking after occasionally losing the target. Through experiments on the VOT dataset, we further find that the tracking ability, obtained solely from simulators, can potentially transfer to real-world scenarios.
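The abstract does not give the exact form of the customized reward, but a shaped tracking reward of this kind typically peaks when the target sits centered at a desired distance in front of the camera and decays with positional and angular error. The following is a minimal illustrative sketch, not the paper's actual formula; the variable names, the desired distance `d_star`, and the weighting `lam` are all assumptions for illustration.

```python
import math

def tracking_reward(x, y, omega, d_star=2.0, lam=0.5, A=1.0):
    """Hypothetical shaped reward for active tracking (not the paper's exact form).

    (x, y): target position in the tracker's local frame (x forward, y lateral).
    omega:  target's angular deviation from the camera's optical axis.
    d_star: desired tracking distance; A: maximum reward.

    The reward is maximal (== A) when the target is exactly d_star ahead
    and centered, and decreases linearly with distance and angular error.
    """
    dist_err = math.hypot(x - d_star, y)  # Euclidean error from the ideal position
    return A - (dist_err + lam * abs(omega))

# Target centered at the desired distance yields the maximum reward:
print(tracking_reward(2.0, 0.0, 0.0))  # 1.0
```

A reward of this shape gives the reinforcement learner a dense signal at every frame, rather than a sparse success/failure outcome, which is one reason a customized reward can be crucial for training an end-to-end tracker.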