Bounding-box annotation has been the most frequently used form of supervision for visual object localization. However, it requires a large number of precisely annotated bounding boxes, which is expensive and laborious. This makes it impractical in real-world scenarios, and even redundant for applications where object size does not matter, such as tiny person localization. We therefore propose a novel point-based framework for person localization in which each person is annotated with a coarse point (CoarsePoint), which can be any point within the object extent, instead of an accurate bounding box. The network then predicts each person's location as a 2D coordinate in the image. Although this greatly simplifies the data annotation pipeline, CoarsePoint annotation inevitably reduces label reliability (label uncertainty) and confuses the network during training. To address this, we propose a point self-refinement approach that iteratively updates the point annotations in a self-paced way. The proposed refinement alleviates label uncertainty and progressively improves localization performance. Experimental results show that our approach achieves comparable object localization performance while saving up to 80$\%$ of the annotation cost.
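To make the idea of a coarse point concrete, the following minimal sketch simulates CoarsePoint-style labels from existing box annotations by drawing one random point inside each box. It is an illustration only, not the paper's annotation or refinement procedure: the axis-aligned box is used as a stand-in for the object extent, the uniform sampling is an assumption, and the function names `sample_coarse_point` and `coarse_points_from_boxes` are hypothetical.

```python
import random


def sample_coarse_point(box, rng):
    """Draw one (x, y) point uniformly inside an axis-aligned box.

    `box` is (x_min, y_min, x_max, y_max). Using the box as the object
    extent and sampling uniformly are assumptions made for illustration.
    """
    x_min, y_min, x_max, y_max = box
    return (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))


def coarse_points_from_boxes(boxes, seed=0):
    """Turn a list of boxes into one simulated coarse point per box."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    return [sample_coarse_point(box, rng) for box in boxes]


if __name__ == "__main__":
    # Two small hypothetical "tiny person" boxes in pixel coordinates.
    boxes = [(10.0, 20.0, 16.0, 34.0), (100.0, 50.0, 104.0, 60.0)]
    for box, point in zip(boxes, coarse_points_from_boxes(boxes)):
        print(box, "->", point)
```

Because two annotators (or two random draws) can place the point anywhere inside the same object, labels for similar objects differ, which is the label uncertainty that the refinement step is meant to reduce.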