Images and videos usually contain main subjects: the objects that the photographer wants to highlight. Human viewers identify them easily, but algorithms often confuse them with other objects. Detecting main subjects is therefore an important step toward helping machines understand the content of images and videos. We present a new dataset, ImageSubject, for training models to understand the layout of objects and the context of an image, and then to find the main subjects among them. This is achieved in three aspects. First, by gathering images from movie shots created by directors with professional shooting skills, we collect a highly diverse dataset of 107,700 images from 21,540 movie shots. Second, we annotate the images with bounding boxes for two classes: subject and non-subject foreground object. Third, we present a detailed analysis of the dataset and compare the task with saliency detection and object detection. ImageSubject is the first dataset that aims to localize the subject that the photographer wants to highlight in an image. Moreover, we find that a transformer-based detection model achieves the best results among popular model architectures. Finally, we discuss potential applications and conclude with the importance of the dataset.