Humans describe images in terms of nouns and adjectives, while algorithms operate on images represented as sets of pixels. Bridging this gap between how humans would like to access images and their typical representation is the goal of image parsing, which involves assigning object and attribute labels to pixels. In this paper we propose treating nouns as object labels and adjectives as visual attribute labels. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive-time) solution. Using the extracted labels as handles, our system empowers a user to verbally refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that could be used to interact with new-generation devices (e.g. smartphones, Google Glass, living room devices). We demonstrate our system on a large number of real-world images of varying complexity. To help understand the tradeoffs compared to traditional mouse-based interactions, results are reported for both a large-scale quantitative evaluation and a user study.