While wearable cameras are becoming increasingly popular, locating relevant information in large unstructured collections of egocentric images is still a tedious and time-consuming process. This paper addresses the problem of organizing egocentric photo streams acquired by a wearable camera into semantically meaningful segments. First, contextual and semantic information is extracted for each image by employing a Convolutional Neural Network approach. Later, by integrating language processing, a vocabulary of concepts is defined in a semantic space. Finally, by exploiting the temporal coherence of photo streams, images that share contextual and semantic attributes are grouped together. The resulting temporal segmentation is particularly suited for further analysis, ranging from activity and event recognition to semantic indexing and summarization. Experiments over egocentric sets of nearly 17,000 images show that the proposed approach outperforms state-of-the-art methods.
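To illustrate the final grouping step, the sketch below shows one simple way to segment a temporally ordered stream of per-image feature vectors: a new segment starts whenever the semantic similarity between consecutive frames drops below a threshold. This is only a minimal illustration of the idea of exploiting temporal coherence; the function names, the cosine similarity measure, and the threshold are assumptions for the example, not the paper's actual method.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors (assumed non-zero).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def temporal_segments(features, threshold=0.5):
    """Group consecutive frames whose semantic features stay similar.

    features: list of per-image feature vectors, in temporal order.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    if not features:
        return []
    segments = []
    start = 0
    for i in range(1, len(features)):
        # A sharp drop in similarity marks a candidate segment boundary.
        if cosine(features[i - 1], features[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(features)))
    return segments
```

For example, a stream whose features shift abruptly mid-sequence, such as `[[1, 0], [1, 0.1], [0, 1], [0, 1]]`, would be split into two segments at the point of change.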