Humans are arguably among the most important subjects in video streams; many real-world applications, such as video summarization or video editing workflows, often require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person re-identification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio (voices) and visual (faces) streams are densely annotated. APES contains over 1.9K identities labeled along 36 hours of video, making it the largest dataset available for untrimmed audiovisual person search. A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity. To showcase the potential of our new dataset, we propose an audiovisual baseline and benchmark for person retrieval. Our study shows that modeling audiovisual cues benefits the recognition of people's identities. To enable reproducibility and promote future research, the dataset annotations and baseline code are available at: https://github.com/fuankarion/audiovisual-person-search