APES: Audiovisual Person Search in Untrimmed Video

Humans are arguably one of the most important subjects in video streams, and many real-world applications, such as video summarization or video editing workflows, require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person re-identification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio (voices) and visual (faces) streams are densely annotated. APES contains over 1.9K identities labeled across 36 hours of video, making it the largest dataset available for untrimmed audiovisual person search. A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity. To showcase the potential of our new dataset, we propose an audiovisual baseline and benchmark for person retrieval. Our study shows that modeling audiovisual cues benefits the recognition of people's identities. To enable reproducibility and promote future research, the dataset annotations and baseline code are available at: https://github.com/fuankarion/audiovisual-person-search