iQIYI-VID: A Large Dataset for Multi-modal Person Identification

Yuanliu Liu, Peipei Shi, Bo Peng, He Yan, Yong Zhou, Bing Han, Yi Zheng, Chao Lin, Jianbin Jiang, Yin Fan, Tingwei Gao, Ganwen Wang, Jian Liu, Xiangju Lu, Danming Xie

Person identification in the wild is very challenging due to large variations in pose, face quality, clothing, makeup, and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modality, which is inadequate for handling all the situations that arise in practice. Multi-modal person identification is a more promising approach, in which face, head, body, and audio features are used jointly. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, and TV series to news broadcasts. All video clips pass through a careful human annotation process, and the label error rate is lower than 0.2%. We evaluated state-of-the-art models of face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from perfect for the task of person identification in the wild. We further demonstrate that a simple fusion of multi-modal features can improve person identification considerably. We have released the dataset online to promote multi-modal person identification research.
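The abstract does not say how the multi-modal features are fused, so the following is only a minimal sketch of one common baseline, not the authors' method: L2-normalize each modality's embedding, concatenate them, and match a query clip against a gallery by cosine similarity. The function names (`fuse`, `identify`), the modality keys (`face`, `audio`), and the toy vectors are all hypothetical, and the sketch assumes every clip provides the same set of modalities.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(features):
    """Concatenate per-modality embeddings after normalizing each one.

    `features` maps a modality name to its embedding. Modalities are
    visited in sorted order so every clip produces the same layout.
    Assumption: all clips provide the same set of modalities.
    """
    fused = []
    for name in sorted(features):
        fused.extend(l2_normalize(features[name]))
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def identify(query, gallery):
    """Return the gallery identity whose fused embedding is closest to the query."""
    q = fuse(query)
    return max(gallery, key=lambda pid: cosine(q, fuse(gallery[pid])))


# Toy, made-up embeddings purely for illustration.
gallery = {
    "person_a": {"face": [1.0, 0.0], "audio": [0.0, 1.0]},
    "person_b": {"face": [0.0, 1.0], "audio": [1.0, 0.0]},
}
query = {"face": [0.9, 0.1], "audio": [0.2, 0.8]}
best = identify(query, gallery)
```

Normalizing each modality before concatenation keeps one modality with large-magnitude features from dominating the similarity score; a weighted sum of per-modality scores would be an equally plausible alternative.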


