A Big List of Video Analysis / Multimodal Learning Papers, Code, and Datasets



July 13, 2019 · Zhuanzhi (专知)

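[Introduction] This article collects research on video analysis, especially multimodal learning on video, with links to papers, code, and datasets. Topics covered: multimodal video analysis, video moment localization, video retrieval, video advertisements, commonsense reasoning, video highlight prediction, object tracking, audio-visual dialog, and action recognition.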

Datasets

AVA dataset

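Provides spatio-temporally localized annotations of human actions in video, to support research on understanding human activity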

https://research.google.com/ava/index.html

PyVideoResearch

A repository for video research, covering common methods, datasets, and tasks

https://github.com/gsig/PyVideoResearch

How2 Dataset

A large-scale dataset of instructional videos for multimodal language understanding

https://arxiv.org/pdf/1811.00347.pdf

https://github.com/srvk/how2-dataset

Moments in Time dataset (large-scale short video clips for action and event understanding)

https://github.com/metalbubble/moments_models

http://moments.csail.mit.edu/

Pretrained video and image models for PyTorch

https://github.com/alexandonian/pretorched-x

YouTube-8M Segments dataset

https://ai.googleblog.com/2019/06/announcing-youtube-8m-segments-dataset.html

Tools

DCASE

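Utility functions for acoustic scene classification and sound event detection (DCASE: Detection and Classification of Acoustic Scenes and Events)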

https://dcase-repo.github.io/dcase_util/index.html

Papers

Action recognition

Long-Term Feature Banks for Detailed Video Understanding (CVPR2019)

https://arxiv.org/pdf/1812.05038.pdf

https://github.com/facebookresearch/video-long-term-feature-banks

Deep Learning for Video Classification and Captioning

https://arxiv.org/pdf/1609.06782.pdf

Large-scale Video Classification with Convolutional Neural Networks

https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/42455.pdf

Learning Spatiotemporal Features with 3D Convolutional Networks

http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Tran_Learning_Spatiotemporal_Features_ICCV_2015_paper.pdf
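The core operation in C3D-style networks is a convolution whose kernel extends over time as well as space (the paper reports 3×3×3 kernels working best). Below is a minimal pure-Python sketch of a single-channel 3D convolution, implemented as cross-correlation as most deep-learning libraries do, with stride 1 and no padding. The function name `conv3d_valid` and the nested-list layout `[t][y][x]` are illustrative assumptions, not C3D's code.

```python
from typing import List

# A single-channel video volume indexed as volume[t][y][x].
Volume = List[List[List[float]]]


def conv3d_valid(volume: Volume, kernel: Volume) -> Volume:
    """3D cross-correlation with stride 1 and no padding ("valid" mode).

    The kernel slides jointly over time, height and width, so every output
    value mixes pixels from several neighbouring frames.
    """
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    vt, vh, vw = len(volume), len(volume[0]), len(volume[0][0])
    if kt > vt or kh > vh or kw > vw:
        raise ValueError("kernel is larger than the input volume")
    out: Volume = []
    for t in range(vt - kt + 1):
        plane = []
        for y in range(vh - kh + 1):
            row = []
            for x in range(vw - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Four 5x5 frames convolved with a 3x3x3 kernel give a 2x3x3 output.
frames = [[[1.0] * 5 for _ in range(5)] for _ in range(4)]
box = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
result = conv3d_valid(frames, box)
print(len(result), len(result[0]), len(result[0][0]))  # 2 3 3
```

A 2×1×1 kernel `[[[-1.0]], [[1.0]]]` responds only to change between consecutive frames. This is a simple way to see why temporal extent lets the network model motion.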

Two-Stream Convolutional Networks for Action Recognition in Videos

https://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf
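The two-stream paper combines an RGB (spatial) network and an optical-flow (temporal) network by late fusion. One fusion method it evaluates is averaging the two streams' softmax scores; the other trains a linear SVM on the stacked scores. Below is a minimal sketch of the averaging variant. The `temporal_weight` parameter and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits: Sequence[float],
                temporal_logits: Sequence[float],
                temporal_weight: float = 0.5) -> List[float]:
    """Weighted average of the two streams' softmax outputs.

    temporal_weight=0.5 is plain averaging; other values are a hypothetical
    knob for trusting the optical-flow stream more or less.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1.0 - w) * a + w * b for a, b in zip(p_spatial, p_temporal)]


def predict(scores: Sequence[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical logits: the RGB stream slightly prefers class 0,
# the flow stream strongly prefers class 2.
spatial = [2.0, 1.0, 1.9]
temporal = [0.0, 0.0, 4.0]
fused = late_fusion(spatial, temporal)
print(predict(softmax(spatial)), predict(fused))  # 0 2
```

Fusing probabilities instead of raw logits keeps the two streams on a comparable scale, even when one network produces much larger scores than the other.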

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Wang_Action_Recognition_With_2015_CVPR_paper.pdf

Non-local neural networks

http://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.pdf
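A non-local block computes each position's response as a weighted sum over all space-time positions, y_i = Σ_j softmax_j(θ(x_i)·φ(x_j)) g(x_j). It then adds that response back through a residual connection, z_i = W_z y_i + x_i. The sketch below implements this "embedded Gaussian" variant in pure Python on a flattened list of positions. The caller supplies the weight matrices as plain lists. The sketch omits the batching, BatchNorm, and subsampling that a real implementation would use.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[out_dim][in_dim]


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(wk * vk for wk, vk in zip(row, v)) for row in w]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_block(x: List[Vector], w_theta: Matrix, w_phi: Matrix,
                    w_g: Matrix, w_z: Matrix) -> List[Vector]:
    """Embedded-Gaussian non-local operation with a residual connection.

    x holds one C-dim feature per space-time position (T*H*W positions,
    flattened). w_theta, w_phi, w_g map C -> C'; w_z maps C' -> C.
    """
    theta = [matvec(w_theta, xi) for xi in x]
    phi = [matvec(w_phi, xj) for xj in x]
    g = [matvec(w_g, xj) for xj in x]
    out = []
    for i, xi in enumerate(x):
        # Attention of position i over every position j (including itself).
        attn = softmax([dot(theta[i], phi_j) for phi_j in phi])
        y_i = [sum(a * g_j[k] for a, g_j in zip(attn, g)) for k in range(len(g[0]))]
        # Residual connection: z_i = W_z y_i + x_i.
        out.append([r + xk for r, xk in zip(matvec(w_z, y_i), xi)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
print(non_local_block(x, identity, identity, identity, zeros) == x)  # True
```

Passing an all-zero `w_z` returns the input unchanged. This is why initializing the output transform to zero lets the block be inserted into a pretrained network without altering its initial behavior.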

Learning Correspondence from the Cycle-consistency of Time

https://arxiv.org/pdf/1903.07593.pdf

https://github.com/xiaolonw/TimeCycle

3D ConvNets in PyTorch

https://github.com/Tushar-N/pytorch-resnet3d

Multimodal video analysis

Awesome reading list for multimodal machine learning

https://github.com/pliang279/multimodal-ml-reading-list

VideoBERT: A Joint Model for Video and Language Representation Learning

https://arxiv.org/abs/1904.01766

AENet: Learning Deep Audio Features for Video Analysis

https://arxiv.org/pdf/1701.00599.pdf

https://github.com/znaoya/aenet

Look, Listen and Learn

https://arxiv.org/pdf/1705.08168.pdf
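Look, Listen and Learn trains on an audio-visual correspondence task: given a video frame and a short audio clip, predict whether the two come from the same video. Positives pair a frame with audio from the same video; negatives pair it with audio from a different video. Below is a minimal sketch of that sampling step. Hypothetical string clip IDs stand in for actual frames and audio, and the 50/50 positive/negative split is a choice made for this example.

```python
import random
from typing import List, Tuple


def make_avc_pairs(clip_ids: List[str], num_pairs: int,
                   seed: int = 0) -> List[Tuple[str, str, int]]:
    """Build (frame_source, audio_source, label) triples for the
    audio-visual correspondence task.

    Label 1 means the frame and audio come from the same clip; 0 means they
    were deliberately mismatched. Even-indexed pairs are positives and
    odd-indexed pairs negatives, giving a 50/50 split (a sketch choice).
    """
    if len(clip_ids) < 2:
        raise ValueError("need at least two clips to form negatives")
    rng = random.Random(seed)  # seeded for reproducible sampling
    pairs = []
    for k in range(num_pairs):
        frame_clip = rng.choice(clip_ids)
        if k % 2 == 0:
            pairs.append((frame_clip, frame_clip, 1))
        else:
            others = [c for c in clip_ids if c != frame_clip]
            pairs.append((frame_clip, rng.choice(others), 0))
    return pairs


pairs = make_avc_pairs(["clip_a", "clip_b", "clip_c"], num_pairs=6)
for frame_src, audio_src, label in pairs:
    print(frame_src, audio_src, label)
```

The labels come for free from the video itself, so no manual annotation is needed. This is what makes the task usable as self-supervision.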

Objects that Sound

https://arxiv.org/pdf/1712.06651

Learning to Separate Object Sounds by Watching Unlabeled Video

https://arxiv.org/pdf/1804.01665.pdf

Ambient Sound Provides Supervision for Visual Learning

http://www.eccv2016.org/files/posters/O-1B-01.pdf

Video moment localization

Localizing Moments in Video with Natural Language

https://arxiv.org/pdf/1708.01641.pdf

https://github.com/LisaAnne/LocalizingMoments
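This paper introduces the DiDeMo dataset, in which videos are split into 5-second segments (at most six). A moment is any contiguous run of those segments, which gives 21 candidates, and predictions are scored with temporal IoU. The sketch below enumerates the candidates and computes tIoU over inclusive segment indices. `recall_at_k` is a simplified single-ground-truth check assumed for illustration. It is not the paper's exact protocol, which accounts for multiple human annotations per query.

```python
from typing import List, Sequence, Tuple

Moment = Tuple[int, int]  # (first_segment, last_segment), both inclusive


def candidate_moments(num_segments: int = 6) -> List[Moment]:
    """All contiguous runs of segments."""
    return [(s, e) for s in range(num_segments) for e in range(s, num_segments)]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two inclusive segment ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def recall_at_k(ranked: Sequence[Moment], gold: Moment, k: int,
                iou_threshold: float = 1.0) -> bool:
    """Simplified check: does any top-k moment match the single gold moment?"""
    return any(temporal_iou(m, gold) >= iou_threshold for m in ranked[:k])


moments = candidate_moments()
print(len(moments))                  # 21
print(temporal_iou((0, 1), (1, 2)))  # 0.3333333333333333
```

Because the candidate set is small and fixed, a model can simply score all 21 moments against the sentence and rank them.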

Video retrieval

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

https://arxiv.org/pdf/1804.02516.pdf

https://github.com/antoine77340/Mixture-of-Embedding-Experts
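The Mixture-of-Embedding-Experts model scores a text-video pair as a weighted sum of per-modality similarities (e.g. appearance, motion, audio), with the weights predicted from the text. When a video lacks a modality, that expert is dropped and the remaining weights are renormalized; this is how the model handles incomplete data. Below is a minimal sketch of that scoring step. It assumes the gating logits and per-expert similarities have already been computed, and its dictionary-based interface is hypothetical.

```python
import math
from typing import Dict, Optional


def mee_similarity(gate_logits: Dict[str, float],
                   expert_sims: Dict[str, Optional[float]]) -> float:
    """Combine per-expert text-video similarities with text-predicted weights.

    gate_logits: one gating score per expert, computed from the text query.
    expert_sims: similarity from each expert, or None if the video lacks
    that modality (e.g. no audio track). Missing experts are dropped and the
    softmax over the remaining gates renormalizes the weights.
    """
    available = [name for name, sim in expert_sims.items() if sim is not None]
    if not available:
        raise ValueError("the video has no usable modality")
    m = max(gate_logits[name] for name in available)
    exps = {name: math.exp(gate_logits[name] - m) for name in available}
    total = sum(exps.values())
    return sum(exps[name] / total * expert_sims[name] for name in available)


gates = {"appearance": 0.0, "motion": 0.0, "audio": 0.0}  # equal gate scores
print(mee_similarity(gates, {"appearance": 0.8, "motion": 0.4, "audio": None}))
```

Renormalizing over the available experts keeps scores comparable between videos that have audio and videos that do not, so both can be ranked against the same query.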

Cross-Modal and Hierarchical Modeling of Video and Text

https://arxiv.org/pdf/1810.07212.pdf

A Dataset for Movie Description

https://arxiv.org/pdf/1501.02530.pdf

Web-scale Multimedia Search for Internet Video Content

http://www.lujiang.info/resources/Thesis.pdf

Video advertisements

Automatic understanding of image and video advertisements

http://openaccess.thecvf.com/content_cvpr_2017/papers/Hussain_Automatic_Understanding_of_CVPR_2017_paper.pdf

http://people.cs.pitt.edu/~kovashka/ads/

Multimodal Representation of Advertisements Using Segment-level Autoencoders

https://sail.usc.edu/publications/files/p418-somandepalli.pdf

https://github.com/usc-sail/mica-multimodal-ads

Story Understanding in Video Advertisements

http://people.cs.pitt.edu/~kovashka/ye_buettner_kovashka_bmvc2018.pdf

https://github.com/yekeren/Story-Video_ads_understanding

ADVISE: Symbolism and External Knowledge for Decoding Advertisements

http://people.cs.pitt.edu/~kovashka/ye_kovashka_advise_eccv2018.pdf

https://github.com/yekeren/ADVISE

Commonsense reasoning

From Recognition to Cognition: Visual Commonsense Reasoning

https://arxiv.org/pdf/1811.10830.pdf

https://visualcommonsense.com/
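VCR reports three accuracies:
- Q→A: choose the answer.
- QA→R: choose the rationale, given the correct answer.
- Q→AR: an item counts only if both the answer and the rationale are correct.

With four choices each, chance is 25%, 25%, and 6.25%. Below is a minimal sketch of the scoring. It assumes each prediction record already stores predicted and gold indices; the `VCRPrediction` record is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VCRPrediction:
    answer_pred: int     # index of the chosen answer (4 choices)
    answer_gold: int
    rationale_pred: int  # index of the chosen rationale, picked given the gold answer
    rationale_gold: int


def vcr_accuracies(preds: List[VCRPrediction]) -> Dict[str, float]:
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions")
    answer_ok = [p.answer_pred == p.answer_gold for p in preds]
    rationale_ok = [p.rationale_pred == p.rationale_gold for p in preds]
    return {
        "Q->A": sum(answer_ok) / n,
        "QA->R": sum(rationale_ok) / n,
        # An item counts for Q->AR only if both steps are correct.
        "Q->AR": sum(a and r for a, r in zip(answer_ok, rationale_ok)) / n,
    }


preds = [
    VCRPrediction(0, 0, 2, 2),  # both correct
    VCRPrediction(1, 1, 0, 3),  # answer right, rationale wrong
    VCRPrediction(3, 2, 1, 1),  # answer wrong, rationale right
    VCRPrediction(2, 0, 0, 1),  # both wrong
]
print(vcr_accuracies(preds))  # {'Q->A': 0.5, 'QA->R': 0.5, 'Q->AR': 0.25}
```

The joint Q→AR metric is the strictest of the three, because a model cannot score on an item by guessing the answer without also justifying it.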

Video highlight prediction

Video highlight prediction using audience chat reactions

Object tracking

PySOT: SenseTime's research platform for single object tracking, implementing algorithms such as SiamRPN and SiamMask

https://github.com/STVIR/pysot
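Siamese trackers like those in PySOT compare a template (the target in the first frame) against a larger search region by cross-correlation. The peak of the resulting response map gives the target's new position. SiamRPN adds region-proposal branches on top of this idea, and SiamMask adds a segmentation branch. For brevity, the sketch below shows only the cross-correlation step and runs it on raw 2D grids. Real trackers instead correlate CNN feature maps produced by a shared backbone.

```python
from typing import List, Tuple

Grid = List[List[float]]


def cross_correlate(search: Grid, template: Grid) -> Grid:
    """Slide the template over the search region ("valid" offsets only)
    and record the inner product at each offset."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    return [
        [
            sum(search[y + dy][x + dx] * template[dy][dx]
                for dy in range(th) for dx in range(tw))
            for x in range(sw - tw + 1)
        ]
        for y in range(sh - th + 1)
    ]


def locate(search: Grid, template: Grid) -> Tuple[int, int]:
    """(row, col) of the peak of the response map."""
    response = cross_correlate(search, template)
    positions = [(y, x) for y in range(len(response)) for x in range(len(response[0]))]
    return max(positions, key=lambda p: response[p[0]][p[1]])


template = [[1.0, 2.0], [3.0, 4.0]]
search = [[0.0] * 5 for _ in range(5)]
for dy in range(2):
    for dx in range(2):
        search[2 + dy][1 + dx] = template[dy][dx]  # paste the target at (2, 1)
print(locate(search, template))  # (2, 1)
```

Computing template features once and reusing them on every frame is what makes this family of trackers fast. Only the search-region features change from frame to frame.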

Audio-visual scene-aware dialog (AVSD)

https://github.com/batra-mlp-lab/avsd

