Multimodal Machine Learning - Zhuanzhi Topic

Multimodal Machine Learning

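Our experience of the world is multimodal: we see objects, hear sounds, feel textures, smell odors, and taste flavors. A modality is the way in which something happens or is experienced, and a research problem is called multimodal when it involves more than one modality. For artificial intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant, multidisciplinary field of increasing importance and extraordinary potential.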
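
As a rough illustration of what "relating" information across modalities can mean, the sketch below scores how well an image embedding matches candidate caption embeddings by cosine similarity in a shared vector space. The vectors, their dimensionality, and the function names are assumptions made for illustration; they stand in for encoder outputs and are not taken from any paper listed on this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)


def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(
        caption_vecs,
        key=lambda text: cosine_similarity(image_vec, caption_vecs[text]),
    )


# Hypothetical hand-written embeddings; real ones would come from
# trained image and text encoders that share an output space.
image_of_dog = [0.9, 0.1, 0.0]
captions = {
    "a dog on grass": [0.8, 0.2, 0.1],
    "a plate of pasta": [0.0, 0.1, 0.9],
}

print(best_caption(image_of_dog, captions))  # a dog on grass
```

The retrieval step is only as good as the shared space it runs in; learning that space is the subject of the representation learning and alignment entries below.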

Knowledge Collection

Multimodal Machine Learning: Zhuanzhi Collection

Surveys

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, arXiv 2019
Deep Multimodal Representation Learning: A Survey, arXiv 2019
Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018
Guest Editorial: Image and Language Understanding, IJCV 2017
Representation Learning: A Review and New Perspectives, TPAMI 2013

Models and Algorithms

Representation Learning

Visual Concept-Metaconcept Learning, NeurIPS 2019 [code]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, arXiv 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019 [code]
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]
OmniNet: A Unified Architecture for Multi-modal Multi-task Learning, arXiv 2019 [code]
Learning Representations by Maximizing Mutual Information Across Views, arXiv 2019 [code]
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]
ViCo: Word Embeddings from Visual Co-occurrences, ICCV 2019
M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019
VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations, CVPR 2019
Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019
Learning Factorized Multimodal Representations, ICLR 2019 [code]
A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks, ICML 2018
Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018
Learning Robust Visual-Semantic Embeddings, ICCV 2017
Deep Multimodal Representation Learning from Temporal Data, CVPR 2017
Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations, COLING 2016
Combining Language and Vision with a Multimodal Skip-gram Model, NAACL 2015
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NeurIPS 2014
Multimodal Learning with Deep Boltzmann Machines, JMLR 2014
Learning Grounded Meaning Representations with Autoencoders, ACL 2014
DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013
Multimodal Deep Learning, ICML 2011

Multimodal Fusion

Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019
MFAS: Multimodal Fusion Architecture Search, CVPR 2019
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]
Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018 [code]
Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 [code]
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework, AAAI 2015
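
The entries in this group are about combining per-modality features into one joint representation. As a point of reference, the sketch below shows two simple baselines: plain concatenation, and an outer-product fusion in which each vector is first extended with a constant 1 so that the result keeps the unimodal terms alongside every pairwise product. The outer-product variant is written in the spirit of tensor-based fusion and is not the code of any listed paper; the feature values and function names are hypothetical.

```python
def concat_fusion(text_feat, audio_feat):
    """Baseline: join the two feature vectors end to end."""
    return list(text_feat) + list(audio_feat)


def outer_product_fusion(text_feat, audio_feat):
    """Flattened outer product of the two vectors, each extended with 1.0.

    The appended 1.0 means the output contains the original text features,
    the original audio features, and all text-audio products.
    """
    t = list(text_feat) + [1.0]
    a = list(audio_feat) + [1.0]
    return [ti * aj for ti in t for aj in a]


# Hypothetical features standing in for the outputs of two unimodal encoders.
text_feat = [0.5, -1.0]
audio_feat = [2.0, 0.0, 1.0]

fused = outer_product_fusion(text_feat, audio_feat)
print(len(concat_fusion(text_feat, audio_feat)))  # 5
print(len(fused))  # 12, i.e. (2 + 1) * (3 + 1)
```

Concatenation grows linearly with the input sizes, while the outer product grows multiplicatively, which is one reason low-rank approximations of this kind of fusion are attractive.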

Multimodal Alignment

Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019 [code]
Temporal Cycle-Consistency Learning, CVPR 2019
See, Hear, and Read: Deep Aligned Representations, arXiv 2017
On Deep Multi-View Representation Learning, ICML 2015
Unsupervised Alignment of Natural Language Instructions with Video Segments, AAAI 2014
Multimodal Alignment of Videos, MM 2014
Deep Canonical Correlation Analysis, ICML 2013 [code]

Multimodal Translation

Language2Pose: Natural Language Grounded Pose Forecasting, 3DV 2019 [code]
Reconstructing Faces from Voices, NeurIPS 2019
Speech2Face: Learning the Face Behind a Voice, CVPR 2019 [code]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, ICASSP 2018 [code]

Missing or Imperfect Modalities

Knowledge Graphs and Knowledge Bases

MMKG: Multi-Modal Knowledge Graphs, ESWC 2019
Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs, AKBC 2019
Embedding Multimodal Relational Data for Knowledge Base Completion, EMNLP 2018
A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning, *SEM 2018 [code]
Order-Embeddings of Images and Language, ICLR 2016 [code]
Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries, arXiv 2015

Interpretable Learning

Multimodal Explanations by Predicting Counterfactuality in Videos, CVPR 2019
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence, CVPR 2018 [code]
Do Explanations make VQA Models more Predictable to a Human?, EMNLP 2018
Towards Transparent AI Systems: Interpreting Visual Question Answering Models, ICML Workshop on Visualization for Deep Learning 2016

Generative Learning

Few-shot Video-to-Video Synthesis, NeurIPS 2019 [code]
Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018
The Multi-Entity Variational Autoencoder, NeurIPS 2017

Semi-supervised Learning

Semi-supervised Vision-language Mapping via Variational Learning, ICRA 2017
Semi-supervised Multimodal Hashing, arXiv 2017
Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition, IJCAI 2016
Multimodal Semi-supervised Learning for Image Classification, CVPR 2010

Self-supervised Learning

Language Models

Neural Language Modeling with Visual Features, arXiv 2019
Learning Multi-Modal Word Representation Grounded in Visual Context, AAAI 2018
Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes, CVPR 2016
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, ICML 2014 [code]

Adversarial Attacks

Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models, NeurIPS Workshop on Visually Grounded Interaction and Language 2018
Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning, ACL 2018 [code]
Fooling Vision and Language Models Despite Localization and Attention Mechanism, CVPR 2018

Few-Shot Learning

Zero-Shot Learning - The Good, the Bad and the Ugly, CVPR 2017
Zero-Shot Learning Through Cross-Modal Transfer, NeurIPS 2013

Applications

Language and Visual QA

Interactive Language Learning by Question Answering, EMNLP 2019 [code]
Fusion of Detected Objects in Text for Visual Question Answering, arXiv 2019
RUBi: Reducing Unimodal Biases in Visual Question Answering, NeurIPS 2019 [code]
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, CVPR 2019 [code]
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, CVPR 2019 [code]
MUREL: Multimodal Relational Reasoning for Visual Question Answering, CVPR 2019 [code]
Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence, CVPR 2019 [code]
Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering, ICML 2019 [code]
Learning to Count Objects in Natural Images for Visual Question Answering, ICLR 2018 [code]
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization, NeurIPS 2018
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, NeurIPS 2018 [code]
RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes, EMNLP 2018 [code]
TVQA: Localized, Compositional Video Question Answering, EMNLP 2018 [code]
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR 2018 [code]
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, CVPR 2018 [code]
Stacked Latent Attention for Multimodal Reasoning, CVPR 2018
Learning to Reason: End-to-End Module Networks for Visual Question Answering, ICCV 2017 [code]
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017 [code] [dataset generation]
Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension, CVPR 2017 [code]
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016 [code]
MovieQA: Understanding Stories in Movies through Question-Answering, CVPR 2016 [code]
VQA: Visual Question Answering, ICCV 2015 [code]

Language Grounding in Vision

Grounded Video Description, CVPR 2019
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions, CVPR 2019
Multilevel Language and Vision Integration for Text-to-Clip Retrieval, AAAI 2019 [code]
Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding, arXiv 2019 [code]
Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos, CVPR 2018
SCAN: Learning Hierarchical Compositional Visual Concepts, ICLR 2018
Visual Coreference Resolution in Visual Dialog using Neural Module Networks, ECCV 2018 [code]
Gated-Attention Architectures for Task-Oriented Language Grounding, AAAI 2018
Using Syntax to Ground Referring Expressions in Natural Images, AAAI 2018 [code]
Grounding language acquisition by training semantic parsers using captioned videos, ACL 2018
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts, NeurIPS 2017
Localizing Moments in Video with Natural Language, ICCV 2017
What are you talking about? Text-to-Image Coreference, CVPR 2014
Grounded Language Learning from Video Described with Sentences, ACL 2013
Grounded Compositional Semantics for Finding and Describing Images with Sentences, TACL 2013

Language, Vision and Navigation

Vision-and-Dialog Navigation, arXiv 2019 [code]
Hierarchical Decision Making by Generating and Following Natural Language Instructions, arXiv 2019 [code]
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation, ACL 2019
Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation, ACL 2019
Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, CVPR 2019 [code]
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation, CVPR 2019
The Regretful Navigation Agent for Vision-and-Language Navigation, CVPR 2019 [code]
Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation, CVPR 2019 [code]
Multi-modal Discriminative Model for Vision-and-Language Navigation, NAACL SpLU-RoboNLP Workshop 2019
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation, ICLR 2019 [code]
From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following, ICLR 2019
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos, AAAI 2019
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout, NAACL 2019 [code]
Attention Based Natural Language Grounding by Navigating Virtual Environment, IEEE WACV 2019
Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, EMNLP 2018 [code]
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, CVPR 2018 [code]
Embodied Question Answering, CVPR 2018 [code]
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation, ECCV 2018

Multimodal Machine Translation

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research, ICCV 2019 [code]
Latent Variable Model for Multi-modal Translation, ACL 2019
Distilling Translations with Visual Awareness, ACL 2019
Probing the Need for Visual Context in Multimodal Machine Translation, NAACL 2019
Emergent Translation in Multi-Agent Communication, ICLR 2018
Zero-Resource Neural Machine Translation with Multi-Agent Communication Game, AAAI 2018
Learning Translations via Images with a Massively Multilingual Image Dataset, ACL 2018
A Visual Attention Grounding Neural Model for Multimodal Machine Translation, EMNLP 2018
Adversarial Evaluation of Multimodal Machine Translation, EMNLP 2018
Doubly-Attentive Decoder for Multi-modal Neural Machine Translation, ACL 2017
An empirical study on the effectiveness of images in Multimodal Neural Machine Translation, EMNLP 2017
Incorporating Global Visual Features into Attention-based Neural Machine Translation, EMNLP 2017
Multimodal Pivots for Image Caption Translation, ACL 2016
Multi30K: Multilingual English-German Image Descriptions, ACL Workshop on Language and Vision 2016
Does Multimodality Help Human and Machine for Translation and Image Captioning?, ACL WMT 2016

Multi-agent Communication

Emergence of Compositional Language with Deep Generational Transmission, ICML 2019
On the Pitfalls of Measuring Emergent Communication, AAMAS 2019 [code]
Emergent Translation in Multi-Agent Communication, ICLR 2018 [code]
Emergent Communication in a Multi-Modal, Multi-Step Referential Game, ICLR 2018 [code]
Emergence of Linguistic Communication From Referential Games with Symbolic and Pixel Input, ICLR 2018
Emergent Communication through Negotiation, ICLR 2018 [code]
Emergence of Grounded Compositional Language in Multi-Agent Populations, AAAI 2018
Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols, NeurIPS 2017
Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog, EMNLP 2017 [code1] [code2]
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning, ICCV 2017 [code]
Multi-agent Cooperation and the Emergence of (natural) Language, ICLR 2017
Learning to Communicate with Deep Multi-agent Reinforcement Learning, NeurIPS 2016
Learning multiagent communication with backpropagation, NeurIPS 2016
The Emergence of Compositional Structures in Perceptually Grounded Language Games, AI 2005

Commonsense Reasoning

Heterogeneous Graph Learning for Visual Commonsense Reasoning, NeurIPS 2019
SocialIQA: Commonsense Reasoning about Social Interactions, arXiv 2019
From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019 [code]
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, NAACL 2019

Multimodal Reinforcement Learning

Language as an Abstraction for Hierarchical Deep Reinforcement Learning, NeurIPS 2019
Hierarchical Decision Making by Generating and Following Natural Language Instructions, NeurIPS 2019 [code]
Habitat: A Platform for Embodied AI Research, ICCV 2019 [code]
Embodied Multimodal Multitask Learning, arXiv 2019
Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog, SIGDIAL 2018
Mapping Instructions and Visual Observations to Actions with Reinforcement Learning, EMNLP 2017
Reinforcement Learning for Mapping Instructions to Actions, ACL 2009

Multimodal Dialog

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, ACL 2019 [code]
CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog, NAACL 2019 [code]
Talk the Walk: Navigating New York City through Grounded Dialogue, arXiv 2018
Dialog-based Interactive Image Retrieval, NeurIPS 2018 [code]
Towards Building Large Scale Multimodal Domain-Aware Conversation Systems, arXiv 2017 [code]
Visual Dialog, CVPR 2017 [code]

Language and Audio

Lattice Transformer for Speech Translation, ACL 2019
Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation, ACL 2019
Audio Caption: Listen and Tell, ICASSP 2019
Audio-Linguistic Embeddings for Spoken Sentences, ICASSP 2019
From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings, arXiv 2019
From Audio to Semantics: Approaches To End-to-end Spoken Language Understanding, arXiv 2018
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, ICASSP 2018 [code]
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning, ICLR 2018
Deep Voice 2: Multi-Speaker Neural Text-to-Speech, NeurIPS 2017
Deep Voice: Real-time Neural Text-to-Speech, ICML 2017
Text-to-Speech Synthesis, 2009

Audio and Video

Learning Individual Styles of Conversational Gesture, CVPR 2019 [code]
Capture, Learning, and Synthesis of 3D Speaking Styles, CVPR 2019 [code]
Disjoint Mapping Network for Cross-modal Matching of Voices and Faces, ICLR 2019
Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks, ICASSP 2019 [code]
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input, ECCV 2018 [code]
Seeing Voices and Hearing Faces: Cross-modal Biometric Matching, CVPR 2018 [code]
Learning to Separate Object Sounds by Watching Unlabeled Video, CVPR 2018
Deep Audio-Visual Speech Recognition, IEEE TPAMI 2018
Look, Listen and Learn, ICCV 2017
Unsupervised Learning of Spoken Language with Visual Context, NeurIPS 2016
SoundNet: Learning Sound Representations from Unlabeled Video, NeurIPS 2016 [code]

Media Description

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings, ICCV 2019
Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph, CVPR 2019 [code]
Joint Event Detection and Description in Continuous Video Streams, WACVW 2019
Learning to Compose and Reason with Language Tree Structures for Visual Grounding, TPAMI 2019
Neural Baby Talk, CVPR 2018 [code]
Grounding Referring Expressions in Images by Variational Context, CVPR 2018
Video Captioning via Hierarchical Reinforcement Learning, CVPR 2018
Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos, CVPR 2018 [code]
Neural Motifs: Scene Graph Parsing with Global Context, CVPR 2018 [code]
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling, ACL 2018
Generating Descriptions with Grounded and Co-Referenced People, CVPR 2017
DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CVPR 2016
Review Networks for Caption Generation, NeurIPS 2016 [code]
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV 2016 [code]
Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, TPAMI 2016 [code]
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 [code]
Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015 [code]
Show and Tell: A Neural Image Caption Generator, CVPR 2015 [code]
A Dataset for Movie Description, CVPR 2015 [code]
What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision, NAACL 2015 [code]
Microsoft COCO: Common Objects in Context, ECCV 2014 [code]

Video Generation from Text

Image Generation from Scene Graphs, CVPR 2018
Learning to Color from Language, NAACL 2018
Generative Adversarial Text to Image Synthesis, ICML 2016

Affect Recognition and Multimodal Language

Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper), ACL 2019 [code]
Multimodal Language Analysis with Recurrent Multistage Fusion, EMNLP 2018
Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, ACL 2018 [code]
Multi-attention Recurrent Network for Human Communication Comprehension, AAAI 2018 [code]
AMHUSE - A Multimodal dataset for HUmor SEnsing, ICMI 2017 [code]
Decoding Children’s Social Behavior, CVPR 2013 [code]
Collecting Large, Richly Annotated Facial-Expression Databases from Movies, IEEE Multimedia 2012 [code]
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database, 2008 [code]

Healthcare

Robotics

See, Feel, Act: Hierarchical Learning for Complex Manipulation Skills with Multi-sensory Fusion, Science Robotics 2019
Early Fusion for Goal Directed Robotic Vision, IROS 2019
Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup, RSS 2019
Probabilistic Multimodal Modeling for Human-Robot Interaction Tasks, RSS 2019
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks, ICRA 2019
Evolving Multimodal Robot Behavior via Many Stepping Stones with the Combinatorial Multi-Objective Evolutionary Algorithm, arXiv 2018
Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction, arXiv 2017
Perching and Vertical Climbing: Design of a Multimodal Robot, ICRA 2014
Multi-Modal Scene Understanding for Robotic Grasping, 2011
Strategies for Multi-Modal Scene Exploration, IROS 2010

Workshops

Visually Grounded Interaction and Language, NeurIPS 2019, NeurIPS 2018
Emergent Communication: Towards Natural Language, NeurIPS 2019
Workshop on Multimodal Understanding and Learning for Embodied Applications, ACM Multimedia 2019
Beyond Vision and Language: Integrating Real-World Knowledge, EMNLP 2019
The How2 Challenge: New Tasks for Vision & Language, ICML 2019
Visual Question Answering and Dialog, CVPR 2019, CVPR 2017
Multi-modal Learning from Videos, CVPR 2019
Multimodal Learning and Applications Workshop, CVPR 2019, ECCV 2018
Habitat: Embodied Agents Challenge and Workshop, CVPR 2019
Closing the Loop Between Vision and Language & LSMD Challenge, ICCV 2019
Multi-modal Video Analysis and Moments in Time Challenge, ICCV 2019
Cross-Modal Learning in Real World, ICCV 2019
Spatial Language Understanding and Grounded Communication for Robotics, NAACL 2019
YouTube-8M Large-Scale Video Understanding, ICCV 2019, ECCV 2018, CVPR 2017
Language and Vision Workshop, CVPR 2019, CVPR 2018, CVPR 2017, CVPR 2015
Sight and Sound, CVPR 2019, CVPR 2018
The Large Scale Movie Description Challenge (LSMDC), ICCV 2019, ICCV 2017
Wordplay: Reinforcement and Language Learning in Text-based Games, NeurIPS 2018
Interpretability and Robustness in Audio, Speech, and Language, NeurIPS 2018
Multimodal Robot Perception, ICRA 2018
WMT18: Shared Task on Multimodal Machine Translation, EMNLP 2018
Shortcomings in Vision and Language, ECCV 2018
Grand Challenge and Workshop on Human Multimodal Language, ACL 2018
Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, EMNLP 2018, EMNLP 2017, NAACL-HLT 2016, EMNLP 2015, ACL 2014, NAACL-HLT 2013
Visual Understanding Across Modalities, CVPR 2017
International Workshop on Computer Vision for Audio-Visual Media, ICCV 2017
Language Grounding for Robotics, ACL 2017
Computer Vision for Audio-visual Media, ECCV 2016
Language and Vision, ACL 2016, EMNLP 2015

Tutorials

Connecting Language and Vision to Actions, ACL 2018
Machine Learning for Clinicians: Advances for Multi-Modal Health Data, MLHC 2018
Multimodal Machine Learning, ACL 2017, CVPR 2016, ICMI 2016
Vision and Language: Bridging Vision and Language with Deep Learning, ICIP 2017

Video Tutorials

Featured Content

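[CMU PhD Thesis] Analyzing Multimodal Machine Learning Model Performance and Evaluation Metrics in Medical Report Generation
Zhuanzhi Member Service · 22+ reads · December 16, 2024

[ICML 2023] CMU "Multimodal Machine Learning" Tutorial: 120+ Pages on the Latest Advances in Multimodal Learning
Zhuanzhi Member Service · 99+ reads · July 26, 2023

Class Is in Session! CMU "Multimodal Machine Learning" 2023 Course, with Slides
Zhuanzhi Member Service · 74+ reads · February 12, 2023

How Is Deep Learning Applied to Proteins? Microsoft's Latest Talk "Multimodal Deep Learning for Protein Engineering", with 300 Slides and Video
Zhuanzhi Member Service · 26+ reads · October 12, 2022

How to Learn Multimodal ML Comprehensively? CMU's Latest Survey "Foundations and Recent Trends in Multimodal Machine Learning": a 65-Page PDF on MML Principles, Challenges, and Open Questions, with Fall Course Materials
Zhuanzhi Member Service · 119+ reads · October 11, 2022

How to Learn from Multimodal Data? UIC's Latest Survey "Vision + X: Multimodal Learning from a Data Perspective": a 21-Page PDF Covering 269 References on Progress in Multimodal Machine Learning
Zhuanzhi Member Service · 71+ reads · October 9, 2022

A Survey of Vision-Language Multimodal Pre-training
Zhuanzhi Member Service · 122+ reads · July 11, 2022

[CVPR 2022] CMU "Multimodal Machine Learning" Tutorial: 200+ Pages of Systematic Knowledge on Multimodal Learning Across Six Challenges (Representation, Alignment, Reasoning, Transference, Generation, and Quantification)
Zhuanzhi Member Service · 225+ reads · June 21, 2022

[AI and Medicine] Multimodal Machine Learning for Precision Health
Zhuanzhi Member Service · 82+ reads · April 25, 2022

[Paul Liang] Multimodal Deep Learning
Zhuanzhi Member Service · 185+ reads · April 12, 2022

Class Is in Session! CMU "Multimodal Machine Learning" 2022 Course, with Slides and Videos
Zhuanzhi Member Service · 155+ reads · February 1, 2022

[Hardcore Course] CMU "Multimodal Machine Learning" 2020 Course, with Slides and Videos
Zhuanzhi Member Service · 138+ reads · September 3, 2020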
