1. Disentangled Multimodal Representation Learning for Recommendation

Title: Disentangled Multimodal Representation Learning for Recommendation

Published: 2022-03-10

Url: http://arxiv.org/abs/2203.05406v1

Authors: Fan Liu, Zhiyong Cheng, Huilin Chen, Anan Liu, Liqiang Nie, Mohan Kankanhalli


Many multimodal recommender systems have been proposed to exploit the rich side information associated with users or items (e.g., user reviews and item images) for learning better user and item representations to enhance the recommendation performance. Studies in psychology show that users have individual differences in the utilization of different modalities for organizing information. Therefore, for a certain factor of an item (such as appearance or quality), the features of different modalities are of different importance to a user. However, existing methods ignore the fact that different modalities contribute differently to a user's preferences on various factors of an item. In light of this, in this paper, we propose a novel Disentangled Multimodal Representation Learning (DMRL) recommendation model, which can capture users' attention to different modalities on each factor in user preference modeling. In particular, we adopt a disentangled representation technique to ensure the features of different factors in each modality are independent to each other. A multimodal attention mechanism is then designed to capture user's modality preference for each factor. Based on the estimated weights obtained by the attention mechanism, we make recommendation by combining the preference scores of a user's preferences to each factor of the target item over different modalities. Extensive evaluations on five real-world datasets demonstrate the superiority of our method compared with existing methods.
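The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea it describes: per-factor, per-modality preference scores combined with attention weights. The softmax normalization over modalities and the names `combine_scores`, `factor_scores`, and `attention_logits` are assumptions for illustration, not the DMRL authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_scores(factor_scores, attention_logits):
    """Combine preference scores over factors and modalities.

    factor_scores[k][m]    : score of factor k in modality m (hypothetical input)
    attention_logits[k][m] : unnormalized attention of the user to modality m
                             for factor k (hypothetical input)
    Assumption: weights are normalized per factor with a softmax over modalities.
    """
    total = 0.0
    for scores, logits in zip(factor_scores, attention_logits):
        weights = softmax(logits)
        total += sum(w * s for w, s in zip(weights, scores))
    return total


# Two factors (e.g., appearance, quality) and two modalities (e.g., text, image).
scores = [[1.0, 3.0], [2.0, 2.0]]
uniform_attention = [[0.0, 0.0], [0.0, 0.0]]
overall = combine_scores(scores, uniform_attention)
```

With uniform attention each factor contributes the mean of its modality scores; a user who attends more to one modality for a given factor shifts that factor's contribution toward that modality's score.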

2. A Review on Methods and Applications in Multimodal Deep Learning

Title: A Review on Methods and Applications in Multimodal Deep Learning

Published: 2022-02-18

Url: http://arxiv.org/abs/2202.09195v1

Authors: Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul


Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, main issues are highlighted separately for each domain, along with their possible future research directions.

3. GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation

Title: GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation

Published: 2022-03-04

Url: http://arxiv.org/abs/2203.02177v1

Authors: Zheng Lian, Lan Chen, Licai Sun, Bin Liu, Jianhua Tao

Conversations have become a critical data format on social media platforms. Understanding conversation from emotion, content, and other aspects also attracts increasing attention from researchers due to its widespread application in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue of conversation understanding. To address this problem, researchers propose various methods. However, existing approaches are mainly designed for individual utterances or medical images rather than conversational data, which cannot exploit temporal and speaker information in conversations. To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap of existing works. Our GCNet contains two well-designed graph neural network-based modules, "Speaker GNN" and "Temporal GNN", to capture temporal and speaker information in conversations. To make full use of complete and incomplete data in feature learning, we jointly optimize classification and reconstruction in an end-to-end manner. To verify the effectiveness of our method, we conduct experiments on three benchmark conversational datasets. Experimental results demonstrate that our GCNet is superior to existing state-of-the-art approaches in incomplete multimodal learning.

4. Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Title: Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Published: 2022-03-01

Url: http://arxiv.org/abs/2203.00187v1

Authors: Angus Fung, Beno Benhabib, Goldie Nejat

Robotic detection of people in crowded and/or cluttered human-centered environments including hospitals, long-term care, stores and airports is challenging as people can become occluded by other people or objects, and deform due to variations in clothing or pose. There can also be loss of discriminative visual features due to poor lighting. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under intraclass variations. We present a two-stage training approach using 1) a unique pretraining method we define as Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal Faster R-CNN (MFRCNN) detector. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences, in addition to synthetic data augmentation, and contrasts crossmodal features to transfer invariances between different modalities. These pretrained features are used by the MFRCNN detector for finetuning and person detection from RGB-D images. Extensive experiments validate the performance of our DL architecture in both human-centered crowded and cluttered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in terms of detection accuracy in detecting people with body occlusions and pose deformations in different lighting conditions.

5. Multimodal Federated Learning on IoT Data

Title: Multimodal Federated Learning on IoT Data

Published: 2022-02-18

Url: http://arxiv.org/abs/2109.04833v2

Authors: Yuchen Zhao, Payam Barnaghi, Hamed Haddadi


Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with Internet-of-Things (IoT) devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its classification performance. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent F1 scores (e.g., with the best performance being higher than 60%), especially when combining contributions from both unimodal clients and multimodal clients.
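The abstract names a multimodal FedAvg algorithm without specifying it. As a rough sketch only, the code below shows sample-weighted parameter averaging in which each parameter group is averaged over just the clients that hold it. That handling of unimodal versus multimodal clients, along with the names `fedavg`, `audio_enc`, `video_enc`, and `shared`, is an assumption and may differ from the paper's algorithm.

```python
def fedavg(client_updates):
    """Sample-weighted average of client parameters.

    client_updates: list of (num_samples, params), where params maps a
    parameter-group name to a list of floats.
    Assumption: a client only reports the groups for the modalities it has,
    and each group is averaged over the clients that reported it.
    """
    weighted_sums = {}
    sample_counts = {}
    for num_samples, params in client_updates:
        for name, values in params.items():
            if name not in weighted_sums:
                weighted_sums[name] = [0.0] * len(values)
                sample_counts[name] = 0
            weighted_sums[name] = [
                acc + num_samples * v
                for acc, v in zip(weighted_sums[name], values)
            ]
            sample_counts[name] += num_samples
    return {
        name: [acc / sample_counts[name] for acc in vec]
        for name, vec in weighted_sums.items()
    }


# One audio-only client and one video-only client that both hold a shared layer.
updates = [
    (10, {"audio_enc": [1.0, 2.0], "shared": [0.0]}),
    (30, {"video_enc": [4.0], "shared": [4.0]}),
]
global_params = fedavg(updates)
```

In this toy example the shared layer is pulled toward the client with more samples, while each modality-specific encoder is taken from the only client that trained it.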

6. Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models

Title: Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models

Published: 2022-02-16

Url: http://arxiv.org/abs/2202.08974v1

Authors: Sarala Padi, Seyed Omid Sadjadi, Dinesh Manocha, Ram D. Sriram


Automatic emotion recognition plays a key role in computer-human interaction as it has the potential to enrich the next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task using transfer learning along with a spectrogram augmentation approach to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from the text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve the emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
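Late fusion, as mentioned in the abstract, combines the outputs of separately trained models rather than their internal features. The sketch below illustrates one common variant, a weighted average of per-class scores followed by an argmax. The abstract does not state the fusion rule actually used, so the weighting scheme, the function name `late_fusion`, and the example scores are assumptions.

```python
def late_fusion(speech_scores, text_scores, speech_weight=0.5):
    """Fuse per-class scores from two models with a weighted average.

    speech_scores, text_scores: dicts mapping emotion label -> score
    (hypothetical outputs of a speech model and a text model).
    Returns the predicted label and the fused scores.
    """
    fused = {
        label: speech_weight * speech_scores[label]
        + (1.0 - speech_weight) * text_scores[label]
        for label in speech_scores
    }
    best_label = max(fused, key=fused.get)
    return best_label, fused


# The speech model leans "angry", the text model leans "happy" more strongly.
speech = {"angry": 0.6, "happy": 0.4}
text = {"angry": 0.2, "happy": 0.8}
label, fused = late_fusion(speech, text)
```

Because the two models are fused only at the score level, either one can be retrained or replaced without touching the other; `speech_weight` would typically be tuned on held-out data.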

7. CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

Title: CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

Published: 2022-02-15

Url: http://arxiv.org/abs/2202.07247v1

Authors: Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, Ning Zhang


We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated to the given piece of content (image, text, image+text), and having the capability to generalize to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc. We follow the pre-training + fine-tuning training regime and present 5 effective pre-training tasks on image-text pairs. To embrace more common and diverse commerce data with text-to-multimodal, image-to-multimodal, and multimodal-to-multimodal mapping, we propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training. The pre-training is conducted in an efficient manner with only two forward/backward updates for the combined 14 tasks. Extensive experiments and analysis show the effectiveness of each task. When combining all pre-training tasks, our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning. Additionally, we propose a novel approach of modality randomization to dynamically adjust our model under different efficiency constraints.

8. Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning

Title: Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning

Published: 2022-02-08

Url: http://arxiv.org/abs/2202.03775v1

Authors: Hendric Voß, Heiko Wersing, Stefan Kopp


Detecting mental states of human users is crucial for the development of cooperative and intelligent robots, as it enables the robot to understand the user's intentions and desires. Despite their importance, it is difficult to obtain a large amount of high quality data for training automatic recognition algorithms as the time and effort required to collect and label such data is prohibitively high. In this paper we present a multimodal machine learning approach for detecting dis-/agreement and confusion states in a human-robot interaction environment, using just a small amount of manually annotated data. We collect a data set by conducting a human-robot interaction study and develop a novel preprocessing pipeline for our machine learning approach. By combining semi-supervised and supervised architectures, we are able to achieve an average F1-score of 81.1% for dis-/agreement detection with a small amount of labeled data and a large unlabeled data set, while simultaneously increasing the robustness of the model compared to the supervised approach.

9. GMC -- Geometric Multimodal Contrastive Representation Learning

Title: GMC -- Geometric Multimodal Contrastive Representation Learning

Published: 2022-02-08

Url: http://arxiv.org/abs/2202.03390v2

Authors: Petra Poklukar, Miguel Vasco, Hang Yin, Francisco S. Melo, Ana Paiva, Danica Kragic


Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels. To address it, we present a novel Geometric Multimodal Contrastive (GMC) representation learning method comprised of two main components: i) a two-level architecture consisting of modality-specific base encoder, allowing to process an arbitrary number of modalities to an intermediate representation of fixed dimensionality, and a shared projection head, mapping the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations. We experimentally demonstrate that GMC representations are semantically rich and achieve state-of-the-art performance with missing modality information on three different learning problems including prediction and reinforcement learning tasks.
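To make the notion of a contrastive loss that encourages alignment more concrete, here is a minimal sketch of a generic InfoNCE-style loss over cosine similarities. It is not the GMC loss itself, whose exact form the abstract does not give. The pairing of modality-specific representations (`anchors`) with representations of the same samples from another view (`positives`), the temperature value, and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss.

    anchors[i] and positives[i] represent the same sample (a matched pair);
    positives[j] for j != i act as negatives for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Matched pairs that point the same way give a small loss;
# swapping the pairs (misalignment) gives a large one.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(view_a, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(view_a, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls representations of the same sample together and pushes representations of different samples apart, which is the sense in which a contrastive objective aligns the representation space.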

10. Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning

Title: Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning

Published: 2022-02-02

Url: http://arxiv.org/abs/2112.03763v2

Authors: DeepMind Interactive Agents Team, Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Felix Fischer, Petko Georgiev, Alex Goldin, Mansi Gupta, Tim Harley, Felix Hill, Peter C Humphreys, Alden Hung, Jessica Landon, Timothy Lillicrap, Hamza Merzic, Alistair Muldal, Adam Santoro, Guy Scully, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan, Rui Zhu


A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. We show that imitation learning of human-human interactions in a simulated world, in conjunction with self-supervised learning, is sufficient to produce a multimodal interactive agent, which we call MIA, that successfully interacts with non-adversarial humans 75% of the time. We further identify architectural and algorithmic techniques that improve performance, such as hierarchical action selection. Altogether, our results demonstrate that imitation of multi-modal, real-time human behaviour may provide a straightforward and surprisingly effective means of imbuing agents with a rich behavioural prior from which agents might then be fine-tuned for specific purposes, thus laying a foundation for training capable agents for interactive robots or digital assistants. A video of MIA's behaviour may be found at https://youtu.be/ZFgRhviF7mY



