A curated collection of recent frontier research on multimodal recommendation.
Title: Disentangled Multimodal Representation Learning for Recommendation
Published: 2022-03-10
Url: http://arxiv.org/abs/2203.05406v1
Authors: Fan Liu,Zhiyong Cheng,Huilin Chen,Anan Liu,Liqiang Nie,Mohan Kankanhalli
Many multimodal recommender systems have been proposed to exploit the rich side information associated with users or items (e.g., user reviews and item images) for learning better user and item representations to enhance recommendation performance. Studies in psychology show that users have individual differences in how they utilize different modalities for organizing information. Therefore, for a certain factor of an item (such as appearance or quality), the features of different modalities are of different importance to a user. However, existing methods ignore the fact that different modalities contribute differently to a user's preferences on the various factors of an item. In light of this, we propose a novel Disentangled Multimodal Representation Learning (DMRL) recommendation model, which can capture users' attention to different modalities on each factor in user preference modeling. In particular, we adopt a disentangled representation technique to ensure that the features of different factors in each modality are independent of each other. A multimodal attention mechanism is then designed to capture a user's modality preference for each factor. Based on the weights estimated by the attention mechanism, we make recommendations by combining the user's preference scores for each factor of the target item across the different modalities. Extensive evaluations on five real-world datasets demonstrate the superiority of our method over existing methods.
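The factor-wise weighting idea can be illustrated with a small PyTorch sketch. The factor count, embedding sizes, the attention MLP, and the dot-product preference score below are illustrative assumptions, not the paper's exact DMRL formulation.

```python
import torch
import torch.nn as nn

class FactorwiseModalityAttention(nn.Module):
    """Toy DMRL-style scorer: attend over modalities separately for each disentangled factor."""
    def __init__(self, dim=16):
        super().__init__()
        # Small attention MLP shared across factors; input = [user factor; item modality factor].
        self.att = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, user_factors, item_modality_factors):
        # user_factors: (B, K, d); item_modality_factors: (B, M, K, d) for M modalities, K factors.
        B, M, K, d = item_modality_factors.shape
        u = user_factors.unsqueeze(1).expand(B, M, K, d)
        logits = self.att(torch.cat([u, item_modality_factors], dim=-1)).squeeze(-1)  # (B, M, K)
        weights = torch.softmax(logits, dim=1)         # attention over modalities, per factor
        scores = (u * item_modality_factors).sum(-1)   # per-modality, per-factor preference score
        return (weights * scores).sum(dim=(1, 2))      # overall user-item preference

# Example: 2 modalities (text, image), 4 factors, 16-dim factor embeddings.
model = FactorwiseModalityAttention(dim=16)
pref = model(torch.randn(8, 4, 16), torch.randn(8, 2, 4, 16))
print(pref.shape)  # torch.Size([8])
```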
Title: A Review on Methods and Applications in Multimodal Deep Learning
Published: 2022-02-18
Url: http://arxiv.org/abs/2202.09195v1
Authors: Jabeen Summaira,Xi Li,Amin Muhammad Shoib,Jabbar Abdul
Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from various modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning helps in better understanding and analysis when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, the main issues are highlighted separately for each domain, along with their possible future research directions.
Title: GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation
Published: 2022-03-04
Url: http://arxiv.org/abs/2203.02177v1
Authors: Zheng Lian,Lan Chen,Licai Sun,Bin Liu,Jianhua Tao
Conversations have become a critical data format on social media platforms. Understanding conversations in terms of emotion, content, and other aspects also attracts increasing attention from researchers due to its widespread application in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue in conversation understanding. To address this problem, researchers have proposed various methods. However, existing approaches are mainly designed for individual utterances or medical images rather than conversational data, and thus cannot exploit temporal and speaker information in conversations. To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap in existing works. Our GCNet contains two well-designed graph neural network-based modules, "Speaker GNN" and "Temporal GNN", to capture temporal and speaker information in conversations. To make full use of complete and incomplete data in feature learning, we jointly optimize classification and reconstruction in an end-to-end manner. To verify the effectiveness of our method, we conduct experiments on three benchmark conversational datasets. Experimental results demonstrate that our GCNet is superior to existing state-of-the-art approaches in incomplete multimodal learning.
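The joint classification-plus-reconstruction objective for incomplete modalities can be sketched as follows; the linear encoders, concatenation fusion, and availability masks are simplified placeholders, not the paper's Speaker GNN and Temporal GNN modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCompleteNet(nn.Module):
    """Simplified stand-in: encode available modalities, reconstruct all modalities
    from the fused representation, and classify from the same fusion."""
    def __init__(self, dims=(100, 50), hidden=64, n_classes=6):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.decoders = nn.ModuleList([nn.Linear(hidden * len(dims), d) for d in dims])
        self.classifier = nn.Linear(hidden * len(dims), n_classes)

    def forward(self, inputs, masks):
        # inputs[m]: (B, d_m); masks[m]: (B, 1), 1 = modality observed, 0 = missing.
        hs = [torch.relu(enc(x * m)) for enc, x, m in zip(self.encoders, inputs, masks)]
        fused = torch.cat(hs, dim=-1)
        recons = [dec(fused) for dec in self.decoders]
        return self.classifier(fused), recons

def joint_loss(logits, labels, recons, inputs, masks, alpha=0.1):
    cls = F.cross_entropy(logits, labels)
    # Reconstruction is supervised only where the ground-truth modality is available.
    rec = sum((m * (r - x) ** 2).mean() for r, x, m in zip(recons, inputs, masks))
    return cls + alpha * rec
```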
Title: Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations
Published: 2022-03-01
Url: http://arxiv.org/abs/2203.00187v1
Authors: Angus Fung,Beno Benhabib,Goldie Nejat
Robotic detection of people in crowded and/or cluttered human-centered environments, including hospitals, long-term care facilities, stores, and airports, is challenging, as people can become occluded by other people or objects and deform due to variations in clothing or pose. There can also be loss of discriminative visual features due to poor lighting. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under intraclass variations. We present a two-stage training approach using 1) a unique pretraining method we define as Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal Faster R-CNN (MFRCNN) detector. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences, in addition to synthetic data augmentation, and contrasts crossmodal features to transfer invariances between different modalities. These pretrained features are used by the MFRCNN detector for finetuning and person detection from RGB-D images. Extensive experiments validate the performance of our deep learning architecture in both human-centered crowded and cluttered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in terms of detection accuracy for people with body occlusions and pose deformations under different lighting conditions.
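The crossmodal contrasting step can be illustrated with a generic in-batch InfoNCE loss between RGB and depth features of the same detection; TimCLR's actual pair construction from natural temporal variations is not reproduced here.

```python
import torch
import torch.nn.functional as F

def crossmodal_info_nce(rgb_feats, depth_feats, temperature=0.1):
    """In-batch contrastive loss: the i-th RGB feature and the i-th depth feature
    (same person/frame) are positives; all other pairings in the batch are negatives.
    A generic stand-in for the crossmodal contrasting idea, not the exact TimCLR objective."""
    z1 = F.normalize(rgb_feats, dim=-1)
    z2 = F.normalize(depth_feats, dim=-1)
    logits = z1 @ z2.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize over RGB->depth and depth->RGB retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: features extracted from RGB and registered depth crops of the same detections.
loss = crossmodal_info_nce(torch.randn(32, 128), torch.randn(32, 128))
```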
Title: Multimodal Federated Learning on IoT Data
Published: 2022-02-18
Url: http://arxiv.org/abs/2109.04833v2
Authors: Yuchen Zhao,Payam Barnaghi,Hamed Haddadi
Federated learning has been proposed as an alternative to centralized machine learning, since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with Internet-of-Things (IoT) devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its classification performance. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent F1 scores (e.g., with the best performance being higher than 60%), especially when combining contributions from both unimodal clients and multimodal clients.
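A minimal sketch of the server-side aggregation step, assuming plain sample-count weighting over client autoencoder weights; the paper's multimodal FedAvg additionally balances clients holding different modality combinations, which this toy version does not model.

```python
from collections import OrderedDict
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Sample-size-weighted average of client state_dicts (the basic FedAvg step).
    client_states: list of state_dicts sharing the same keys; client_sizes: local sample counts."""
    total = float(sum(client_sizes))
    agg = OrderedDict()
    for key in client_states[0].keys():
        agg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(client_states, client_sizes))
    return agg

# Usage (hypothetical models): states = [client_a.state_dict(), client_b.state_dict()]
# global_autoencoder.load_state_dict(fedavg_aggregate(states, [1200, 800]))
```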
Title: Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models
Published: 2022-02-16
Url: http://arxiv.org/abs/2202.08974v1
Authors: Sarala Padi,Seyed Omid Sadjadi,Dinesh Manocha,Ram D. Sriram
Automatic emotion recognition plays a key role in human-computer interaction, as it has the potential to enrich next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from the speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task, using transfer learning along with a spectrogram augmentation approach, to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. Experimental results indicate that both the audio and text-based models improve emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
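The late-fusion step reduces to combining per-utterance class scores from the two fine-tuned models; the fixed weighting below is an illustrative assumption, not the paper's tuned fusion scheme.

```python
import torch
import torch.nn.functional as F

def late_fusion(audio_logits, text_logits, alpha=0.5):
    """Weighted late fusion of class scores from an audio model (e.g., a spectrogram
    ResNet) and a text model (e.g., fine-tuned BERT); alpha balances the two streams."""
    p_audio = F.softmax(audio_logits, dim=-1)
    p_text = F.softmax(text_logits, dim=-1)
    fused = alpha * p_audio + (1.0 - alpha) * p_text
    return fused.argmax(dim=-1)

# Usage with 4 emotion classes (e.g., angry/happy/neutral/sad on IEMOCAP):
preds = late_fusion(torch.randn(16, 4), torch.randn(16, 4))
```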
Title: CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Published: 2022-02-15
Url: http://arxiv.org/abs/2202.07247v1
Authors: Licheng Yu,Jun Chen,Animesh Sinha,Mengjiao MJ Wang,Hugo Chen,Tamara L. Berg,Ning Zhang
We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a given piece of content (image, text, image+text), and capable of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc. We follow the pre-training + fine-tuning regime and present 5 effective pre-training tasks on image-text pairs. To embrace more common and diverse commerce data with text-to-multimodal, image-to-multimodal, and multimodal-to-multimodal mappings, we propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training. The pre-training is conducted in an efficient manner, with only two forward/backward updates for the combined 14 tasks. Extensive experiments and analysis show the effectiveness of each task. When combining all pre-training tasks, our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning. Additionally, we propose a novel approach of modality randomization to dynamically adjust our model under different efficiency constraints.
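As a rough illustration of what retrieval-style pre-training across embedding types can look like, the sketch below sums an in-batch contrastive loss over ordered pairs of (image, text, multimodal) embeddings. The actual 9 Omni-Retrieval tasks, their content pairing, and the efficient two-update training schedule are not reproduced here.

```python
import itertools
import torch
import torch.nn.functional as F

def retrieval_loss(query, target, temperature=0.07):
    """Standard in-batch contrastive retrieval loss: query i should retrieve target i."""
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)
    logits = q @ t.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def omni_style_retrieval_loss(embeddings):
    """Sum retrieval losses over every ordered pair of embedding types, mimicking the
    spirit of cross-modal and cross-pair retrieval pre-training (a toy version only)."""
    return sum(retrieval_loss(embeddings[a], embeddings[b])
               for a, b in itertools.permutations(embeddings.keys(), 2))

# Usage: embeddings = {"image": img_emb, "text": txt_emb, "mm": fused_emb}, each of shape (B, D).
```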
Title: Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning
Published: 2022-02-08
Url: http://arxiv.org/abs/2202.03775v1
Authors: Hendric Voß,Heiko Wersing,Stefan Kopp
Detecting the mental states of human users is crucial for the development of cooperative and intelligent robots, as it enables a robot to understand the user's intentions and desires. Despite their importance, it is difficult to obtain a large amount of high-quality data for training automatic recognition algorithms, as the time and effort required to collect and label such data is prohibitively high. In this paper, we present a multimodal machine learning approach for detecting dis-/agreement and confusion states in a human-robot interaction environment, using just a small amount of manually annotated data. We collect a data set by conducting a human-robot interaction study and develop a novel preprocessing pipeline for our machine learning approach. By combining semi-supervised and supervised architectures, we are able to achieve an average F1-score of 81.1% for dis-/agreement detection with a small amount of labeled data and a large unlabeled data set, while simultaneously increasing the robustness of the model compared to the supervised approach.
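One common way to combine supervised and semi-supervised signals when labels are scarce is confidence-thresholded pseudo-labelling, sketched below; the paper's specific semi-supervised architecture and preprocessing pipeline are not reproduced here.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, labeled_x, labels, unlabeled_x, threshold=0.95, lam=1.0):
    """Generic pseudo-labelling step mixing labeled and unlabeled multimodal features:
    supervised cross-entropy on labeled data, plus cross-entropy against confident
    pseudo-labels on unlabeled data (an illustrative recipe, not the paper's method)."""
    sup_loss = F.cross_entropy(model(labeled_x), labels)
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()          # keep only confident pseudo-labels
    unsup = F.cross_entropy(model(unlabeled_x), pseudo, reduction="none")
    unsup_loss = (mask * unsup).mean()
    return sup_loss + lam * unsup_loss
```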
Title: GMC -- Geometric Multimodal Contrastive Representation Learning
Published: 2022-02-08
Url: http://arxiv.org/abs/2202.03390v2
Authors: Petra Poklukar,Miguel Vasco,Hang Yin,Francisco S. Melo,Ana Paiva,Danica Kragic
Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem, due to the inherent heterogeneity of data obtained from different channels. To address it, we present a novel Geometric Multimodal Contrastive (GMC) representation learning method comprised of two main components: i) a two-level architecture consisting of modality-specific base encoders, allowing an arbitrary number of modalities to be processed into intermediate representations of fixed dimensionality, and a shared projection head, mapping the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations. We experimentally demonstrate that GMC representations are semantically rich and achieve state-of-the-art performance with missing modality information on three different learning problems, including prediction and reinforcement learning tasks.
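The two-level design and the alignment objective can be sketched as follows; the dimensions, the joint encoder over concatenated modalities, and the exact contrastive formulation are simplifying assumptions rather than GMC's precise loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMCStyleModel(nn.Module):
    """Two-level sketch: per-modality base encoders to a fixed-size intermediate
    representation, plus one shared projection head into the latent space."""
    def __init__(self, modality_dims=(64, 32), inter=128, latent=64):
        super().__init__()
        self.base = nn.ModuleList([nn.Linear(d, inter) for d in modality_dims])
        self.joint = nn.Linear(sum(modality_dims), inter)   # encoder for the complete observation
        self.head = nn.Linear(inter, latent)                # shared projection head

    def forward(self, xs):
        z_mods = [self.head(torch.relu(enc(x))) for enc, x in zip(self.base, xs)]
        z_joint = self.head(torch.relu(self.joint(torch.cat(xs, dim=-1))))
        return z_mods, z_joint

def alignment_contrastive_loss(z_mods, z_joint, temperature=0.1):
    """Pull each modality projection toward the joint projection of the same sample and
    away from other samples in the batch (a generic stand-in for the GMC loss)."""
    labels = torch.arange(z_joint.size(0), device=z_joint.device)
    loss = 0.0
    for z in z_mods:
        logits = F.normalize(z, dim=-1) @ F.normalize(z_joint, dim=-1).t() / temperature
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(z_mods)
```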
Title: Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning
Published: 2022-02-02
Url: http://arxiv.org/abs/2112.03763v2
Authors: DeepMind Interactive Agents Team,Josh Abramson,Arun Ahuja,Arthur Brussee,Federico Carnevale,Mary Cassin,Felix Fischer,Petko Georgiev,Alex Goldin,Mansi Gupta,Tim Harley,Felix Hill,Peter C Humphreys,Alden Hung,Jessica Landon,Timothy Lillicrap,Hamza Merzic,Alistair Muldal,Adam Santoro,Guy Scully,Tamara von Glehn,Greg Wayne,Nathaniel Wong,Chen Yan,Rui Zhu
A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans, using the simplification of a virtual environment. We show that imitation learning of human-human interactions in a simulated world, in conjunction with self-supervised learning, is sufficient to produce a multimodal interactive agent, which we call MIA, that successfully interacts with non-adversarial humans 75% of the time. We further identify architectural and algorithmic techniques that improve performance, such as hierarchical action selection. Altogether, our results demonstrate that imitation of multi-modal, real-time human behaviour may provide a straightforward and surprisingly effective means of imbuing agents with a rich behavioural prior from which agents might then be fine-tuned for specific purposes, thus laying a foundation for training capable agents for interactive robots or digital assistants. A video of MIA's behaviour may be found at https://youtu.be/ZFgRhviF7mY