Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around the world in real time. Emergency response is one such area that stands to gain from these advances. By processing billions of texts and images per minute, events can be automatically detected, enabling emergency response workers to better assess rapidly evolving situations and deploy resources accordingly. To date, most event detection techniques in this area have focused on image-only or text-only approaches, limiting detection performance and impacting the quality of information delivered to crisis response teams. In this paper, we present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis. In addition, we employ a multimodal graph-based approach to stochastically transition between embeddings of different multimodal pairs during training, both to better regularize the learning process and to cope with limited training data by constructing new matched pairs from different samples. We show that our method outperforms unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
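The abstract does not spell out the fusion mechanism, so the following PyTorch sketch is only a rough illustration of what a cross-attention fusion module of this kind might look like; the class name, embedding dimension, head count, pooling, and classifier head are all hypothetical assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: text tokens attend over image regions and vice
    versa, so uninformative parts of the weaker modality can receive low
    attention weights on a per-sample basis (hypothetical design)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 2):
        super().__init__()
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, txt_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # txt_emb: (batch, n_tokens, dim); img_emb: (batch, n_regions, dim)
        txt_ctx, _ = self.txt_to_img(txt_emb, img_emb, img_emb)  # text queries image
        img_ctx, _ = self.img_to_txt(img_emb, txt_emb, txt_emb)  # image queries text
        fused = torch.cat([txt_ctx.mean(dim=1), img_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for text/image encoder outputs.
model = CrossAttentionFusion()
logits = model(torch.randn(8, 32, 256), torch.randn(8, 49, 256))
print(logits.shape)  # torch.Size([8, 2])
```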