This paper tackles the problem of efficiently processing and combining arbitrarily long data streams that come from different modalities with different acquisition frequencies. Typical applications include long-term monitoring of industrial or real-life systems from heterogeneous multimodal data (sensor data, monitoring reports, images, etc.). To address this problem, we propose StreaMulT, a Streaming Multimodal Transformer that relies on cross-modal attention and an augmented memory bank to process arbitrarily long input sequences at training time and to run in a streaming fashion at inference. StreaMulT reproduces state-of-the-art results on the CMU-MOSEI dataset while handling much longer inputs than other models such as the previous Multimodal Transformer.