In this paper, we present a vision for a new generation of multimodal streaming systems that embed multimodal large language models (MLLMs) as first-class operators, enabling real-time query processing across multiple modalities. Achieving this is non-trivial: while recent work has integrated MLLMs into databases for multimodal queries, streaming systems require fundamentally different approaches because of their strict latency and throughput requirements. We propose novel optimizations at all levels, including logical, physical, and semantic query transformations, that reduce model load to improve throughput while preserving accuracy. We demonstrate this with \system{}, a prototype that leverages such optimizations to improve performance by more than an order of magnitude. Finally, we present a research roadmap that outlines open challenges in building scalable and efficient multimodal stream processing systems.