We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and generate combinations of text and multimodal data as output, which introduces heterogeneity in request types, computation paths, and computation scaling during model serving. Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this graph, Cornserve's planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve's distributed runtime then executes the model according to the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81$\times$ higher throughput and up to 5.79$\times$ lower tail latency than existing solutions.
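To make the notion of request-type and computation-path heterogeneity concrete, the following is a minimal sketch of how different requests to one Any-to-Any model could touch different components. It is not Cornserve's actual API: the `Component` class, the component names, and the `computation_path` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Component:
    """One node of a hypothetical Any-to-Any computation graph."""
    name: str
    kind: str                       # "encoder", "llm", or "generator"
    modality: Optional[str] = None  # modality handled by an encoder/generator


# Hypothetical model: two encoders, one LLM, one image generator (e.g., a DiT).
COMPONENTS = [
    Component("image_encoder", "encoder", "image"),
    Component("audio_encoder", "encoder", "audio"),
    Component("llm", "llm"),
    Component("image_generator", "generator", "image"),
]


def computation_path(inputs: Set[str], outputs: Set[str]) -> List[str]:
    """Return the components a request invokes, in execution order.

    Encoders run only for modalities present in the input, the LLM always
    runs, and generators run only for modalities requested in the output.
    """
    path = [c.name for c in COMPONENTS
            if c.kind == "encoder" and c.modality in inputs]
    path += [c.name for c in COMPONENTS if c.kind == "llm"]
    path += [c.name for c in COMPONENTS
             if c.kind == "generator" and c.modality in outputs]
    return path


if __name__ == "__main__":
    # Text-only request: only the LLM is invoked.
    print(computation_path({"text"}, {"text"}))
    # Image in, image out: encoder, LLM, and generator are all invoked.
    print(computation_path({"text", "image"}, {"text", "image"}))
```

In this sketch, a text-only request exercises a single component while an image-to-image request exercises three. That difference in per-request paths is the kind of heterogeneity a deployment plan would need to account for when deciding whether to disaggregate components.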