AI deployment increasingly resembles a pipeline of data transformation, fine-tuning, and agent interactions rather than a monolithic LLM job; recent examples include RLHF/RLAIF training and agentic workflows. To cope with this shift, we propose FlowMesh, a multi-tenant service fabric that executes and optimizes these workloads as one shared service rather than as isolated pipelines. It decomposes workflows into fine-grained operators with recorded lineage, enabling de-duplication of work across users and batching of requests on the same hardware while preserving per-workflow provenance. A global control plane maintains a cluster-wide pool of ready operators and uses a single utility function to pick both the batch and the worker, balancing throughput, cost, and data locality across heterogeneous GPUs. The data plane is an elastic fleet of stateless workers backed by a content-addressable store, enabling rapid, automatic scale-out, safe retry after preemption, and portability across managed Kubernetes clusters and geo-distributed GPU marketplaces such as Vast.ai. Compared with baseline solutions, FlowMesh achieves up to 3.8x lower cost and 2.0x lower energy usage, provides a similar or better latency profile, and remains efficient under dynamic and failure-prone conditions.
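To make the scheduling idea concrete, the sketch below shows one way a control plane could score every candidate (batch, worker) pair with a single utility function that trades off throughput, cost, and data locality, then dispatch the highest-scoring pair. This is a minimal illustration under assumed names, not FlowMesh's actual implementation: the classes, fields, and weights (Batch, Worker, w_thr, w_cost, w_loc) are all hypothetical.

```python
# Illustrative sketch of utility-based joint batch/worker selection.
# All names and weights are assumptions for exposition, not FlowMesh's API.
from dataclasses import dataclass

@dataclass
class Batch:
    op_type: str          # operator kind, e.g. "tokenize" or "reward_score"
    size: int             # number of de-duplicated requests in the batch
    input_keys: set[str]  # content-addressable keys of the batch's inputs

@dataclass
class Worker:
    gpu_tflops: float       # throughput proxy for this GPU class
    dollars_per_hour: float # price of this (possibly marketplace) worker
    cached_keys: set[str]   # inputs already resident on this worker

def utility(batch: Batch, worker: Worker,
            w_thr: float = 1.0, w_cost: float = 0.5, w_loc: float = 0.3) -> float:
    """Score one (batch, worker) pairing on throughput, cost, and locality."""
    throughput = batch.size * worker.gpu_tflops            # work completed per step
    cost = worker.dollars_per_hour / max(batch.size, 1)    # amortized $ per request
    locality = len(batch.input_keys & worker.cached_keys)  # inputs already local
    return w_thr * throughput - w_cost * cost + w_loc * locality

def dispatch(ready_batches: list[Batch], workers: list[Worker]) -> tuple[Batch, Worker]:
    """Pick the jointly best batch and worker from the cluster-wide ready pool."""
    return max(((b, w) for b in ready_batches for w in workers),
               key=lambda pair: utility(*pair))
```

Because a single score ranks all pairings at once, the same mechanism can prefer a cheap marketplace GPU when cost dominates or a cache-warm worker when locality dominates, matching the abstract's claim that one function balances all three objectives.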