Agentic AI frameworks add a decision-making orchestrator, equipped with external tools such as web search, a Python interpreter, and contextual databases, on top of monolithic LLMs, turning them from passive text oracles into autonomous problem-solvers that can plan, call tools, remember past steps, and adapt on the fly. This paper aims to characterize and understand the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first systematically characterize agentic AI along three axes that directly influence system-level performance: the orchestrator/decision-making component, inference-path dynamics, and the repetitiveness of the agentic flow. Based on this characterization, we select five representative agentic AI workloads (Haystack RAG, Toolformer, ChemCrow, Langchain, and SWE-Agent) to profile latency, throughput, and energy, and to demystify the significant impact of CPUs on these metrics relative to GPUs. We observe that: (1) tool processing on CPUs can account for up to 90.6% of total latency; (2) agentic throughput is bottlenecked either by CPU factors (coherence, synchronization, and over-subscription of cores) or by GPU factors (main-memory capacity and bandwidth); and (3) CPU dynamic energy consumes up to 44% of total dynamic energy at large batch sizes. Building on these profiling insights, we present two key optimizations: (1) CPU- and GPU-Aware Micro-batching (CGAM) for homogeneous agentic workloads and (2) Mixed Agentic Workload Scheduling (MAWS) for heterogeneous agentic workloads, demonstrating the potential to improve the performance, efficiency, and scalability of agentic AI. We achieve up to 2.1x and 1.41x P50 latency speedups over the multiprocessing baseline for homogeneous and heterogeneous agentic workloads, respectively.