Serving Large Language Models (LLMs) efficiently has become crucial. LLMs are often served with multiple devices using techniques like data, pipeline, and tensor parallelisms. Each parallelism presents trade-offs between computation, memory, and communication overhead, making it challenging to determine the optimal parallel execution plan. Moreover, input workloads also impact parallelism strategies. Tasks with long prompts like article summarization are compute-intensive, while tasks with long generation lengths like code generation are often memory-intensive; these differing characteristics result in distinct optimal execution plans. Since searching for the optimal plan via actual deployment is prohibitively expensive, we propose APEX, an LLM serving system simulator that efficiently identifies an optimal parallel execution plan. APEX captures the complex characteristics of iteration-level batching, a technique widely used in SOTA LLM serving systems. APEX leverages the repetitive structure of LLMs to reduce design space, maintaining a similar simulation overhead, even when scaling to trillion scale models. APEX supports a wide range of LLMs, device clusters, etc., and it can be easily extended through its high-level templates. We run APEX simulations using a CPU and evaluate the identified optimal plans using 8 H100 GPUs, encompassing a wide range of LLMs and input workloads. We show that APEX can find optimal execution plans that are up to 4.42x faster than heuristic plans in terms of end-to-end serving latency. APEX also reports a set of metrics used in LLM serving systems, such as time per output token and time to first token. Furthermore, APEX can identify an optimal parallel execution plan within 15 minutes using a CPU. This is 71x faster and 1234x more cost-effective than actual deployment on a GPU cluster using cloud services. APEX will be open-sourced upon acceptance.
翻译:暂无翻译