AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to account for cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods for this 3D multi-objective optimization (MOO) problem. Framing inference scaling as MOO exposes a feasible region that 1D and 2D optimization fail to capture, enabling environment-adaptive selection of the inference scaling parameter~$k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy maximization remains preferable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.