AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to account for cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods for this 3D multi-objective optimization (MOO) problem. Framing inference scaling as MOO exposes a feasible region that 1D and 2D optimization fail to capture, enabling environment-adaptive selection of the inference scaling parameter~$k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy maximization remains preferable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.