The rapid advancement of Artificial Intelligence (AI) has created unprecedented demands for computational power, yet methods for evaluating the performance, efficiency, and environmental impact of deployed models remain fragmented. Current approaches often fail to provide a holistic view, making it difficult to compare and optimise systems across heterogeneous hardware, software stacks, and numeric precisions. To address this gap, we propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics under realistic serving conditions. Our framework provides a pragmatic, carbon-aware evaluation by systematically measuring latency and throughput distributions, energy consumption, and location-adjusted carbon emissions, all while maintaining matched accuracy constraints for valid comparisons. We apply this methodology to multi-precision models across diverse hardware platforms, from data-centre accelerators like the GH200 to consumer-level GPUs such as the RTX 4090, running on mainstream software stacks including PyTorch, TensorRT, and ONNX Runtime. By jointly characterising these factors, our work establishes a rigorous benchmarking framework that produces decision-ready Pareto frontiers, clarifying the trade-offs between accuracy, latency, energy, and carbon. The accompanying open-source code enables independent verification and facilitates adoption, empowering researchers and practitioners to make evidence-based decisions for sustainable AI deployment.
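To make the two environmental metrics concrete, the sketch below shows one plausible way to compute them; all function names and the grid-intensity figure are illustrative assumptions, not part of the framework's actual code. Location-adjusted carbon is simply measured energy multiplied by the carbon intensity of the local electricity grid, and latency is reported as a distribution (tail percentiles) rather than a single mean.

```python
# Illustrative sketch (hypothetical names): the two carbon-aware metrics
# described in the abstract, under the stated assumptions.

def location_adjusted_carbon(energy_kwh: float, grid_intensity_gco2_per_kwh: float) -> float:
    """Emissions (gCO2) = energy drawn during the run * local grid carbon intensity."""
    return energy_kwh * grid_intensity_gco2_per_kwh

def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Report the latency *distribution* (p50/p95/p99), not just the mean,
    since tail latency drives serving-quality decisions."""
    xs = sorted(latencies_ms)
    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted sample.
        idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
        return xs[idx]
    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

# Example: 0.42 kWh measured for a run; 56 gCO2/kWh is an illustrative
# grid intensity, not a value from the paper.
print(location_adjusted_carbon(0.42, 56))
print(summarize_latencies([12.0, 13.1, 12.5, 40.2, 12.8]))
```

The same energy measurement yields very different carbon figures depending on deployment region, which is why the abstract stresses *location-adjusted* emissions when comparing hardware/software configurations.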