We study how organizations should select among competing AI models when user utility, deployment costs, and compliance requirements jointly matter. Widely used capability leaderboards do not translate directly into deployment decisions, creating a capability--deployment gap; to bridge it, we take a systems-level view in which model choice is tied to application outcomes, operating constraints, and a capability--cost frontier. We develop ML Compass, a framework that treats model selection as constrained optimization over this frontier. On the theory side, we characterize optimal model configurations under a parametric frontier and show a three-regime structure in optimal internal measures: some dimensions are pinned at compliance minima, some saturate at maximum levels, and the remainder take interior values governed by frontier curvature. We derive comparative statics that quantify how budget changes, regulatory tightening, and technological progress propagate across capability dimensions and costs. On the implementation side, we propose a pipeline that (i) extracts low-dimensional internal measures from heterogeneous model descriptors, (ii) estimates an empirical frontier from capability and cost data, (iii) learns a user- or task-specific utility function from interaction outcome data, and (iv) uses these components to target capability--cost profiles and recommend models. We validate ML Compass with two case studies: a general-purpose conversational setting based on the PRISM Alignment dataset and a healthcare setting based on a custom dataset we construct from HealthBench. In both environments, our framework produces recommendations -- and deployment-aware leaderboards based on predicted deployment value under constraints -- that can differ materially from capability-only rankings, and it clarifies how trade-offs between capability, cost, and safety shape optimal model choice.
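The selection problem described above can be sketched as a toy feasibility-then-ranking step: filter candidate models by compliance minima and a cost budget, then rank the survivors by a learned utility over capability dimensions. All model names, scores, weights, and thresholds below are illustrative placeholders, not values from the paper.

```python
# Toy sketch of deployment-aware model selection as constrained
# optimization: compliance floors and a budget define the feasible
# set; a learned (here, linear) utility ranks feasible models.
# All numbers are hypothetical.

candidates = [
    {"name": "model-a", "caps": {"quality": 0.92, "safety": 0.80}, "cost": 12.0},
    {"name": "model-b", "caps": {"quality": 0.85, "safety": 0.93}, "cost": 4.0},
    {"name": "model-c", "caps": {"quality": 0.70, "safety": 0.95}, "cost": 0.8},
]

utility_weights = {"quality": 1.0, "safety": 0.6}  # stand-in for a learned utility
compliance_min = {"safety": 0.85}                  # regulatory floor on one dimension
budget = 10.0                                      # maximum acceptable cost

def deployment_value(model):
    """Linear utility over capability dimensions (a placeholder form)."""
    return sum(w * model["caps"][k] for k, w in utility_weights.items())

# Feasible set: within budget and above every compliance minimum.
feasible = [
    m for m in candidates
    if m["cost"] <= budget
    and all(m["caps"][k] >= v for k, v in compliance_min.items())
]

# Deployment-aware leaderboard: rank feasible models by predicted value.
ranked = sorted(feasible, key=deployment_value, reverse=True)
leaderboard = [m["name"] for m in ranked]
```

Note how the constrained ranking can diverge from a capability-only one: the highest-quality model is excluded by the safety floor, so a cheaper, safer model tops the deployment-aware leaderboard.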