Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust in their decision-making and pose safety risks, (ii) poor multimodal integration, even though such integration is critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given a complex CXR reasoning task over multimodal medical data, PASS uses its learned task-conditioned distribution over the agentic supernet to select the most suitable tool at each supernet layer, producing probability-annotated trajectories that support post-hoc audits and directly enhance medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize the Pareto frontier between performance and cost, we design a novel three-stage training procedure consisting of expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks show that PASS significantly outperforms strong baselines on multiple metrics (e.g., accuracy, LLM-Judge, and semantic similarity) while balancing computational cost, pushing toward a paradigm shift to interpretable, adaptive, and multimodal medical agentic systems.
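To make the idea of probability-annotated sampling over a layered supernet concrete, the following is a minimal sketch and not the PASS implementation. It assumes that each supernet layer exposes one logit per candidate tool plus a hypothetical `early_exit` action, and that a softmax over those logits gives the layer's sampling distribution. The tool names, the fixed logits, and the `sample_path` helper are all illustrative assumptions; in a real system the logits would presumably come from a learned, task-conditioned controller.

```python
import math
import random

EXIT = "early_exit"  # hypothetical sentinel action, not a name from the paper


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_path(layers, rng):
    """Sample one tool per layer and record the probability of each choice.

    layers: list of dicts mapping an action name to its logit.
    Returns the list of (action, probability) pairs and the probability
    of the whole path (the product of the per-layer probabilities).
    """
    path = []
    log_prob = 0.0
    for layer in layers:
        names = list(layer)
        probs = softmax([layer[n] for n in names])
        choice = rng.choices(names, weights=probs, k=1)[0]
        p = probs[names.index(choice)]
        path.append((choice, p))
        log_prob += math.log(p)
        if choice == EXIT:  # stop deepening the reasoning path
            break
    return path, math.exp(log_prob)


# Illustrative logits only; the tool names are assumptions.
layers = [
    {"segmenter": 1.2, "classifier": 0.4, EXIT: -1.0},
    {"report_generator": 0.9, "vqa_model": 0.7, EXIT: 0.1},
    {"answer_llm": 1.5, EXIT: 0.0},
]

path, path_prob = sample_path(layers, random.Random(0))
for action, p in path:
    print(f"{action}: p={p:.3f}")
print(f"path probability: {path_prob:.4f}")
```

Each sampled step carries the probability with which it was chosen, so the whole trajectory can be inspected afterwards; this is the kind of record that a post-hoc audit could use.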