Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations rather than only on average. This distinction is critical in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior at practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation, an assumption supported by spectral structure in text and joint image-text embeddings, classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results are finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on the intrinsic (effective) dimension rather than the ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.
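The claim that sample complexity tracks intrinsic rather than ambient dimension can be illustrated numerically. The sketch below is our own illustration, not the paper's construction: it computes a standard regularized effective dimension, d_eff(γ) = Σ_i λ_i / (λ_i + γ), for two synthetic eigenvalue spectra sharing the same ambient dimension (512). Fast polynomial eigenvalue decay yields a small effective dimension, while slow decay keeps it close to the ambient dimension; the regularization level γ is a hypothetical choice for the example.

```python
import numpy as np

def effective_dimension(eigenvalues, gamma=1e-3):
    """Regularized effective dimension: sum_i lambda_i / (lambda_i + gamma).

    Counts (softly) how many eigendirections carry variance above the
    regularization scale gamma; this, not the ambient dimension, is what
    spectrum-dependent bounds of this kind typically depend on.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(lam / (lam + gamma)))

ambient_dim = 512
indices = np.arange(1, ambient_dim + 1)

# Fast polynomial decay (lambda_i ~ i^-2): energy concentrates in few directions.
fast_decay = indices.astype(float) ** -2.0
# Slow decay (lambda_i ~ i^-0.5): energy spread over nearly all directions.
slow_decay = indices.astype(float) ** -0.5

d_eff_fast = effective_dimension(fast_decay)   # far below 512
d_eff_slow = effective_dimension(slow_decay)   # close to 512
```

Under fast decay the effective dimension is a few dozen despite the 512-dimensional ambient embedding, so uniform guarantees can hold at sample sizes governed by the smaller quantity.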
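The point that average calibration metrics can hide worst-case miscalibration admits a small simulation. The sketch below uses synthetic data of our own devising, not the paper's experiments: a predictor that is well calibrated on a large majority group but systematically overconfident on a small minority attains a low pooled expected calibration error (ECE), while the minority group's calibration error remains large.

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Binned expected calibration error: weighted mean gap between
    predicted confidence and empirical frequency within each bin."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return err

rng = np.random.default_rng(0)

# Majority group (90% of data): labels drawn from the predicted probabilities,
# so the model is well calibrated here.
p_maj = rng.uniform(0.1, 0.9, size=9000)
y_maj = (rng.uniform(size=9000) < p_maj).astype(float)

# Minority group (10% of data): model predicts 0.9 but the true rate is 0.5,
# i.e. systematic overconfidence on a rare subpopulation.
p_min = np.full(1000, 0.9)
y_min = (rng.uniform(size=1000) < 0.5).astype(float)

probs = np.concatenate([p_maj, p_min])
labels = np.concatenate([y_maj, y_min])

pooled_ece = ece(probs, labels)                       # small: miscalibration diluted
worst_group_ece = max(ece(p_maj, y_maj), ece(p_min, y_min))  # large: ~0.4 on minority
```

The pooled ECE stays low because the minority's roughly 0.4 calibration gap is down-weighted by its 10% share, which is exactly the failure mode a uniform (worst-case) guarantee is meant to rule out.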