Regularized Bayesian calibration and scoring of the WD-FAB IRT model improves predictive performance over marginal maximum likelihood

Item response theory (IRT) is the statistical paradigm underlying a dominant family of generative probabilistic models for test responses, used to quantify traits in individuals relative to target populations. The graded response model (GRM) is a particular IRT model used for ordered polytomous test responses. Both the development and the application of the GRM and other IRT models require statistical decisions. For formulating these models (calibration), one needs to decide on methodologies for item selection, inference, and regularization. For applying these models (test scoring), one needs to make similar decisions, often prioritizing computational tractability and/or interpretability. In many applications, such as the Work Disability Functional Assessment Battery (WD-FAB), tractability implies approximating an individual's score distribution using estimates of its mean and variance, and obtaining that score conditional on only point estimates of the calibrated model. In this manuscript, we evaluate the calibration and scoring of models under this common use case using Bayesian cross-validation. Using the WD-FAB responses collected for the National Institutes of Health, we assess the predictive power of implementations of the GRM based on their ability to yield, on validation sets of respondents, ability estimates that are most predictive of patterns of item responses. Our main finding is that regularized Bayesian calibration of the GRM outperforms the regularization-free empirical Bayesian procedure of marginal maximum likelihood. We also motivate the use of compactly supported priors in test scoring.
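To make the scoring use case concrete, the following is a minimal sketch of GRM scoring, not the authors' implementation. It assumes a logistic link, treats the item parameters as fixed point estimates, and summarizes the ability posterior by its mean and variance on a grid. The item parameters, the grid, and the uniform prior on [-4, 4] (one simple example of a compactly supported prior) are all hypothetical choices made for illustration.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """P(Y = k | theta) for one item under a logistic graded response model.

    theta: ability values, shape (G,)
    a:     item discrimination (positive float)
    b:     strictly increasing thresholds, shape (K-1,)
    Returns an array of shape (G, K).
    """
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    # Cumulative probabilities P(Y >= k | theta) for k = 1..K-1.
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(Y >= 0) = 1 and P(Y >= K) = 0, then difference.
    ones = np.ones((theta.shape[0], 1))
    zeros = np.zeros((theta.shape[0], 1))
    cum = np.hstack([ones, cum, zeros])
    return cum[:, :-1] - cum[:, 1:]

def score_respondent(responses, items, grid):
    """Posterior mean and variance of ability on a grid.

    Assumes a uniform prior over the grid's range (compact support),
    and item parameters held fixed at their point estimates.
    """
    log_post = np.zeros_like(grid)  # uniform prior: constant log density
    for y, (a, b) in zip(responses, items):
        if y is None:  # skip unanswered items
            continue
        log_post += np.log(grm_category_probs(grid, a, b)[:, y])
    weights = np.exp(log_post - log_post.max())  # stabilize before normalizing
    weights /= weights.sum()
    mean = float(np.sum(weights * grid))
    var = float(np.sum(weights * (grid - mean) ** 2))
    return mean, var

# Hypothetical point estimates (discrimination, thresholds) for three 4-category items.
items = [
    (1.2, [-1.0, 0.0, 1.0]),
    (0.8, [-0.5, 0.5, 1.5]),
    (1.5, [-1.5, -0.5, 0.5]),
]
grid = np.linspace(-4.0, 4.0, 161)

high_mean, high_var = score_respondent([3, 3, 3], items, grid)
low_mean, low_var = score_respondent([0, 0, 0], items, grid)
print(high_mean, high_var)
print(low_mean, low_var)
```

Because the prior has compact support, the score estimate always lies inside the grid's range, even for respondents who select the extreme category on every item.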
