This article proposes an alternative to the Hosmer-Lemeshow (HL) test for evaluating the calibration of probability forecasts for binary events. The approach is based on e-values, a new tool for hypothesis testing. An e-value is a random variable with expected value less than or equal to one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a p-value. Our test uses online isotonic regression to estimate the calibration curve as a "betting strategy" against the null hypothesis. We show that the test has power against essentially all alternatives, which makes it theoretically superior to the HL test and at the same time resolves the well-known instability problem of the latter. A simulation study shows that a feasible version of the proposed eHL test can detect slight miscalibrations in practically relevant sample sizes, but pays for its universal validity and power guarantees with lower empirical power than the HL test in a classical simulation setup. We illustrate our test on recalibrated predictions for credit card defaults during the Taiwan credit card crisis, where the classical HL test delivers equivocal results.
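
The e-value and betting ideas above can be made concrete with a minimal Python sketch. It is not the eHL implementation: the function names `calibration_e_value` and `e_to_p` are hypothetical, and the alternative probabilities `alternatives` are simply passed in, whereas the article obtains them from online isotonic regression. The sketch assumes every forecast and every alternative lies strictly between 0 and 1, and that each alternative is chosen using only past data. Under the null hypothesis that outcome t is Bernoulli with the forecast probability p_t, each likelihood-ratio factor has expected value one, so the running product is an e-value.

```python
import math


def calibration_e_value(forecasts, outcomes, alternatives):
    """Product of one-step likelihood ratios, Bernoulli(q_t) against Bernoulli(p_t).

    forecasts    -- probabilities p_t under test (null: y_t ~ Bernoulli(p_t))
    outcomes     -- observed binary outcomes y_t (0 or 1)
    alternatives -- hypothetical 'bets' q_t; assumed to depend only on data
                    observed before time t (here they are just given)
    """
    log_e = 0.0  # accumulate in log space for numerical stability
    for p, y, q in zip(forecasts, outcomes, alternatives):
        if y == 1:
            log_e += math.log(q / p)
        else:
            log_e += math.log((1.0 - q) / (1.0 - p))
    return math.exp(log_e)


def e_to_p(e):
    """Convert an e-value to a p-value via its inverse, capped at one."""
    if e <= 0.0:
        return 1.0
    return min(1.0, 1.0 / e)


# Toy data (made up for illustration): forecasts of 0.5, but the event
# occurs four times in a row, and the bet is on a probability of 0.8.
e = calibration_e_value([0.5] * 4, [1, 1, 1, 1], [0.8] * 4)
p_value = e_to_p(e)
```

In this toy run each factor equals 0.8 / 0.5 = 1.6, so the e-value is 1.6 to the fourth power, roughly 6.55, and the corresponding p-value is about 0.15. Betting exactly on the forecasts (q_t equal to p_t) leaves the e-value at one, which reflects that no evidence against calibration is gathered without a differing alternative.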