Novel Techniques to Assess Predictive Systems and Reduce Their Alarm Burden

The performance of a binary classifier ("predictor") depends heavily upon the context ("workflow") in which it operates. Classic measures of predictor performance do not reflect the realized utility of predictors unless certain implied workflow assumptions are met. Failure to meet these implied assumptions results in suboptimal classifier implementations and a mismatch between predicted or assessed performance and the actual performance obtained in real-world deployments. The mismatch commonly arises when multiple predictions can be made for the same event, the event is relatively rare, and redundant true positive predictions for the same event add little value, e.g., a system that makes a prediction each minute, repeatedly issuing interruptive alarms for a predicted event that may never occur. We explain why classic metrics do not correctly represent the performance of predictors in such contexts, and introduce an improved performance assessment technique ("u-metrics") using utility functions to score each prediction. U-metrics explicitly account for variability in prediction utility arising from temporal relationships. Compared to traditional performance measures, u-metrics more accurately reflect the real-world benefits and costs of a predictor operating in a workflow context. The difference can be significant. We also describe the use of "snoozing," a method whereby predictions are suppressed for a period of time, commonly improving predictor performance by reducing false positives while retaining the capture of events. Snoozing is especially useful when predictors generate interruptive alerts, as so often happens in clinical practice. Utility-based performance metrics correctly predict and track the performance benefits of snoozing, whereas traditional performance metrics do not.
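
To make the two ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes a predictor that emits one boolean prediction per minute, a single event at a known time, and a deliberately simple, hypothetical utility function: the first alert inside a fixed look-ahead window before the event earns +1, redundant alerts in that window earn 0, and every alert outside the window costs a fixed penalty. The names `apply_snooze`, `alert_utilities`, `horizon`, and `false_alarm_cost` are illustrative assumptions and do not come from the abstract.

```python
from typing import List, Optional


def apply_snooze(predictions: List[bool], snooze_len: int) -> List[bool]:
    """Suppress positive predictions for `snooze_len` steps after each emitted alert."""
    alerts = []
    snooze_until = -1  # index of the last suppressed time step
    for t, positive in enumerate(predictions):
        if positive and t > snooze_until:
            alerts.append(True)
            snooze_until = t + snooze_len
        else:
            alerts.append(False)
    return alerts


def alert_utilities(
    alerts: List[bool],
    event_time: Optional[int],
    horizon: int,
    false_alarm_cost: float = 1.0,
) -> List[float]:
    """Assign a utility to each emitted alert (hypothetical utility function).

    The first alert in [event_time - horizon, event_time) scores +1,
    later alerts in that window score 0 (redundant), and alerts
    outside the window score -false_alarm_cost.
    """
    utilities = []
    captured = False
    for t, alert in enumerate(alerts):
        if not alert:
            continue
        in_window = event_time is not None and event_time - horizon <= t < event_time
        if in_window and not captured:
            utilities.append(1.0)
            captured = True
        elif in_window:
            utilities.append(0.0)
        else:
            utilities.append(-false_alarm_cost)
    return utilities


# Toy data: 20 one-minute steps, one event at minute 18.
positive_minutes = {2, 3, 4, 14, 15, 16, 17}
raw = [t in positive_minutes for t in range(20)]
snoozed = apply_snooze(raw, snooze_len=5)

raw_u = alert_utilities(raw, event_time=18, horizon=5)
snoozed_u = alert_utilities(snoozed, event_time=18, horizon=5)

print("alerts without snoozing:", sum(raw), "total utility:", sum(raw_u))
print("alerts with snoozing:   ", sum(snoozed), "total utility:", sum(snoozed_u))
```

In this toy run, snoozing cuts the number of interruptive alerts from seven to two while the event is still captured once. A per-prediction count of true positives would fall from four to one and so make snoozing look harmful, whereas the utility total rises because the dropped true positives were redundant.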
