Traditional evaluation metrics for classification in natural language processing, such as accuracy and area under the curve, fail to differentiate between models whose predictive behaviors differ despite similar performance metrics. We introduce the sensitivity score, a metric that scrutinizes a model's behavior at the vocabulary level to provide insight into disparities in decision-making logic. Using two classifiers trained for hospital readmission classification with similar performance statistics, we assess the sensitivity score on a set of representative words in the test set. Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores. The results indicate that the language model's sensitivity scores align better with the professionals' judgments than those of the XGBoost classifier trained on TF-IDF embeddings, which suggests that XGBoost relies on some spurious features. Overall, this metric offers a novel perspective on assessing a model's robustness by quantifying its discrepancy with professional opinions. Our code is available on GitHub (https://github.com/nyuolab/Model_Sensitivity).
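As a concrete illustration only (the abstract does not spell out the exact formulation), the sketch below assumes the sensitivity score of a word is the mean absolute change in predicted readmission probability when that word is ablated from the input, and that model rankings are compared with clinician rankings via Spearman rank correlation. The names `predict_proba`, `sensitivity_score`, and `clinician_ratings` are hypothetical, not taken from the paper's code.

```python
# Illustrative sketch of a word-level sensitivity score and the rank-correlation
# comparison described in the abstract. Assumptions (not from the paper):
# sensitivity = mean |change in predicted probability| under word ablation, and
# `model.predict_proba(text)` maps a single raw text to a positive-class
# probability (e.g., a pipeline wrapping TF-IDF + XGBoost or a language model).
from scipy.stats import spearmanr


def sensitivity_score(model, texts, word):
    """Mean absolute change in predicted probability when `word` is removed."""
    deltas = []
    for text in texts:
        tokens = text.split()
        if word not in tokens:
            continue  # only score texts that actually contain the word
        ablated = " ".join(t for t in tokens if t != word)
        deltas.append(abs(model.predict_proba(text) - model.predict_proba(ablated)))
    return sum(deltas) / len(deltas) if deltas else 0.0


def rank_agreement(scores_a, scores_b):
    """Spearman rank correlation between two score lists over the same words."""
    rho, _ = spearmanr(scores_a, scores_b)
    return rho


# Hypothetical usage: rank representative test-set words by each classifier's
# sensitivity, then compare each ranking against clinician importance ratings.
# words = ["dialysis", "hospice", ...]
# lm_scores  = [sensitivity_score(language_model, texts, w) for w in words]
# xgb_scores = [sensitivity_score(xgb_model, texts, w) for w in words]
# print(rank_agreement(lm_scores, clinician_ratings))   # higher = closer alignment
# print(rank_agreement(xgb_scores, clinician_ratings))
```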