In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, both in-distribution and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt, 2021).
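To make the notion of a profile concrete, the following is a minimal sketch rather than the authors' implementation. It assumes we already hold a binary correctness matrix `correct` with one row per model and one column per test point. The function name `profile_correlation` and the toy matrix are hypothetical. A profile in the sense above is a full curve of pointwise performance against average performance; here it is summarized by a single Pearson correlation for brevity.

```python
import numpy as np


def profile_correlation(correct, point_idx):
    """Pearson correlation between per-model average accuracy and
    per-model correctness on one point (hypothetical helper).

    correct: array of shape (n_models, n_points) with 0/1 entries.
    Returns NaN when either quantity is constant across models,
    since the correlation is undefined in that case.
    """
    correct = np.asarray(correct, dtype=float)
    avg_acc = correct.mean(axis=1)       # average accuracy of each model
    pointwise = correct[:, point_idx]    # each model's result on this point
    if avg_acc.std() == 0.0 or pointwise.std() == 0.0:
        return float("nan")
    return float(np.corrcoef(avg_acc, pointwise)[0, 1])


# Toy data (an assumption for illustration): 4 models, 6 test points,
# with rows ordered from the weakest model to the strongest.
toy_correct = np.array([
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0],
])

for idx in (4, 5):
    print(idx, profile_correlation(toy_correct, idx))
```

In this toy matrix, point 4 is solved only by the stronger models, so it behaves like a "compatible" point, while point 5 is solved only by the weaker models and so has a negative correlation. In practice the pointwise quantity could also be a softmax probability rather than a 0/1 outcome; that choice is left open here.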