Vision models often fail systematically on groups of data that share common semantic characteristics (e.g., rare objects or unusual scenes), but identifying these failure modes is a challenge. We introduce AdaVision, an interactive process for testing vision models that helps users identify and fix coherent failure modes. Given a natural language description of a coherent group, AdaVision retrieves relevant images from LAION-5B with CLIP. The user then labels a small amount of data for model correctness, which is used in successive retrieval rounds to hill-climb towards high-error regions, refining the group definition. Once a group is saturated, AdaVision uses GPT-3 to suggest new group descriptions for the user to explore. We demonstrate the usefulness and generality of AdaVision in user studies, where users find major bugs in state-of-the-art classification, object detection, and image captioning models. These user-discovered groups have failure rates 2-3x higher than those surfaced by automatic error clustering methods. Finally, finetuning on examples found with AdaVision fixes the discovered bugs when evaluated on unseen examples, without degrading in-distribution accuracy, while also improving performance on out-of-distribution datasets.
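To make the retrieval loop concrete, below is a minimal sketch of one AdaVision-style round: embed the group description with CLIP, steer the query toward images the user has already labeled as failures, and retrieve new candidates. The FAISS index over LAION-5B image embeddings (`laion_index`), the mixing weight `alpha`, and the specific query-update rule are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of one AdaVision-style retrieval round (assumptions noted above).
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(description: str) -> np.ndarray:
    """Embed a natural-language group description with CLIP."""
    inputs = processor(text=[description], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize for cosine search
    return emb.numpy().astype("float32")

def retrieval_round(description, laion_index, failure_embs, k=40, alpha=0.5):
    """Retrieve k candidate images, hill-climbing toward labeled failures.

    laion_index: assumed FAISS inner-product index over CLIP image
    embeddings of LAION-5B. failure_embs: CLIP embeddings of images the
    user marked as model failures in earlier rounds.
    """
    query = embed_text(description)
    if failure_embs:
        # Mix the text query with the mean failure embedding so the next
        # round concentrates on the high-error region (hypothetical rule).
        fail_mean = np.mean(np.stack(failure_embs), axis=0, keepdims=True)
        query = (1 - alpha) * query + alpha * fail_mean.astype("float32")
        query /= np.linalg.norm(query)
    scores, ids = laion_index.search(query, k)
    return ids[0]  # image ids for the user to label next
```

After each round, the user's new correctness labels either expand the failure set (tightening the group definition) or indicate saturation, at which point GPT-3 would be prompted for related group descriptions to explore next.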