Regions of high-dimensional input spaces that are underrepresented in training datasets reduce the performance of machine-learnt classifiers, and may lead to corner cases and unwanted bias when those classifiers are used in decision-making systems. When these regions belong to otherwise well-represented classes, their presence and negative impact are very hard to identify. We propose an approach for the detection and mitigation of such rare subclasses in deep neural network classifiers. The approach is underpinned by an easy-to-compute commonality metric that supports the detection of rare subclasses, and comprises methods for reducing the impact of these subclasses during both model training and model exploitation. We demonstrate our approach using two well-known datasets, MNIST's handwritten digits and Kaggle's cats/dogs, identifying rare subclasses and producing models that compensate for subclass rarity. In addition, we demonstrate how our run-time approach improves users' ability to identify samples that are likely to be misclassified.