Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among the diverse techniques for addressing this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. Instance hardness concerns the handling of unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness, mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting, based on local characteristics, the types of instances that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling or oversampling a specific class, we allow users to find and sample easy- and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide on all of these parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance on a separate test set. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we assess the usefulness of our system based on feedback received from ML experts.
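To make the idea of typing instances by local characteristics more concrete, the following is a minimal sketch and not the HardVis implementation. It assumes a taxonomy that is common in the imbalanced-learning literature: each instance is labeled safe, borderline, rare, or outlier according to how many of its k = 5 nearest neighbours share its class. The function names, the thresholds, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

K = 5  # assumed neighbourhood size; the abstract does not specify one


def same_class_neighbour_counts(points, labels, k=K):
    """For each instance, count how many of its k nearest neighbours share its label."""
    counts = []
    for i, p in enumerate(points):
        # Brute-force Euclidean distances to every other instance.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours = [j for _, j in dists[:k]]
        counts.append(sum(1 for j in neighbours if labels[j] == labels[i]))
    return counts


def hardness_type(count):
    """Map a same-class neighbour count (out of 5) to an assumed hardness category."""
    if count >= 4:
        return "safe"
    if count >= 2:
        return "borderline"
    if count == 1:
        return "rare"
    return "outlier"


# Toy data: a class-0 cluster near the origin, a class-1 cluster far away,
# and one class-1 instance sitting inside the class-0 cluster.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 1.5),  # class 0
    (0.4, 0.6),                                               # class 1, isolated
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5), (10.5, 11.5),  # class 1
]
labels = [0] * 6 + [1] * 7

types = [hardness_type(c) for c in same_class_neighbour_counts(points, labels)]

# Distribution of hardness types per class, the kind of summary a user might compare.
summary = Counter(zip(labels, types))
for (label, kind), n in sorted(summary.items()):
    print(f"class {label}: {n} {kind}")
```

Under these assumptions, a sampling step could then be restricted to chosen categories in any class, for example undersampling only safe instances or oversampling only rare ones, instead of resampling a whole class uniformly.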