Data augmentation is one of the most successful techniques for improving the classification accuracy of machine learning models in computer vision. Applying it to tabular data, however, is challenging because it is hard to generate synthetic samples together with their labels. In this paper, we propose an efficient classifier with a novel data augmentation technique for tabular data. Our method, called CCRAL, combines causal reasoning, which learns counterfactual samples for the original training samples, with active learning, which selects useful counterfactual samples based on a region of uncertainty. In this way, our method can maximize the model's generalization to unseen test data. We validate our method analytically and compare it with standard baselines. Our experimental results show that CCRAL achieves significantly better accuracy and AUC than the baselines across several real-world tabular datasets. Data and source code are available at: https://github.com/nphdang/CCRAL.
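The abstract does not say how the region of uncertainty is defined. As a rough illustration of the selection step only, the sketch below assumes a binary classifier and treats a candidate sample as uncertain when its predicted probability lies within a margin of 0.5. The logistic model, the `margin` parameter, the function names, and the candidate values are all hypothetical and are not taken from the CCRAL code. The causal generation of counterfactual samples is not shown.

```python
import math


def predict_proba(weights, bias, x):
    """Positive-class probability under a simple logistic model (assumed stand-in)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def select_uncertain(candidates, weights, bias, margin=0.1):
    """Keep candidates whose predicted probability is within `margin` of 0.5.

    This is one common way to define a region of uncertainty; the paper's
    actual criterion may differ.
    """
    return [
        x for x in candidates
        if abs(predict_proba(weights, bias, x) - 0.5) <= margin
    ]


# Hypothetical model parameters and candidate synthetic samples (illustrative only).
weights, bias = [1.0, -1.0], 0.0
candidates = [[2.0, 0.0], [0.2, 0.0], [0.0, 3.0], [1.0, 1.1]]

# Only the candidates near the decision boundary are kept for augmentation.
selected = select_uncertain(candidates, weights, bias, margin=0.1)
print(selected)
```

Under these assumptions, candidates far from the decision boundary are discarded, and only those the current model is unsure about would be added to the training set.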