Data augmentation is essential when applying machine learning in small-data regimes. It generates new samples that follow the observed data distribution while increasing their diversity and variability, helping researchers and practitioners improve their models' robustness and, thus, deploy them in the real world. Nevertheless, its use on tabular data still needs improvement: prior knowledge about the underlying data-generating mechanism is seldom considered, which limits the fidelity and diversity of the generated data. Causal data augmentation strategies have been proposed as a solution to these challenges, relying on the conditional independences encoded in a causal graph. In this context, this paper experimentally analyzes the ADMG causal augmentation method under different settings to help researchers and practitioners understand under which conditions prior knowledge helps generate new data points and, consequently, enhances the robustness of their models. The results show that the studied method (a) is independent of the underlying model mechanism, (b) requires a minimum number of observations to improve an ML model's accuracy, which may be hard to obtain in a small-data regime, (c) propagates outliers to the augmented set, degrading the performance of the model, and (d) is sensitive to the value of its hyperparameter.