Automated Machine Learning (AutoML) has become highly successful at automating the time-consuming, iterative tasks of machine learning model development. However, current methods struggle when the data is imbalanced. Many real-world datasets are naturally imbalanced, and handling imbalance improperly can yield models of little practical use, so the problem deserves careful treatment. This paper first introduces a new benchmark for studying how different AutoML methods are affected by label imbalance. Second, we propose strategies to better deal with imbalance and integrate them into an existing AutoML framework. Finally, we present a systematic study that evaluates the impact of these strategies and finds that including them in AutoML systems significantly increases robustness against label imbalance.
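To make concrete why a model trained without regard to label imbalance can be of little use, the following minimal sketch compares plain accuracy with balanced accuracy (the mean of per-class recall) for a classifier that always predicts the majority class. The 95/5 toy labels, the helper names `accuracy` and `balanced_accuracy`, and the choice of metric are illustrative assumptions; they are not taken from the paper's benchmark or its proposed strategies.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

On this toy data the majority-class predictor reaches 0.95 accuracy while never identifying a single minority example, and its balanced accuracy is only 0.50. An evaluation that optimizes plain accuracy alone could therefore prefer such a model.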