Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.