Automated Machine Learning (AutoML) has become increasingly successful on tabular data in recent years. However, processing unstructured data such as text remains a challenge and is not widely supported by open-source AutoML tools. This work compares three manually created text representations with the text embeddings automatically created by AutoML tools. Our benchmark includes four popular open-source AutoML tools and eight text classification datasets. The results show that straightforward text representations perform better than the text embeddings automatically created by the AutoML tools.