Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended. Categorical string features can represent a wide variety of data (e.g., zip codes, names, marital status), and are notoriously difficult to preprocess automatically. In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques. It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations. We also provide an open source Python implementation to automatically preprocess categorical string data in tabular datasets and demonstrate promising results on a wide range of datasets.
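The identify, process, encode pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`infer_string_type`, `encode_column`), the three type labels, the `max_nominal` threshold, and the choice of one-hot and frequency encoding are all assumptions made for illustration only.

```python
from collections import Counter


def _is_number(text):
    """Return True if the string parses as a float (thousands commas allowed)."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def infer_string_type(values, max_nominal=10):
    """Heuristically label a column of strings.

    The labels and the max_nominal threshold are illustrative assumptions,
    not the categories used by the paper.
    """
    non_missing = [v.strip() for v in values if v is not None and v.strip()]
    if not non_missing:
        return "empty"
    if all(_is_number(v) for v in non_missing):
        return "numeric"
    if len(set(non_missing)) <= max_nominal:
        return "nominal"
    return "high_cardinality"


def encode_column(values, max_nominal=10):
    """Return (inferred_type, rows), where each row is a list of floats."""
    kind = infer_string_type(values, max_nominal)
    # Missing values become the empty string so every row stays aligned.
    cleaned = [v.strip() if v is not None else "" for v in values]
    if kind == "numeric":
        # Numbers stored as text: parse them; missing entries become NaN.
        return kind, [
            [float(v.replace(",", ""))] if v else [float("nan")] for v in cleaned
        ]
    if kind == "nominal":
        # Few distinct values: one-hot encode; missing entries are all zeros.
        categories = sorted({v for v in cleaned if v})
        return kind, [[1.0 if v == c else 0.0 for c in categories] for v in cleaned]
    # Many distinct values (or an empty column): relative-frequency encoding.
    counts = Counter(cleaned)
    total = len(cleaned)
    return kind, [[counts[v] / total] for v in cleaned]


kind, rows = encode_column(["single", "married", "single", None])
print(kind, rows)
```

Even this small sketch shows why the task is hard to automate: a zip code such as `"02139"` would be labeled `numeric` here, although treating it as a quantity is rarely meaningful, so a real framework needs richer type detection than these heuristics.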