From Strings to Data Science: a Practical Framework for Automated String Handling

Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended. Categorical string features can represent a wide variety of data (e.g., zip codes, names, marital status), and are notoriously difficult to preprocess automatically. In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques. It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations. We also provide an open source Python implementation to automatically preprocess categorical string data in tabular datasets and demonstrate promising results on a wide range of datasets.
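The three stages named above (identify the type of a string feature, process it accordingly, encode it numerically) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_categories` threshold, the regular expression, and the choice of one-hot and frequency encoding are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a string counts as "numeric" if it looks like a plain decimal number.
NUM_RE = re.compile(r"^-?\d+(\.\d+)?$")


def infer_string_type(values, max_categories=10):
    """Guess a coarse type for a column of strings (hypothetical heuristic)."""
    present = [v for v in values if v]  # ignore None and empty strings
    if present and all(NUM_RE.match(v) for v in present):
        return "numeric"
    if len(set(present)) <= max_categories:
        return "categorical"
    return "high_cardinality"


def encode_column(values, kind):
    """Encode a column as a list of numeric rows, depending on its inferred type."""
    if kind == "numeric":
        # Parse numbers directly; missing entries become NaN.
        return [[float(v)] if v else [float("nan")] for v in values]
    if kind == "categorical":
        # One-hot encoding over the sorted set of observed categories.
        categories = sorted(set(v for v in values if v))
        return [[1.0 if v == c else 0.0 for c in categories] for v in values]
    # High-cardinality strings: relative frequency of each value in the column.
    counts = Counter(values)
    return [[counts[v] / len(values)] for v in values]


status = ["single", "married", "single", "divorced"]
kind = infer_string_type(status)
encoded = encode_column(status, kind)
```

A real pipeline would need many more feature types (zip codes, names, dates, and so on) and more careful handling of missing values; the sketch only shows how type inference can drive the choice of encoder.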

