Practical Guide | Learn in Minutes How to Clean Text for Machine Learning!

October 21, 2017, 全球人工智能 (Global AI), Jason Brownlee

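Summary: In this tutorial, you will learn how to develop simple text cleaning tools, how to use the more sophisticated methods in the NLTK library, and how to prepare text when using modern text representation methods.

You cannot feed raw text directly into a machine learning or deep learning model. You must clean the text first, which means splitting it into words and handling punctuation and case.

In fact, there is a whole suite of text preparation methods you may need to use, and the choice of methods depends on your natural language processing task.

In this tutorial, you will learn how to clean and prepare text for machine learning modeling, including:

How to develop simple text cleaning tools.
How to use the more sophisticated methods in the NLTK library.
How to prepare text when using modern text representation methods.

Let's get started.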

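Tutorial Overview

This tutorial is divided into the following parts:

Metamorphosis by Franz Kafka
Text Cleaning Is Task Specific
Manual Tokenization
Tokenization and Cleaning with NLTK
Additional Text Cleaning Considerations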

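Metamorphosis by Franz Kafka

Let's start by selecting a dataset.

This tutorial uses the text of the book Metamorphosis by Franz Kafka. There is no specific reason for this choice other than that the book is short. I like it and hope you will too. I expect it is one of those classics that most students have to read.

The full text of Metamorphosis is available for free from Project Gutenberg.

Metamorphosis by Franz Kafka on Project Gutenberg

You can also download the plain-text version here:

Metamorphosis by Franz Kafka, plain text (UTF-8).

Download the file and place it in your current working directory with the file name "metamorphosis.txt".

The file contains header and footer information that we are not interested in, specifically copyright and license information. Open the file, delete the header and footer, and save the file as "metamorphosis_clean.txt".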
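Removing the header and footer can also be done in code rather than by hand. The following is a minimal sketch, not part of the original tutorial. It assumes the downloaded file delimits the book with lines beginning `*** START OF` and `*** END OF`; those marker strings and the helper name `strip_gutenberg_boilerplate` are assumptions, so check your copy of the file before relying on them.

```python
def strip_gutenberg_boilerplate(text,
                                start_marker='*** START OF',
                                end_marker='*** END OF'):
    """Return the text between the assumed start and end marker lines."""
    lines = text.splitlines()
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1          # body begins after the start marker line
        elif line.startswith(end_marker):
            end = i                # body stops before the end marker line
            break
    return '\n'.join(lines[start:end]).strip()


# Tiny made-up sample standing in for the downloaded file.
sample = (
    'Licence header text\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK ***\n'
    'One morning, when Gregor Samsa woke from troubled dreams...\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK ***\n'
    'Licence footer text\n'
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Even with a helper like this, it is worth opening the result once to confirm that it starts and ends where you expect, since title pages or translator notes may sit inside the markers.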

The start of the clean file should look like this:

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

The file should end with:

And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.

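Text Cleaning Is Task Specific

Once you have your text data, the first step in cleaning it is to form a clear idea of what you are trying to achieve.

Take a careful look at the text. What do you notice?

Here is what I see:

It is plain text, so there is no markup to parse.
The translation of the original German uses British English (e.g. "travelling").
The lines are artificially wrapped with new lines at roughly 70 characters.
There are no obvious typos or spelling mistakes.
There is punctuation such as commas, apostrophes, quotes, question marks, and more.
There are hyphenated descriptions such as "armour-like".
The dash (" - ") is used a lot to continue sentences (could it be replaced with commas?).
There are names (e.g. "Mr. Samsa").
There do not appear to be numbers that require handling (e.g. 1999).
There are section markers (e.g. "II" and "III"), and the first "I" has already been removed.

I am sure a trained eye would notice a lot more.

The rest of this tutorial shows the most common text cleaning steps. Before moving on, though, consider some of the objectives you might have when working with this text file.

For example:

If you were interested in developing a Kafkaesque language model, you might want to keep all of the case, quotes, and other punctuation in place.
If you were interested in classifying documents as "Kafka" and "not Kafka", you might want to strip case and punctuation.

Choose how to prepare your text data based on the task you are trying to complete.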

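Manual Tokenization

Text cleaning is hard, but the text used in this tutorial is already pretty clean.

We could just write some Python code to clean it up manually, which is a reasonable approach for the simple problems you encounter. Tools like regular expressions and string splitting can get you a long way.

1. Load Data

Let's load the text data so that we can work with it.

The text file is small and loads into memory quickly. This will not always be the case, and you may need to write code that memory-maps the file. Tools like NLTK (covered in the next section) make working with large files much easier.

We can load the entire "metamorphosis_clean.txt" file into memory as follows:

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

Running the example loads the whole file into memory, ready to work with.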
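For files too large to read in one go, one option is to iterate over the file line by line inside a `with` block, which also closes the file automatically. This is only a sketch under that assumption; the helper name `iter_lines` and the temporary stand-in file are illustrative and not part of the tutorial.

```python
import os
import tempfile


def iter_lines(path, encoding='utf-8'):
    """Yield one line at a time, without its trailing newline."""
    with open(path, 'rt', encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip('\n')


# Write a small stand-in file so the sketch runs without the real book.
with tempfile.NamedTemporaryFile('wt', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('One morning, when Gregor Samsa woke\nfrom troubled dreams\n')
    path = tmp.name

lines = list(iter_lines(path))
os.remove(path)
print(lines)
```

Because `iter_lines` is a generator, only one line is held in memory at a time until the caller decides to collect the results.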

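2. Split by Whitespace

Clean text often means a list of words or tokens that we can work with in our machine learning models. This means converting the raw text into a list of words and saving it.

A very simple way to do this is to split the document by whitespace, including spaces, new lines, tabs, and more. We can do this in Python by calling the split() function on the loaded string.

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
print(words[:100])

Running the example splits the document into a long list of words and prints the first 100.

We can see that punctuation inside words is preserved (e.g. "wasn't" and "armour-like"), which is nice. We can also see that end-of-sentence punctuation is kept attached to the last word (e.g. "thought."), which is not great.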

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
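The behaviour of split() is easy to check on a short string. The sketch below uses a made-up sample containing a tab, a new line, and repeated spaces, and contrasts `split()` with `split(' ')`, which does not collapse runs of spaces.

```python
# A short sample with a tab, a newline and repeated spaces.
sample = 'One morning,\twhen Gregor\nSamsa   woke. "What\'s happened to me?" he thought.'

# With no argument, split() treats any run of whitespace as one separator.
words = sample.split()
print(words)

# Passing an explicit separator keeps the empty strings between repeated spaces.
naive = 'Samsa   woke.'.split(' ')
print(naive)
```

The argument-free form is usually the one you want for text that has been hard-wrapped, as this book has.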

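3. Select Words

Another approach is to use the regular expression module (re) and split the document into words by selecting strings of alphanumeric characters (a-z, A-Z, 0-9 and '_').

For example:

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
import re
words = re.split(r'\W+', text)
print(words[:100])

Running the example, we can see the resulting list of words. This time, "armour-like" has become two words, "armour" and "like" (fine), but contractions such as "What's" have also become two words, "What" and "s" (not great).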

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']
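If the goal is to split hyphenated words while keeping contractions whole, one possible variation is to match words with re.findall() instead of splitting on non-word characters. The pattern below is an assumption of this sketch, not something from the tutorial, and it only recognises the straight apostrophe ('), so text with curly apostrophes would need them normalised first.

```python
import re

sample = '"What\'s happened to me?" he thought. It wasn\'t a dream. His armour-like back'

# A run of word characters, optionally followed by an apostrophe and more
# word characters, so "What's" stays whole while "armour-like" is split.
pattern = re.compile(r"\w+(?:'\w+)?")
words = pattern.findall(sample)
print(words)
```

Matching rather than splitting also avoids the empty string that re.split() returns as the first element when the text begins with punctuation.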

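3. Split by Whitespace and Remove Punctuation

Note: this example was written for Python 3.

We want the words, but without punctuation such as commas and quotes. We also want to keep contractions together.

One way is to split the document into words by whitespace (as in "2. Split by Whitespace"), then use string translation to replace all punctuation with nothing (i.e. remove it).

Python provides a constant called string.punctuation, a string containing all of the ASCII punctuation characters. For example:

import string
print(string.punctuation)

This prints: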

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Python offers a string method called translate() that maps one set of characters to another.

We can use the function maketrans() to create a mapping table. Its third argument lists all of the characters to remove during translation. For example:

table = str.maketrans('', '', string.punctuation)

We can put all of this together: load the text file, split it into words by whitespace, then translate each word to remove the punctuation.

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])

We can see that this has had the desired effect, for the most part.

Contractions such as "What's" have become "Whats", and "armour-like" has become "armourlike".

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']
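If you would rather have "armour-like" become two words than the fused "armourlike", one option is to turn hyphens into spaces before splitting. The sketch below assumes that is what you want; the helper name `strip_punctuation` is illustrative. It also drops tokens that consisted only of punctuation (such as a standalone dash), which would otherwise be left behind as empty strings.

```python
import string


def strip_punctuation(text):
    """Split on whitespace and remove punctuation, splitting hyphenated words."""
    # Turn hyphens into spaces first so hyphenated words split into their parts.
    text = text.replace('-', ' ')
    table = str.maketrans('', '', string.punctuation)
    stripped = (w.translate(table) for w in text.split())
    # Drop tokens that were made up of punctuation only.
    return [w for w in stripped if w]


sample = 'his armour-like back - "What\'s happened to me?"'
words = strip_punctuation(sample)
print(words)
```

Whether splitting hyphenated words is an improvement depends on the task, as discussed earlier.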

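4. Normalize Case

It is common to convert all words to one case. This reduces the vocabulary size, but some distinctions are lost (the usual example is "Apple" the company versus "apple" the fruit).

We can convert all words to lowercase by calling the lower() function on each word.

For example:

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

Running the example, we can see that all words are now lowercase.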

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']
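The effect of case normalization on vocabulary size can be checked by counting distinct tokens before and after lowercasing. The sentence below is a made-up sample for illustration, not taken from the book.

```python
sample = ('The bedding was hardly able to cover it. '
          'The room was quiet and the bed was small.')
words = sample.split()

# Distinct tokens with the original case, then after lowercasing.
vocab_original = set(words)
vocab_lower = set(w.lower() for w in words)

print(len(vocab_original), len(vocab_lower))
```

Here "The" and "the" collapse into a single entry, so the lowercased vocabulary is one token smaller.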

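Note

Cleaning text is really hard and problem specific. Remember, simple is better. Simpler text data means simpler models and a smaller vocabulary.

Next, we will look at some of the tools in the NLTK library that offer more than simple string splitting.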

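Tokenization and Cleaning with NLTK

The Natural Language Toolkit, or NLTK for short, is a Python library for working with and modeling text.

It provides good tools for loading and cleaning text that we can use to get our data ready for machine learning and deep learning algorithms.

1. Install NLTK

You can install NLTK using your favorite package manager, such as pip:

sudo pip install -U nltk

After installation, you also need to install the data used with the library, including a large set of documents that you can use to test other tools in NLTK.

There are several ways to install the data and documents, for example from within a script:

import nltk
nltk.download()

Or from the command line:

python -m nltk.downloader all

For more help installing and setting up NLTK, see:

Installing NLTK
Installing NLTK Data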

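2. Split into Sentences

A good first step is to split the text into sentences.

Some modeling tasks, such as word2vec, prefer input in the form of paragraphs or sentences. You could first split your text into sentences, split each sentence into words, then save each sentence to a file, one per line.

NLTK provides the sent_tokenize() function to split text into sentences.

The example below loads the "metamorphosis_clean.txt" file into memory, splits it into sentences, and prints the first sentence.

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

Running the example, we can see that the document has been split into sentences. Note that the first sentence still contains the line break from the wrapping in the original file.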

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.
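Saving one sentence per line, as suggested above, requires removing the line breaks that survive inside each sentence. The sketch below shows one way to do that; the helper name `write_sentences` is illustrative, the sentences are written by hand rather than produced by sent_tokenize(), and an in-memory buffer stands in for a real output file.

```python
import io


def write_sentences(sentences, handle):
    """Write each sentence on its own line, collapsing internal whitespace."""
    for sentence in sentences:
        handle.write(' '.join(sentence.split()) + '\n')


# Sentences as a splitter might return them, still carrying the original wrapping.
sentences = [
    'One morning, when Gregor Samsa woke from troubled dreams, he found\n'
    'himself transformed in his bed into a horrible vermin.',
    'He lay on his armour-like back.',
]

buffer = io.StringIO()   # stands in for a file opened with open(path, 'wt')
write_sentences(sentences, buffer)
output = buffer.getvalue()
print(output)
```

Joining the result of split() with single spaces normalizes tabs and repeated spaces as well as new lines.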

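3. Split into Words

NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words).

It splits tokens based on whitespace and punctuation. For example, commas and periods are treated as separate tokens. Contractions are split apart (e.g. "What's" becomes "What" and "'s"). Quotes are kept.

For example:

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

Running the code, we can see that punctuation marks are now tokens that we can decide later whether to filter out.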

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']

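4. Filter Out Punctuation

We can filter out the tokens we are not interested in, such as all standalone punctuation.

This can be done by iterating over all tokens and keeping only those that are entirely alphabetic. Python's isalpha() string method is useful here.

For example:

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

Running the example, you can see that not only punctuation tokens but also tokens such as "armour-like" and "'s" were filtered out.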

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']
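To see exactly which kinds of token isalpha() rejects, it helps to run the filter on a few hand-picked tokens. The list below is written by hand to resemble tokenizer output; it is not produced by NLTK.

```python
# Hand-written tokens resembling what a tokenizer might produce.
tokens = ['armour-like', "'s", 'What', ',', "n't", 'room', '1999', '``']

# isalpha() is True only when every character is a letter.
words = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(words)
print(dropped)
```

Hyphenated words, contraction fragments, and numbers are all dropped along with the punctuation, which may or may not be what the task needs.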

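5. Filter Out Stop Words

Stop words are words that do not contribute to the deeper meaning of a phrase.

They are the most common words, such as "the", "a", and "is".

For some applications, such as document classification, it may make sense to remove stop words.

NLTK provides lists of commonly used stop words for a variety of languages, such as English. They can be loaded as follows:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

You can see the full list: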

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

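You can see that they are all lowercase and have had punctuation removed. You can compare your tokens against the stop words and filter them out.

Let's demonstrate this with a small text preparation pipeline:

Load the raw text.
Split into tokens.
Convert to lowercase.
Remove punctuation from each token.
Filter out remaining tokens that are not alphabetic.
Filter out tokens that are stop words.

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

Running the example, we can see that, in addition to the other transforms, stop words such as "a" and "to" have been removed. However, tokens such as "nt" are still there. The job is never quite finished; there is always more that could be done.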

['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer']
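The same pipeline can be wrapped in a function that takes the tokens and the stop word set as arguments, which makes it easy to try on small inputs. This is a sketch rather than the tutorial's own code: the name `clean_tokens` is illustrative, and the token list and tiny stop word set are hand-written stand-ins for the output of word_tokenize() and stopwords.words('english'). It also shows one possible way to deal with the leftover "nt" fragment, namely adding it to the stop words.

```python
import string


def clean_tokens(tokens, stop_words):
    """Lowercase, strip punctuation, keep alphabetic tokens, drop stop words."""
    table = str.maketrans('', '', string.punctuation)
    lowered = (t.lower() for t in tokens)
    stripped = (t.translate(table) for t in lowered)
    return [t for t in stripped if t.isalpha() and t not in stop_words]


# Hand-written stand-ins for tokenizer output and a stop word list.
tokens = ['It', 'was', "n't", 'a', 'dream', '.', 'His', 'room', ',', 'a',
          'proper', 'human', 'room']
stop_words = {'it', 'was', 'a', 'his'}

words = clean_tokens(tokens, stop_words)
print(words)

# Treating the contraction fragment as a stop word removes it as well.
words_no_fragments = clean_tokens(tokens, stop_words | {'nt'})
print(words_no_fragments)
```

Dropping "nt" also discards the negation it carried, so whether this is acceptable depends on the task.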

6. Stem Words

Stemming is the process of reducing each word to its stem or root. For example, "fishing", "fished", and "fisher" all reduce to the stem "fish".
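To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any other real stemmer: the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen just to reproduce the "fish" example.

```python
def naive_stem(word, suffixes=('ing', 'ed', 'er')):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


stems = [naive_stem(w) for w in ['fishing', 'fished', 'fisher', 'fish']]
print(stems)
```

A rule this crude breaks quickly on real text (it would turn "water" into "wat", for example), which is why established stemming algorithms use far more careful rules.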

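There are many stemming algorithms, and a popular, long-standing one is the Porter stemming algorithm. It is available in NLTK via the PorterStemmer class.

For example:

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

Running the example, you can see that many words have been reduced to their stems; for example, "troubled" has become "troubl". You can also see that the stemming implementation has lowercased most of the tokens.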

['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', '``', 'what', "'s", 'happen', 'to'