polyglot: A Multilingual NLP Toolkit with Pipeline Support

December 11, 2018, AINLP

Author: 岳永鹏 (Yue Yongpeng), 知道创宇 (Knownsec) IA-Lab

Column: http://www.52nlp.cn/author/befeng

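For NLP work today, open-source toolkits for English include NLTK, spaCy, Stanford CoreNLP, GATE, and OpenNLP, and toolkits for Chinese include Jieba, ICTCLAS, THULAC, and HIT LTP. Most of these tools, however, support only specific languages. This article introduces polyglot, a powerful multilingual Python toolkit that supports pipeline-style processing. The project was first open-sourced on GitHub by aboSamoor on March 16, 2015, and has since collected 1,021 stars.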

Free software: GPLv3 license
Documentation: http://polyglot.readthedocs.org
GitHub: https://github.com/aboSamoor/polyglot

Features

Language Detection (196 languages)
Sentence Segmentation and Tokenization (165 languages)
Named Entity Recognition (40 languages)
Part of Speech Tagging (16 languages)
Sentiment Analysis (136 languages)
Word Embeddings (137 languages)
Transliteration (69 languages)
Pipelines

Installation

Install or upgrade from PyPI

$ pip install polyglot

polyglot depends on numpy and libicu-dev. On Ubuntu / Debian Linux distributions you can install these packages by running:
$ sudo apt-get install python-numpy libicu-dev
After a successful installation, check the version in a Python interpreter:

>>> import polyglot
>>> polyglot.__version__
'16.07.04'

Data

The examples that follow use Chinese, English, or mixed Chinese-English sentences as test data.

text_en = u"Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."
text_cn = u" 日本最后一家寻呼机服务营业商宣布，将于2019年9月结束服务，标志着日本寻呼业长达50年的历史正式落幕。目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务，该公司在20年前就已停止生产寻呼机。"
text_mixed = text_cn + text_en

Language Detection

polyglot's language detection depends on pycld2 and cld2; cld2 is a multilingual language detection library developed by Google.

Example

Import the dependency

from polyglot.detect import  Detector

Detecting the language

>>> Detector(text_cn).language
name: Chinese     code: zh       confidence:  99.0 read bytes:  1996
>>> Detector(text_en).language
name: English     code: en       confidence:  99.0 read bytes:  1144
>>> Detector(text_mixed).language
name: Chinese     code: zh       confidence:  50.0 read bytes:  1996

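For the mixed Chinese-English text_mixed, the detected language is Chinese, but the confidence is only 50. To list every language detected in the text, iterate over the languages attribute:

>>> for language in Detector(text_mixed).languages:
...     print(language)
...
name: Chinese     code: zh       confidence:  50.0 read bytes:  1996
name: English     code: en       confidence:  49.0 read bytes:  1144
name: un          code: un       confidence:   0.0 read bytes:     0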
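
To build some intuition for why mixed text splits the confidence roughly in half, the sketch below measures how much of a string is written in CJK ideographs versus ASCII letters. This is a crude script-ratio heuristic, not how cld2 works, and the function name `script_shares` is hypothetical:

```python
def script_shares(text):
    """Return the share of CJK ideographs and of ASCII letters among counted characters."""
    # Characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # ASCII letters only; digits, punctuation and whitespace are ignored.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cjk + latin
    if total == 0:
        return {"cjk": 0.0, "latin": 0.0}
    return {"cjk": cjk / total, "latin": latin / total}


print(script_shares("日本 pager provider"))
```

A real detector such as cld2 relies on much richer statistics than the script of each character, so this is only meant to show that a mixed input carries evidence for more than one language.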

Currently, cld2 supports detection of 196 languages:

>>> Detector.supported_languages()

  1. Abkhazian    2. Afar    3. Afrikaans  ......  194. Yoruba    195. Zhuang    196. Zulu

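Sentence Segmentation and Tokenization

NLP tasks can operate at the character, word, sentence, paragraph, or document level. Tokenization is the step that finds the boundaries between characters, words, sentences, and paragraphs. Paragraphs can be split on \n or \r\n, and splitting into characters is also easy; sentence segmentation and word segmentation are somewhat more involved.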
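
As a minimal sketch of the two easy cases (paragraph and character splitting), using only the Python standard library. The helper names `split_paragraphs` and `split_characters` are illustrative and are not part of polyglot:

```python
def split_paragraphs(text):
    """Split text into paragraphs at line breaks, dropping blank lines."""
    # str.splitlines() treats \n, \r\n and \r (among other line boundaries) as breaks.
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_characters(text):
    """Split text into individual non-whitespace characters."""
    return [ch for ch in text if not ch.isspace()]


sample = "第一段。\r\nSecond paragraph.\n\n第三段。"
print(split_paragraphs(sample))
print(split_characters("寻呼机 ok"))
```

Sentence and word boundaries cannot be found this way, especially in Chinese, where words are not separated by spaces; that is the part a toolkit has to handle.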

Example

Import the dependency

from polyglot.text import Text

Sentence segmentation

>>> Text(text_cn).sentences
[Sentence("日本最后一家寻呼机服务营业商宣布，将于2019年9月结束服务，标志着日本寻呼业长达50年的历史正式落幕。"), Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务，该公司在20年前就已停止生产寻呼机。")]
>>> Text(text_en).sentences
[Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.")]
>>> Text(text_mixed).sentences
[Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."), Sentence("日本最后一家寻呼机服务营业商宣布，将于2019年9月结束服务，标志着日本寻呼业长达50年的历史正式落幕。"), Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务，该公司在20年前就已停止生产寻呼机。")]

Word segmentation

>>> Text(text_cn).words
日本 最后 一家 寻 呼 机 服务 营业 商 宣布 ， 将 于 2019 年 9 月 结束 服务 ， 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 ， 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。
>>> Text(text_en).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .
>>> Text(text_mixed).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years . 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 ， 将 于 2019 年 9 月 结束 服务 ， 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 ， 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。

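Named Entity Recognition

Named entity recognition identifies the entities in a text that carry specific meaning. They commonly fall into three categories:

Entities: person names, place names, organization names, product names, brand names, and so on
Time expressions: dates, times
Numeric expressions: birthdays, phone numbers, QQ numbers, and so on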

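NER methods can likewise be divided into three kinds:

Rule-based: linguistic grammar-based techniques
Grammar-based techniques rely mainly on rules. In engineering practice this means writing many regular expressions (RegEx), which can handle part of the time and numeric entities.
Statistical learning: statistical models
The main statistical models today are HMM and CRF, which are also the more mature approach.
Deep learning: deep learning models
Deep learning is currently the most popular approach, especially RNN-family models, which capture more of the text's semantic information and currently give the best results.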
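
The rule-based approach can be sketched with two regular expressions, one for dates and one for phone numbers. The patterns and the function name `find_rule_based_entities` are illustrative assumptions; real rule sets need many more cases and are unrelated to how polyglot recognizes entities:

```python
import re

# Illustrative patterns only: Chinese-style dates (2019年9月, 2019年9月1日) and ISO dates.
DATE_PATTERN = re.compile(r"\d{4}年\d{1,2}月(?:\d{1,2}日)?|\d{4}-\d{2}-\d{2}")
# An 11-digit number starting with 1, not embedded in a longer digit run.
PHONE_PATTERN = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def find_rule_based_entities(text):
    """Return (label, matched_text) pairs found by the regex rules."""
    entities = [("DATE", m.group()) for m in DATE_PATTERN.finditer(text)]
    entities += [("PHONE", m.group()) for m in PHONE_PATTERN.finditer(text)]
    return entities


print(find_rule_based_entities("将于2019年9月结束服务，联系电话13800138000。"))
```

Rules like these are precise for well-formatted time and numeric entities, but they do not generalize to person, place, or organization names, which is where statistical and deep learning models come in.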

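polyglot's NER models are trained on Wikipedia. The pretrained models are not included in the initial installation and must be downloaded separately. polyglot supports recognition of entity names (persons, locations, organizations) in 40 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("ner2", 3))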

 1. Polish    2. Turkish    ...... 39. Vietnamese    40. Estonian

Downloading models

Download the English and Chinese NER models, together with the corresponding word embeddings:

$ polyglot download ner2.en ner2.zh embeddings2.zh embeddings2.en
[polyglot_data] Downloading package ner2.en to
[polyglot_data] Downloading package ner2.zh to
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

>>> from polyglot.text import Text

Recognizing entities

>>> Text(text_cn).entities
[I-ORG([u'东京'])]
>>> Text(text_en).entities
[I-LOC([u'Tokyo'])]
>>> Text(text_mixed).entities
[I-ORG([u'东京'])]

Part of Speech Tagging

POS tagging assigns a part-of-speech label to each token. The common tags are:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other

polyglot's POS tagging models are trained on the CoNLL dataset. They support 16 languages; Chinese is not among them.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("pos2"))
  1. German    2. Italian    3. Danish    ......    14. Irish    15. Hungarian    16. Dutch

Downloading models

Download the English POS tagging model:

$ polyglot download pos2.en
[polyglot_data] Downloading package pos2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

from polyglot.text import Text

Tagging parts of speech

>>> Text(text_en).pos_tags
[(u"Japan's", u'NUM'), (u'last', u'ADJ'), (u'pager', u'NOUN'), (u'provider', u'NOUN'), (u'has', u'AUX'), (u'announced', u'VERB'), (u'it', u'PRON'), (u'will', u'AUX'), (u'end', u'VERB'), (u'its', u'PRON'), (u'service', u'NOUN'), (u'in', u'ADP'), (u'September', u'PROPN'), (u'2019', u'NUM'), (u'-', u'PUNCT'), (u'bringing', u'VERB'), (u'a', u'DET'), (u'national', u'ADJ'), (u'end', u'NOUN'), (u'to', u'ADP'), (u'telecommunication', u'VERB'), (u'beepers', u'NUM'), (u',', u'PUNCT'), (u'50', u'NUM'), (u'years', u'NOUN'), (u'after', u'ADP'), (u'their', u'PRON'), (u'introduction.Around', u'NUM'), (u'1,500', u'NUM'), (u'users', u'NOUN'), (u'remain', u'VERB'), (u'subscribed', u'VERB'), (u'to', u'ADP'), (u'Tokyo', u'PROPN'), (u'Telemessage', u'PROPN'), (u',', u'PUNCT'), (u'which', u'DET'), (u'has', u'AUX'), (u'not', u'PART'), (u'made', u'VERB'), (u'the', u'DET'), (u'devices', u'NOUN'), (u'in', u'ADP'), (u'20', u'NUM'), (u'years', u'NOUN'), (u'.', u'PUNCT')]
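
Since the tagger returns plain (word, tag) pairs, the result is easy to post-process. The sketch below uses a handful of pairs copied from the output above and standard-library tools only; the variable names are illustrative:

```python
from collections import Counter

# A few (word, tag) pairs copied from the tagger output shown above.
pos_tags = [
    ("last", "ADJ"), ("pager", "NOUN"), ("provider", "NOUN"),
    ("has", "AUX"), ("announced", "VERB"), ("it", "PRON"),
    ("Tokyo", "PROPN"), ("Telemessage", "PROPN"),
]

# Frequency of each tag in the sample.
tag_counts = Counter(tag for _, tag in pos_tags)
# Keep only common and proper nouns.
nouns = [word for word, tag in pos_tags if tag in ("NOUN", "PROPN")]

print(tag_counts.most_common(2))
print(nouns)
```

Filtering on NOUN and PROPN like this is a common first step for keyword extraction.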

Sentiment Analysis

polyglot's sentiment analysis works at the word level: each token is tagged 1 for positive, 0 for neutral, and -1 for negative. It currently supports 136 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("sentiment2"))
 1. Turkmen    2. Thai    ......    135. Bosnian-Croatian-Serbian    136. Slovene

Downloading models

Download the English and Chinese sentiment models:

$ polyglot download sentiment2.en sentiment2.zh
[polyglot_data] Downloading package sentiment2.en to
[polyglot_data] Downloading package sentiment2.zh to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

from polyglot.text import Text

Analyzing sentiment

>>> text = Text("The movie is very good and the actors are prefect, but the cinema environment is very poor.")
>>> print(text.words,text.polarity)
(WordList([u'The', u'movie', u'is', u'very', u'good', u'and', u'the', u'actors', u'are', u'prefect', u',', u'but', u'the', u'cinema', u'environment', u'is', u'very', u'poor', u'.']), 0.0)
>>> print([(w,w.polarity) for w in text.words])
[(u'The', 0), (u'movie', 0), (u'is', 0), (u'very', 0), (u'good', 1), (u'and', 0), (u'the', 0), (u'actors', 0), (u'are', 0), (u'prefect', 0), (u',', 0), (u'but', 0), (u'the', 0), (u'cinema', 0), (u'environment', 0), (u'is', 0), (u'very', 0), (u'poor', -1), (u'.', 0)]
>>> text = Text("这部电影故事非常好，演员也非常棒，但是电影院环境非常差。")
>>> print(text.words,text.polarity)
(WordList([这 部 电影 故事 非常 好 ， 演员 也 非常 棒 ， 但是 电影 院 环境 非常 差 。]), 0.0)
>>> print([(w,w.polarity) for w in text.words])
[(u'\u8fd9', 0), (u'\u90e8', 0), (u'\u7535\u5f71', 0), (u'\u6545\u4e8b', 0), (u'\u975e\u5e38', 0), (u'\u597d', 1), (u'\uff0c', 0), (u'\u6f14\u5458', 0), (u'\u4e5f', 0), (u'\u975e\u5e38', 0), (u'\u68d2', 0), (u'\uff0c', 0), (u'\u4f46\u662f', 0), (u'\u7535\u5f71', 0), (u'\u9662', 0), (u'\u73af\u5883', 0), (u'\u975e\u5e38', 0), (u'\u5dee', -1), (u'\u3002', 0)]
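
The text-level score of 0.0 in both examples is what you get when one positive and one negative word cancel out. The sketch below assumes the text polarity is the mean of the non-neutral word scores; this assumption is consistent with the outputs above but is not confirmed here, and `text_polarity` is a hypothetical helper:

```python
def text_polarity(word_polarities):
    """Average the polarity of non-neutral words; return 0.0 if there are none."""
    scores = [p for p in word_polarities if p != 0]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


# Word-level scores of the English example: 19 tokens,
# of which only "good" (+1) and "poor" (-1) are non-neutral.
english_scores = [0] * 4 + [1] + [0] * 12 + [-1] + [0]
print(text_polarity(english_scores))
```

Because scoring is purely lexical, intensifiers such as "very" and unknown words such as the misspelled "prefect" contribute nothing to the result.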

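Word Embeddings

In NLP, word embedding is a collective term for a family of language-modeling and feature-learning techniques that map words or phrases from a vocabulary to vectors of real numbers. Word representations fall broadly into two kinds: discrete and distributed. Discrete representations include one-hot and N-gram; their drawbacks are that they capture the relatedness between words poorly and suffer from the curse of dimensionality. The idea behind distributed representations is to represent a word by the words that appear near it, which is the familiar word2vec. word2vec comprises the Skip-Gram model, which predicts the $n$ surrounding words from the current word, and the CBOW model, which predicts a word from its $n$ context words. Pretrained English word vectors include GloVe, available in 50, 100, 200, and 300 dimensions; for Chinese, Tencent AI Lab recently open-sourced 200-dimensional word vectors. polyglot can read word vectors from the following sources: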

Gensim word2vec objects: (from_gensim method)
Word2vec binary/text models: (from_word2vec method)
GloVe models (from_glove method)
polyglot pickle files: (load method)

Of these, the polyglot pickle files provide word vectors for 136 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("embeddings2"))
  1. Scots    2. Sicilian     3. Welsh    ......    134. Occitan    135. Tajik    136. Piedmontese language
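
To make the difference between Skip-Gram and CBOW described earlier more concrete, the sketch below generates the training examples each model would see for a window of $n$ words on each side. It only builds the (input, target) pairs and does no training; the function names are illustrative:

```python
def skip_gram_pairs(tokens, window):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_pairs(tokens, window):
    """Yield (context_words, center) pairs: the neighbors predict the center word."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, center


tokens = ["日本", "最后", "一家", "服务"]
print(list(skip_gram_pairs(tokens, 1)))
print(list(cbow_pairs(tokens, 1)))
```

Skip-Gram produces one example per (center, neighbor) pair, while CBOW produces one example per position with all neighbors grouped as the input.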

Downloading models

Download the English and Chinese word vectors:

$ polyglot download embeddings2.zh embeddings2.en
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency and load the word vectors

>>> from polyglot.mapping import Embedding
>>> embeddings = Embedding.load('/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2')

Looking up a word vector

>>> print(embeddings.get("中国"))
[ 0.60831094  0.37644583 -0.67009342  0.43529209  0.12993187 -0.07703398
 -0.04931475 -0.42763838 -0.42447501 -0.0219319  -0.52271312 -0.57149178
 -0.48139745 -0.31942225  0.12747335  0.34054375  0.27137381  0.1362032
 -0.54999739 -0.39569679  1.01767457  0.12317979 -0.12878017 -0.65476489
  0.18644606  0.2178454   0.18150428  0.18464987  0.29027358  0.21979097
 -0.21173042  0.08130789 -0.77350897  0.66575652 -0.14730017  0.11383133
  0.83101833  0.01702038 -0.71277034  0.29339811  0.3320756   0.25922608
 -0.51986367  0.16533957  0.04327472  0.36460632  0.42984027  0.04811303
 -0.16718218 -0.18613082 -0.52108622 -0.47057685 -0.14663117 -0.30221295
  0.72923231 -0.54835045 -0.48428732  0.65475166 -0.34853089  0.03206051
  0.2574054   0.07614037  0.32844698 -0.0087136 ]
>>> print(len(embeddings.get("中国")))
64

Finding similar words

>>> neighbors = embeddings.nearest_neighbors("中国")
>>> print(" ".join(neighbors))
上海 美国 韩国 北京 欧洲 台湾 法国 德国 天津 广州
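
Conceptually, a similar-word query ranks every other word by its distance to the query vector. The sketch below assumes Euclidean distance as the measure, which the text does not confirm for polyglot, and runs on invented 2-dimensional toy vectors rather than the real 64-dimensional ones:

```python
import math


def nearest_neighbors(vectors, word, top_k=2):
    """Return the top_k words closest to `word` by Euclidean distance."""
    target = vectors[word]
    distances = [
        (math.dist(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Sort by distance and keep the closest top_k words.
    return [other for _, other in sorted(distances)[:top_k]]


# Toy vectors invented for illustration; they are not polyglot's embeddings.
toy_vectors = {
    "中国": (1.0, 1.0),
    "美国": (0.9, 1.1),
    "上海": (1.0, 0.9),
    "寻呼机": (-1.0, 0.2),
}
print(nearest_neighbors(toy_vectors, "中国"))
```

Cosine similarity is a common alternative to Euclidean distance; the two can rank neighbors differently when vector lengths vary.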

Transliteration

polyglot's transliteration uses an unsupervised method (see the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration") and supports 69 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("transliteration2"))
 1. Haitian; Haitian Creole    2. Tamil    3. Vietnamese    ......    67. Greek, Modern    68. Esperanto    69. Maltese

Downloading models

Download the English and Chinese transliteration models:

$ polyglot download transliteration2.zh transliteration2.en
[polyglot_data] Downloading package transliteration2.zh to
[polyglot_data] Downloading package transliteration2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import the dependency

>>> from polyglot.text import Text

Transliterating English into Chinese

>>> text = Text(text_en)
>>> print(text_en)
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.
>>> print("".join([t for t in text.transliterate("zh")]))
拉斯特帕格普罗维德尔哈斯安诺乌恩斯德伊特维尔恩德伊特斯塞尔维斯因塞普特艾伯布林吉恩格阿恩阿特伊奥纳尔恩德托特埃莱科姆穆尼卡特伊昂布熙佩尔斯年年耶阿尔斯阿夫特特海尔乌斯尔斯雷马因苏布斯克里贝德托托基奥特埃莱梅斯斯阿格埃惠克赫哈斯诺特马德特赫德耶夫伊斯斯因耶阿尔斯

The English-to-Chinese result shows that the quality is still fairly poor, so we do not go into further detail here.

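Pipelines

A pipeline runs several NLP tasks in sequence, with the output of one task serving as the input to the next. In entity recognition and relation extraction, for example, the pipeline approach first recognizes the entities and then identifies the relations between them. The alternative is the joint approach, which handles entity recognition and relation extraction together.

Example

Tokenize first, then count the words that occur at least 2 times.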

$ polyglot --lang en tokenize --input testdata/example.txt | polyglot count --min-count 2
in          10
the         6
.           6
-           5
,           4
of          3
and         3
by          3
South       2
5           2
2007        2
Bermuda     2
which       2
score       2
against     2
Mitchell    2
as          2
West        2
India       2
beat        2
Afghanistan 2
Indies      2
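
The same two-stage idea can be sketched in plain Python, with each stage consuming the output of the previous one. The regex tokenizer here is a simplistic stand-in for polyglot's tokenizer, and the function names and sample lines are illustrative:

```python
import re
from collections import Counter


def tokenize(lines):
    """Stage 1: yield word and punctuation tokens from each line."""
    for line in lines:
        # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
        yield from re.findall(r"\w+|[^\w\s]", line)


def count(tokens, min_count):
    """Stage 2: return (token, frequency) pairs with frequency >= min_count."""
    counts = Counter(tokens)
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]


lines = ["West Indies beat India.", "India beat Bermuda."]
for token, freq in count(tokenize(lines), min_count=2):
    print(token, freq)
```

Because `tokenize` is a generator, tokens flow into `count` one at a time, much like data streaming through a shell pipe.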