Natural Language Processing | Using spaCy for Natural Language Processing

August 22, 2018 · Machine Learning and Mathematics (机器学习和数学)

spaCy on GitHub: https://github.com/explosion/spaCy

Homepage: https://spacy.io/

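I. What is spaCy

On its homepage, spaCy describes itself as an industrial-strength natural language processing library for Python, which speaks to its strengths in this area and makes it well worth learning. Its features include part-of-speech tagging, syntactic parsing, named entity recognition, word vectors, seamless integration with deep learning, and support for more than thirty languages.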

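II. Installation

This section covers installing the spaCy package and its models. spaCy provides different models for different languages, and each one has to be installed separately.

1. Installing spaCy

It can usually be installed with pip: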

pip install spacy

For detailed installation instructions, see https://spacy.io/usage/

spaCy is cross-platform and supports Windows, Linux, and macOS.

2. Installing models

GitHub: https://github.com/explosion/spacy-models

For English:

python -m spacy download en

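Or, using a full model name (the `en` shortcut above is only available in spaCy 2.x; spaCy 3 requires the full name):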

python -m spacy download en_core_web_lg

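A model can also be installed from a local archive or directly from a URL; both commands below work. If pip downloads slowly, download the archive first and use the first command.

pip install /your/file/path/en_core_web_sm-2.0.0.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz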
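
spaCy models are installed as ordinary Python packages, which is why `import en_core_web_sm` works later in this article. One way to confirm that the installation succeeded is therefore to check that the packages are importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when a top-level package cannot be located.
    return importlib.util.find_spec(package_name) is not None


# Check the library itself and the small English model used in this article.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```

If a model shows up as missing, rerunning one of the download or pip commands above should make it available.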

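As an aside, cloning from GitHub is sometimes very slow, at only a few tens of KB/s, which I find hard to accept. Small projects are tolerable, but a somewhat larger one can take hours. A quick search turned up a way to speed it up. The link is given below, and a little configuration following its steps is enough to see the effect, so I won't go into detail here.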

III. An example

Load the model:

import spacy
nlp = spacy.load('en_core_web_sm')

Or:


import en_core_web_sm
nlp = en_core_web_sm.load()

1. Named entity recognition

text = (u"When Sebastian Thrun started working on"
u" self-driving cars at Google in 2007, "
u"few people outside of the company took"
u" him seriously. “I can tell you very "
u"senior CEOs of major American car companies"
u" would shake my hand and turn away because"
u" I wasn’t worth talking to,” said Thrun, "
u"now the co-founder and CEO of online higher"
u" education startup Udacity, in an interview"
u" with Recode earlier this week.")

doc = nlp(text)
print("########################################")

for entity in doc.ents:
    print("{}:{}".format(entity.text, entity.label_))
print("########################################")

########################################
Sebastian Thrun:PERSON
Google:ORG
2007:DATE
American:NORP
Thrun:PERSON
Recode:ORG
earlier this week:DATE
########################################
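
Once the entities are extracted, a common next step is to group them by label. The sketch below does this in plain Python so that it runs without a model; the (text, label) pairs are copied from the output above, and the helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above. With a loaded model they
# could be built as: [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("Sebastian Thrun", "PERSON"),
    ("Google", "ORG"),
    ("2007", "DATE"),
    ("American", "NORP"),
    ("Thrun", "PERSON"),
    ("Recode", "ORG"),
    ("earlier this week", "DATE"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print("{}: {}".format(label, ", ".join(texts)))
```

Grouping this way makes it easy to see, for example, that "Sebastian Thrun" and "Thrun" were both tagged as PERSON.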

The table below lists spaCy's entity labels and what each one means:

Label	Description
PERSON	People, including fictional.
NORP	Nationalities or religious or political groups.
FAC	Buildings, airports, highways, bridges, etc.
ORG	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOC	Non-GPE locations, mountain ranges, bodies of water.
PRODUCT	Objects, vehicles, foods, etc. (Not services.)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language.
DATE	Absolute or relative dates or periods.
TIME	Times smaller than a day.
PERCENT	Percentage, including "%".
MONEY	Monetary values, including unit.
QUANTITY	Measurements, as of weight or distance.
ORDINAL	"first", "second", etc.
CARDINAL	Numerals that do not fall under another type.

2. Text similarity


doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(similarity)

# 0.713970251872
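
spaCy's documentation describes the default `Doc.similarity` as the cosine similarity of the two documents' vectors, where a document vector defaults to the average of its token vectors. The sketch below reproduces that calculation in plain Python to show what the number means. The three-dimensional vectors are made up for illustration (real models use much larger vectors), and the helper names `average` and `cosine` are hypothetical.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up token vectors standing in for the words of two short documents.
doc1_token_vectors = [[0.2, 0.1, 0.9], [0.8, 0.3, 0.1], [0.5, 0.5, 0.4]]
doc2_token_vectors = [[0.3, 0.2, 0.8], [0.7, 0.4, 0.2]]

similarity = cosine(average(doc1_token_vectors), average(doc2_token_vectors))
print(similarity)
```

Because the result depends entirely on the word vectors, the choice of model matters. The small models (names ending in `sm`) do not ship with real word vectors, so a larger model such as en_core_web_lg is the safer choice for similarity.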

That's all for today. Later posts will cover more of spaCy's features, so stay tuned.

How to speed up git clone: http://blog.51cto.com/11887934/2051323
