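spaCy on GitHub: https://github.com/explosion/spaCy
Homepage: https://spacy.io/
I. What is spaCy
spaCy's homepage describes it as an industrial-strength natural language processing library for Python, which is reason enough to get to know it. Its features include part-of-speech tagging, syntactic parsing, named entity recognition, word vectors, seamless integration with deep learning frameworks, and support for more than thirty languages.
II. Installation
This part covers installing the spaCy package and installing its models. spaCy provides different models for different languages, and each one has to be installed separately.
1. Installing spaCy
In most cases pip is all you need: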
pip install spacy
For detailed installation instructions, see: https://spacy.io/usage/
spaCy is cross-platform and supports Windows, Linux, and macOS.
2. Installing models
GitHub: https://github.com/explosion/spacy-models
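After running pip, it can be handy to confirm that the package really landed in the current environment. The snippet below is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is made up for this post and is not part of spaCy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if spaCy is installed, otherwise None.
print(installed_version("spacy"))
```

If this prints None, pip most likely installed into a different Python environment than the one you are running.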
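For English (the short name en is a shortcut used by spaCy 2.x; spaCy 3 deprecates such shortcuts in favor of full package names):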
python -m spacy download en
or
python -m spacy download en_core_web_lg
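You can also install a model from a local file or a URL. Both commands below work; if pip downloads slowly, download the archive first and use the first form.
pip install /path/to/your/directory/en_core_web_sm-2.0.0.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz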
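The download URL above repeats the model name and version twice, so it can be built from those two values. The sketch below only reproduces the pattern visible in that one URL; the helper name model_release_url is invented here, and whether every model and version follows the same pattern is an assumption, so check the spacy-models releases page before relying on it.

```python
def model_release_url(name, version):
    """Build a release URL following the pattern of the URL shown above.

    Hypothetical helper: the pattern is inferred from a single example
    (en_core_web_sm 2.0.0) and may not hold for other releases.
    """
    archive = "{}-{}".format(name, version)
    base = "https://github.com/explosion/spacy-models/releases/download"
    return "{}/{}/{}.tar.gz".format(base, archive, archive)


print(model_release_url("en_core_web_sm", "2.0.0"))
```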
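As an aside: cloning from GitHub can be painfully slow, sometimes only a few dozen KB/s. That is tolerable for small projects, but a larger one can take hours. A quick search on Baidu turned up a way to speed it up; follow the steps at the link at the end of this post and you should see the difference. I won't go into the details here.
III. An example
Load the model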
import spacy
nlp = spacy.load('en_core_web_sm')
or
import en_core_web_sm
nlp = en_core_web_sm.load()
1. Named entity recognition
text = (u"When Sebastian Thrun started working on"
u" self-driving cars at Google in 2007, "
u"few people outside of the company took"
u" him seriously. “I can tell you very "
u"senior CEOs of major American car companies"
u" would shake my hand and turn away because"
u" I wasn’t worth talking to,” said Thrun, "
u"now the co-founder and CEO of online higher"
u" education startup Udacity, in an interview"
u" with Recode earlier this week.")
doc = nlp(text)
print("########################################")
for entity in doc.ents:
    print("{}:{}".format(entity.text, entity.label_))
print("########################################")
Output:
########################################
Sebastian Thrun:PERSON
Google:ORG
2007:DATE
American:NORP
Thrun:PERSON
Recode:ORG
earlier this week:DATE
########################################
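Once the entities are extracted, a common next step is to collect them by label. The sketch below works on plain (text, label) pairs copied from the output above so that it runs without a model; with a real Doc you could build the same list as [(ent.text, ent.label_) for ent in doc.ents]. The function name group_by_label is just an illustrative choice.

```python
from collections import defaultdict

# (text, label) pairs copied from the output printed above.
entities = [
    ("Sebastian Thrun", "PERSON"),
    ("Google", "ORG"),
    ("2007", "DATE"),
    ("American", "NORP"),
    ("Thrun", "PERSON"),
    ("Recode", "ORG"),
    ("earlier this week", "DATE"),
]


def group_by_label(pairs):
    """Group entity texts by label, keeping the order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
for label, texts in by_label.items():
    print("{}: {}".format(label, ", ".join(texts)))
```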
The table below lists spaCy's entity labels and what they mean.
Label | Description
PERSON | People, including fictional.
NORP | Nationalities or religious or political groups.
FAC | Buildings, airports, highways, bridges, etc.
ORG | Companies, agencies, institutions, etc.
GPE | Countries, cities, states.
LOC | Non-GPE locations, mountain ranges, bodies of water.
PRODUCT | Objects, vehicles, foods, etc. (Not services.)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART | Titles of books, songs, etc.
LAW | Named documents made into laws.
LANGUAGE | Any named language.
DATE | Absolute or relative dates or periods.
TIME | Times smaller than a day.
PERCENT | Percentage, including "%".
MONEY | Monetary values, including unit.
QUANTITY | Measurements, as of weight or distance.
ORDINAL | "first", "second", etc.
CARDINAL | Numerals that do not fall under another type.
2. Text similarity
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(similarity)
# 0.713970251872
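As far as I understand from the spaCy documentation, the default doc.similarity is a cosine similarity between the averaged word vectors of the two documents. The sketch below shows that idea in plain Python. The 3-dimensional vectors are made-up toy values, not real spaCy vectors, and the function names are my own.

```python
import math


def average(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word vectors, one per token, for two short "documents".
doc1_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc2_vectors = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0]]

toy_similarity = cosine_similarity(average(doc1_vectors), average(doc2_vectors))
print(toy_similarity)
```

Because the score depends on the vectors that ship with the model, the same two sentences can give different numbers with different models or versions.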
That's all for today. Later posts will cover more of spaCy's features, so stay tuned.
How to speed up git clone: http://blog.51cto.com/11887934/2051323