Learning spaCy
sentencizer
Splits an article into sentences. Under the hood, spaCy sets the is_sent_start attribute of certain tokens to True; by rule, these tokens correspond to the beginnings of sentences.
import spacy

nlp = spacy.load('en_core_web_sm')  # load the pretrained model
txt = "some text read from one paper ..."
doc = nlp(txt)
for sent in doc.sents:
    print(sent)
    print('#' * 50)
some text read from one paper ...
##################################################
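The pipeline loaded above actually uses its dependency parser to set sentence boundaries. To run the rule-based sentencizer component on its own, add it to a blank pipeline; a minimal sketch using the spaCy v3 API (spaCy v2 spells it nlp.add_pipe(nlp.create_pipe('sentencizer'))):

import spacy

nlp_blank = spacy.blank('en')        # empty pipeline without a parser
nlp_blank.add_pipe('sentencizer')    # rule-based splitting on ., !, ? by default
doc = nlp_blank("First sentence. Second one!")
print([t.is_sent_start for t in doc])  # True marks the start of each sentence
print(list(doc.sents))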
Tokenization
Splits a sentence into tokens; English text is generally split on whitespace, with punctuation separated into its own tokens.
txt = "A magnetic monopole is a hypothetical elementary particle."
doc = nlp(txt)
tokens = [token for token in doc]
print(tokens)
[A, magnetic, monopole, is, a, hypothetical, elementary, particle, .]
Part-of-speech tagging
Part-of-speech tagging labels each word in a sentence as a noun, verb, adjective, and so on.
- The tags in this example are determiner (DET), adjective (ADJ), noun (NOUN), auxiliary (AUX), and punctuation (PUNCT).
pos = [token.pos_ for token in doc]
print(pos)
['DET', 'ADJ', 'NOUN', 'AUX', 'DET', 'ADJ', 'ADJ', 'NOUN', 'PUNCT']
Lemmatization
Lemmatization finds the base form of a word: am, is, are, have been all reduce to be; plurals reduce to singular (cats -> cat); past tense reduces to present (had -> have). Use token.lemma_ to extract it.
lem = [token.lemma_ for token in doc]
print(lem)
['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
Stop words
Flags stop words such as a, the, is, etc. Use token.is_stop.
stop_words = [token.is_stop for token in doc]
print(stop_words)
# As the output shows, the stop words in this example are 'A', 'is', and 'a'.
[True, False, False, True, True, False, False, False, False]
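The default stop-word list can also be customized through the vocabulary; a small sketch (treating 'monopole' as a stop word is purely for illustration):

# Mark an extra word as a stop word on its vocabulary entry.
nlp.vocab['monopole'].is_stop = True
print(nlp('a magnetic monopole')[2].is_stop)  # now True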
Dependency Parsing
Dependency parsing labels each word's grammatical role: subject, predicate, object, connective, and so on. Use token.dep_ to extract the label.
The dependency labels map to readable names as follows (see also the spacy.explain lookup after the example):
dep = [token.dep_ for token in doc]
print(dep)
# [determiner, adjectival modifier, nominal subject, root, determiner, adjectival modifier, adjectival modifier, attribute, punctuation]
['det', 'amod', 'nsubj', 'ROOT', 'det', 'amod', 'amod', 'attr', 'punct']
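Instead of memorizing the label inventory, you can look labels up programmatically; spacy.explain works for dependency labels, POS tags, and entity types alike:

# Print a human-readable gloss for each dependency label in the output above.
for label in ['det', 'amod', 'nsubj', 'attr', 'punct']:
    print(label, '->', spacy.explain(label))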
Noun Chunks
Extracts noun phrases; use doc.noun_chunks to get them.
noun_chunks = [nc for nc in doc.noun_chunks]
print(noun_chunks)
[A magnetic monopole, a hypothetical elementary particle]
Named Entity Recognition
Named entity recognition identifies person names, place names, organizations, dates, times, monetary amounts, events, products, and so on. Use doc.ents to get them.
txt = '''European authorities fined Google a record $5.1 billion
on Wednesday for abusing its power in the mobile phone market and
ordered the company to alter its practices'''
doc = nlp(txt)
ners = [(ent.text, ent.label_) for ent in doc.ents]
print(ners)
[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
Full reference of named-entity label abbreviations:
TYPE | DESCRIPTION |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (not services). |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including "%". |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | "first", "second", etc. |
CARDINAL | Numerals that do not fall under another type. |
Coreference Resolution
Coreference resolution finds the entity that a pronoun such as he, she, or it refers to. This module relies on a pretrained neural coreference model; if it is not installed yet, run: pip install neuralcoref
txt = "My sister has a son and she loves him."
# 将预训练的神经网络指代消解加入到spacy的管道中
import neuralcoref
neuralcoref.add_to_pipe(nlp)
doc = nlp(txt)
doc._.coref_clusters
(On some setups, running this cell prints RuntimeWarnings such as "spacy.vocab.Vocab size changed, may indicate binary incompatibility". These usually mean the installed neuralcoref build was compiled against a different spaCy version; reinstalling matching versions of the two packages silences them.)
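Besides coref_clusters, neuralcoref registers a few more extension attributes on the Doc; the outputs in the comments are illustrative:

print(doc._.has_coref)       # True when at least one coreference cluster is found
print(doc._.coref_clusters)  # e.g. [My sister: [My sister, she], a son: [a son, him]]
print(doc._.coref_resolved)  # the text with each pronoun replaced by its cluster's main mention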
Display
Visualization. This feature gets its own section because it is just that cool. A few simple examples: the first one visualizes the dependency parse.
from spacy import displacy
import spacy
nlp = spacy.load('en_core_web_sm')  # load the pretrained model
txt = '''In particle physics, a magnetic monopole is a
hypothetical elementary particle.'''
displacy.render(nlp(txt), style='dep', jupyter=True, options={'distance': 90})  # style selects which visualization to draw
The second example visualizes the named entities.
from spacy import displacy
txt = '''European authorities fined Google a record $5.1 billion
on Wednesday for abusing its power in the mobile phone market and
ordered the company to alter its practices'''
doc = nlp(txt)
displacy.render(doc, style='ent', jupyter=True)
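Outside a Jupyter notebook, displacy.render can return the markup as a string instead of displaying it inline, so it can be written to a file (the filename here is my own choice):

# Render the dependency parse to SVG markup and save it to disk.
svg = displacy.render(doc, style='dep', jupyter=False)
with open('dep_parse.svg', 'w', encoding='utf-8') as fout:
    fout.write(svg)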
Knowledge Extraction
This part uses textacy, which is installed via pip (pip install textacy). The semistructured_statements() function in textacy.extract extracts all statements whose subject is the given entity (here, monopole) and whose cue verb has the lemma be. (Its exact signature has changed across textacy versions; the positional call below matches older releases, so check the docs for your version.)
First, copy the text of the Wikipedia article on magnetic monopoles into magneti_monopole.txt.
import textacy.extract

nlp = spacy.load('en_core_web_sm')
with open("magneti_monopole.txt", "r") as fin:
    txt = fin.read()
doc = nlp(txt)

statements = textacy.extract.semistructured_statements(doc, "monopole")
for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")
Matcher
spaCy ships a rule-matching engine, the Matcher, which operates over tokens much like a regular expression operates over characters; a minimal example follows the attribute table below.
Rule syntax
Token attributes available in match patterns:
attribute | description |
---|---|
ORTH | The exact verbatim text of the token. |
LOWER | The lowercase form of the token text. |
IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits. |
IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, title case. |
IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, a stop word. |
LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email address. |
POS, TAG, DEP, LEMMA, SHAPE | The token's coarse part of speech, fine-grained tag, dependency label, lemma, or shape. |
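A minimal sketch of the Matcher in action; the pattern and sentence are my own illustration, and matcher.add uses the spaCy v3 signature (spaCy v2 writes matcher.add('ADJ_PARTICLE', None, pattern)):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# One dict per token: an adjective followed by the word "particle" (case-insensitive).
pattern = [{'POS': 'ADJ'}, {'LOWER': 'particle'}]
matcher.add('ADJ_PARTICLE', [pattern])

doc = nlp("A magnetic monopole is a hypothetical elementary particle.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], '->', doc[start:end].text)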