Learning spaCy
sentencizer
Splits an article into sentences. Under the hood, spaCy sets the is_sent_start attribute of certain tokens to True; by rule, these tokens correspond to the beginnings of sentences.
import spacy

nlp = spacy.load('en_core_web_sm')  # load the pretrained model
txt = "some text read from one paper ..."
doc = nlp(txt)
for sent in doc.sents:
    print(sent)
    print('#' * 50)
some text read from one paper ...
##################################################
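The pipeline loaded above actually uses its dependency parser to set sentence boundaries. To run the rule-based sentencizer component on its own, add it to a blank pipeline; a minimal sketch using the spaCy v3 API (spaCy v2 spells it nlp.add_pipe(nlp.create_pipe('sentencizer'))):

import spacy

nlp_blank = spacy.blank('en')        # empty pipeline without a parser
nlp_blank.add_pipe('sentencizer')    # rule-based splitting on ., !, ? by default
doc = nlp_blank("First sentence. Second one!")
print([t.is_sent_start for t in doc])  # True marks the start of each sentence
print(list(doc.sents))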
Tokenization
Splits a sentence into tokens; English text is generally split on whitespace, with punctuation separated into its own tokens.
txt = "A magnetic monopole is a hypothetical elementary particle."
doc = nlp(txt)
tokens = [token for token in doc]
print(tokens)
[A, magnetic, monopole, is, a, hypothetical, elementary, particle, .]
Part-of-speech tagging
Part-of-speech tagging labels each word in a sentence as a noun, verb, adjective, and so on.
- The tags in this example are determiner (DET), adjective (ADJ), noun (NOUN), auxiliary (AUX), and punctuation (PUNCT).
pos = [token.pos_ for token in doc]
print(pos)
['DET', 'ADJ', 'NOUN', 'AUX', 'DET', 'ADJ', 'ADJ', 'NOUN', 'PUNCT']
Lemmatization
Lemmatization finds the base form of a word: am, is, are, have been all reduce to be; plurals reduce to singular (cats -> cat); past tense reduces to present (had -> have). Use token.lemma_ to extract it.
lem = [token.lemma_ for token in doc]
print(lem)
['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
Stop words
Flags stop words such as a, the, is, etc. Use token.is_stop.
stop_words = [token.is_stop for token in doc]
print(stop_words)
# As the output shows, the stop words in this example are 'A', 'is', and 'a'.
[True, False, False, True, True, False, False, False, False]
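The default stop-word list can also be customized through the vocabulary; a small sketch (treating 'monopole' as a stop word is purely for illustration):

# Mark an extra word as a stop word on its vocabulary entry.
nlp.vocab['monopole'].is_stop = True
print(nlp('a magnetic monopole')[2].is_stop)  # now True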
Dependency Parsing
Dependency parsing labels each word's grammatical role: subject, predicate, object, connective, and so on. Use token.dep_ to extract the label.
The dependency labels map to readable names as follows (see also the spacy.explain lookup after the example):
dep = [token.dep_ for token in doc]
print(dep)
# [determiner, adjectival modifier, nominal subject, root, determiner, adjectival modifier, adjectival modifier, attribute, punctuation]
['det', 'amod', 'nsubj', 'ROOT', 'det', 'amod', 'amod', 'attr', 'punct']
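Instead of memorizing the label inventory, you can look labels up programmatically; spacy.explain works for dependency labels, POS tags, and entity types alike:

# Print a human-readable gloss for each dependency label in the output above.
for label in ['det', 'amod', 'nsubj', 'attr', 'punct']:
    print(label, '->', spacy.explain(label))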
Noun Chunks
Extracts noun phrases; use doc.noun_chunks to get them.
noun_chunks = [nc for nc in doc.noun_chunks]
print(noun_chunks)
[A magnetic monopole, a hypothetical elementary particle]
Named Entity Recognition
Named entity recognition identifies person names, place names, organizations, dates, times, monetary amounts, events, products, and so on. Use doc.ents to get them.
txt = '''European authorities fined Google a record $5.1 billion
on Wednesday for abusing its power in the mobile phone market and
ordered the company to alter its practices'''
doc = nlp(txt)
ners = [(ent.text, ent.label_) for ent in doc.ents]
print(ners)
[('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
Full reference of named-entity label abbreviations:
TYPE | DESCRIPTION |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (not services). |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including "%". |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | "first", "second", etc. |
CARDINAL | Numerals that do not fall under another type. |
Coreference Resolution
Coreference resolution finds the entity that a pronoun such as he, she, or it refers to. This module relies on a pretrained neural coreference model; if it is not installed yet, run: pip install neuralcoref
txt = "My sister has a son and she loves him."
# 将预训练的神经网络指代消解加入到spacy的管道中
import neuralcoref
neuralcoref.add_to_pipe(nlp)
doc = nlp(txt)
doc._.coref_clusters
(On some setups, running this cell prints RuntimeWarnings such as "spacy.vocab.Vocab size changed, may indicate binary incompatibility". These usually mean the installed neuralcoref build was compiled against a different spaCy version; reinstalling matching versions of the two packages silences them.)
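Besides coref_clusters, neuralcoref registers a few more extension attributes on the Doc; the outputs in the comments are illustrative:

print(doc._.has_coref)       # True when at least one coreference cluster is found
print(doc._.coref_clusters)  # e.g. [My sister: [My sister, she], a son: [a son, him]]
print(doc._.coref_resolved)  # the text with each pronoun replaced by its cluster's main mention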
Display
Visualization. This feature gets its own section because it is just that cool. A few simple examples: the first one visualizes the dependency parse.
from spacy import displacy
import spacy
nlp = spacy.load('en_core_web_sm')  # load the pretrained model
txt = '''In particle physics, a magnetic monopole is a
hypothetical elementary particle.'''
displacy.render(nlp(txt), style='dep', jupyter=True, options={'distance': 90})  # style selects which visualization to draw
The second example visualizes the named entities.
from spacy import displacy
txt = '''European authorities fined Google a record $5.1 billion
on Wednesday for abusing its power in the mobile phone market and
ordered the company to alter its practices'''
doc = nlp(txt)
displacy.render(doc, style='ent', jupyter=True)
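Outside a Jupyter notebook, displacy.render can return the markup as a string instead of displaying it inline, so it can be written to a file (the filename here is my own choice):

# Render the dependency parse to SVG markup and save it to disk.
svg = displacy.render(doc, style='dep', jupyter=False)
with open('dep_parse.svg', 'w', encoding='utf-8') as fout:
    fout.write(svg)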
Knowledge Extraction
This part uses textacy, which is installed via pip (pip install textacy). The semistructured_statements() function in textacy.extract extracts all statements whose subject is the given entity (here, monopole) and whose cue verb has the lemma be. (Its exact signature has changed across textacy versions; the positional call below matches older releases, so check the docs for your version.)
First, copy the text of the Wikipedia article on magnetic monopoles into magneti_monopole.txt.
import textacy.extract

nlp = spacy.load('en_core_web_sm')
with open("magneti_monopole.txt", "r") as fin:
    txt = fin.read()
doc = nlp(txt)

statements = textacy.extract.semistructured_statements(doc, "monopole")
for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")
Matcher
spaCy ships a rule-matching engine, the Matcher, which operates over tokens much like a regular expression operates over characters; a minimal example follows the attribute table below.
Rule syntax
Token attributes available in match patterns:
attribute | description |
---|---|
ORTH | The exact verbatim text of the token. |
LOWER | The lowercase form of the token text. |
IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits. |
IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, title case. |
IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, a stop word. |
LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email address. |
POS, TAG, DEP, LEMMA, SHAPE | The token's coarse part of speech, fine-grained tag, dependency label, lemma, or shape. |
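A minimal sketch of the Matcher in action; the pattern and sentence are my own illustration, and matcher.add uses the spaCy v3 signature (spaCy v2 writes matcher.add('ADJ_PARTICLE', None, pattern)):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# One dict per token: an adjective followed by the word "particle" (case-insensitive).
pattern = [{'POS': 'ADJ'}, {'LOWER': 'particle'}]
matcher.add('ADJ_PARTICLE', [pattern])

doc = nlp("A magnetic monopole is a hypothetical elementary particle.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], '->', doc[start:end].text)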