Python自然语言处理学习笔记之信息提取步骤&分块（chunking）

一、信息提取模型　　

　　信息提取的步骤共分为五步，原始数据为未经处理的字符串，

第一步：分句，用nltk.sent_tokenize(text)实现,得到一个list of strings

第二步：分词，[nltk.word_tokenize(sent) for sent in sentences]实现，得到list of lists of strings

第三步：标记词性，[nltk.pos_tag(sent) for sent in sentences]实现得到一个list of lists of tuples

前三步可以定义在一个函数中：

>>> def ie_preprocess(document):
...    sentences = nltk.sent_tokenize(document) 
...    sentences = [nltk.word_tokenize(sent) for sent in sentences] 
...    sentences = [nltk.pos_tag(sent) for sent in sentences]

第四步：实体识别（entity detection）在这一步，既要识别已定义的实体（指那些约定成俗的习语和专有名词），也要识别未定义的实体，得到一个树的列表

第五步：关系识别（relation detection）寻找实体之间的关系，并用tuple标记，最后得到一个tuple列表

二、分块（chunking）

　　分块是第四步entity detection的基础，本文只介绍一种块noun phrase chunking即NP-chunking，这种块通常比完整的名词词组小，例如：the market for system-management software是一个名词词组，但是它会被分为两个NP-chunking——the market 和 system-management software。任何介词短语和从句都不会包含在NP-chunking中，因为它们内部总是会包含其他的名词词组。

　　从一个句子中提取分块需要用到正则表达式，先给出示例代码：

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), 
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
     
>>> print(cp.parse(sentence)) 
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

　　正则表达式的格式为"""块名：{<表达式>...<>}

{...}”""

如：

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""

　　大括号内为分块规则（chunking rule），可以有一个或多个，当rule不止一个时，RegexpParser会依次调用各个规则，并不断更新分块结果，直到所有的rule都被调用。nltk.RegexpParser(grammar)用于依照chunking rule创建一个chunk分析器，cp.parse()则在目标句子中运行分析器，最后的结果是一个树结构，我们可以用print打印它，或者用result.draw()将其画出。

　　在chunking rule中还用一种表达式chink，用于定义chunk中我们不想要的模式，这种表达式的格式为：‘ }表达式{ ’ 使用chink的结果一般有三种，一、chink定义的表达式和整个chunk都匹配，则将整个chunk删除；二、匹配的序列在chunk中间，则chunk分裂为两个小chunk；三、在chunk的边缘，则chunk会变小。使用方法如下：

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
     
>>> print(cp.parse(sentence))
 (S
   (NP the/DT little/JJ yellow/JJ dog/NN)
   barked/VBD
   at/IN
   (NP the/DT cat/NN))

posted on 2015-08-24 19:37 cs_暗流阅读(12478) 评论(0) 编辑收藏举报

刷新页面返回顶部

cs_暗流

Python自然语言处理学习笔记之信息提取步骤&分块（chunking）

导航

公告