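[Natural Language Processing with Python] 7.4 Recursion in Linguistic Structure
Building Nested Structure with Cascaded Chunkers
For example, the grammar below contains patterns for noun phrases, prepositional phrases, verb phrases, and sentences, which makes it a four-stage chunk grammar.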
>>> import nltk
>>> grammar = r"""
...   NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
...   PP: {<IN><NP>}               # Chunk prepositions followed by NP
...   VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
...   CLAUSE: {<NP><VP>}           # Chunk NP, VP
...   """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
...             ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
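This result has a flaw, however: the VP headed by "saw" is not recognized. With a more deeply nested sentence the chunker does even worse. We can address this by setting the number of loops, so that after the parser has tried all of its patterns it runs through them again.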
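>>> # The more deeply nested sentence whose parse is shown below
>>> sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
...             ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
...             ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> cp = nltk.RegexpParser(grammar, loop=2)
>>> print(cp.parse(sentence))
(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))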
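The cascading approach has its limits: a cascade is hard to create and debug, and it can only produce trees of fixed depth, which is not enough for complete syntactic analysis.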
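To make the mechanics of the cascade and of the loop setting more concrete, here is a minimal pure-Python sketch. It is not how NLTK implements RegexpParser; it is a simplified stand-in under stated assumptions: tokens are (word, tag) pairs, chunks are (label, [children]) pairs, and each stage is an ordinary regular expression over a string of <TAG> items. The names STAGES, tag_of, apply_stage, cascade, and show are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical stages mirroring the four-stage grammar: (label, regex).
# Each regex runs over a string such as "<NP><VBD><NP><VP>".
STAGES = [
    ("NP", r"(?:<(?:DT|JJ|NN[^>]*)>)+"),
    ("PP", r"<IN><NP>"),
    ("VP", r"<VB[^>]*>(?:<(?:NP|PP|CLAUSE)>)+$"),
    ("CLAUSE", r"<NP><VP>"),
]

def tag_of(item):
    """Tokens are (word, tag); chunks are (label, [children])."""
    return item[0] if isinstance(item[1], list) else item[1]

def apply_stage(items, label, pattern):
    """Wrap every non-overlapping match of `pattern` in a new chunk."""
    tags = ["<%s>" % tag_of(it) for it in items]
    # Map character offsets in the joined tag string back to item indices.
    index_at, pos = {}, 0
    for i, t in enumerate(tags):
        index_at[pos] = i
        pos += len(t)
    index_at[pos] = len(items)  # end-of-sequence sentinel
    out, cursor = [], 0
    for m in re.finditer(pattern, "".join(tags)):
        start, end = index_at[m.start()], index_at[m.end()]
        out.extend(items[cursor:start])
        out.append((label, items[start:end]))
        cursor = end
    out.extend(items[cursor:])
    return out

def cascade(tokens, loop=1):
    """Run all stages in order, `loop` times, and wrap the result in S."""
    items = list(tokens)
    for _ in range(loop):
        for label, pattern in STAGES:
            items = apply_stage(items, label, pattern)
    return ("S", items)

def show(item):
    """Render a token or chunk in bracketed form."""
    if isinstance(item[1], list):
        return "(%s %s)" % (item[0], " ".join(show(c) for c in item[1]))
    return "%s/%s" % item

mary = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
        ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
john = [("John", "NNP"), ("thinks", "VBZ")] + mary

print(show(cascade(mary)))          # one pass: "saw" stays outside any VP
print(show(cascade(john, loop=2)))  # second pass builds the VP headed by "saw"
```

In the first pass the VP stage runs before any CLAUSE exists, so "saw" has nothing to attach to. Only on the second pass does the VP stage see a trailing CLAUSE and wrap it. Each extra level of nesting needs another pass, which is why a cascade with a fixed loop count yields trees of bounded depth.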
Trees
Trees should already be familiar to most readers, so I won't spend time on the definition.
In NLTK we can create trees as well.
>>> tree1 = nltk.Tree('NP', ['Alice'])
>>> print(tree1)
(NP Alice)
>>> tree2 = nltk.Tree('NP', ['the', 'rabbit'])
>>> print(tree2)
(NP the rabbit)
>>> # Trees can also be combined into larger trees
>>> tree3 = nltk.Tree('VP', ['chased', tree2])
>>> tree4 = nltk.Tree('S', [tree1, tree3])
>>> print(tree4)
(S (NP Alice) (VP chased (NP the rabbit)))
>>> # Some of the methods available on trees
>>> print(tree4[1])
(VP chased (NP the rabbit))
>>> tree4[1].label()  # NLTK 3; older releases exposed this as the .node attribute
'VP'
>>> tree4.leaves()
['Alice', 'chased', 'the', 'rabbit']
>>> tree4[1][1][1]
'rabbit'
>>> # Some trees are hard to read as text, so we can draw them instead
>>> tree3.draw()
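Index expressions such as tree4[1][1][1] work because an NLTK tree behaves like a list of its children with a label attached. The sketch below illustrates that idea with a hypothetical MiniTree class. It is a simplified stand-in for illustration, not NLTK's own code, and it covers only the label, indexing, leaves, and bracketed printing shown above.

```python
class MiniTree(list):
    """Hypothetical stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

    def leaves(self):
        """Collect the words at the bottom of the tree, left to right."""
        words = []
        for child in self:
            if isinstance(child, MiniTree):
                words.extend(child.leaves())  # recurse into subtrees
            else:
                words.append(child)           # a plain string is a leaf
        return words

    def __str__(self):
        return "(%s %s)" % (self._label, " ".join(str(c) for c in self))

np1 = MiniTree('NP', ['Alice'])
np2 = MiniTree('NP', ['the', 'rabbit'])
vp = MiniTree('VP', ['chased', np2])
s = MiniTree('S', [np1, vp])

print(s)            # bracketed form of the whole tree
print(s[1].label()) # label of the second child
print(s[1][1][1])   # second child -> its second child -> its second word
print(s.leaves())
```

Because the class subclasses list, indexing, iteration, and len() come for free. Each index in a chain like s[1][1][1] simply steps one level further down the tree.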
Tree Traversal
def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t has no label, so it is a leaf (a plain string)
        print(t, end=" ")
    else:
        # Now we know that t.label() is defined, so t is a tree
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

>>> t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')
>>> traverse(t)
( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )
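The same recursive pattern can return values instead of printing them. The sketch below assumes a plain nested representation, with (label, [children]) pairs for subtrees and strings for leaves, so that it runs without NLTK. The function names bracket and subtrees are hypothetical helpers for this illustration.

```python
def bracket(t):
    """Return the traversal's bracketing as a string instead of printing it."""
    if isinstance(t, str):  # leaf: a plain word
        return t
    label, children = t
    return "( " + label + " " + " ".join(bracket(c) for c in children) + " )"

def subtrees(t, wanted):
    """Yield every subtree labelled `wanted`, in depth-first order."""
    if isinstance(t, str):
        return
    label, children = t
    if label == wanted:
        yield t
    for child in children:
        yield from subtrees(child, wanted)

tree = ("S", [("NP", ["Alice"]),
              ("VP", ["chased", ("NP", ["the", "rabbit"])])])

print(bracket(tree))
for np in subtrees(tree, "NP"):
    print(bracket(np))
```

Here an explicit isinstance check replaces the try/except test for a leaf. Both approaches separate leaves from subtrees. The exception-based version works with any object that provides a label, while the explicit check is easier to read when leaves are known to be strings.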