Repost: POS Tagging with CRF++


2016-02-28 Category: NLP

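Both the training and test data come from the 1998 People's Daily annotated corpus, split between training and test at a ratio of 10:1. Tagging parts of speech directly with CRF++ gives an accuracy of 0.933882. The model has more than ten million features, so training is slow: on a machine with 48 CPU cores, with the CRF++ thread count set to 40 via -p, training took about seven hours.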

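The corpus, the Python script that generates the training data, the training log, the model, and the accuracy script have all been uploaded to a network drive and can be downloaded directly ("CRF++ POS tagging" download link). The programs were run successfully on CentOS 6.5 with Python 2.7. On Windows or Ubuntu you may hit errors, usually minor issues such as encoding or path conventions, which should be easy to locate by stepping through the scripts line by line. Also make sure that CRF++ itself compiles correctly on your machine. Each step is described below.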


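Generating the training and test data

The script that generates the training and test data is get_post_train_test_data.py. It prints some debugging information as it runs.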

#coding=utf8

#home_dir = "D:/source/NLP/people_daily//"
home_dir = "./"

def saveDataFile(trainobj, testobj, isTest, word, handle):
    # Route the token to the test file or the training file.
    if isTest:
        saveTrainFile(testobj, word, handle)
    else:
        saveTrainFile(trainobj, word, handle)

def saveTrainFile(fiobj, word, handle):
    # Write "word<TAB>tag". An empty word, "。" or "，" ends the
    # current sequence with a blank line instead.
    if len(word) > 0 and word != "。" and word != "，":
        fiobj.write(word + '\t' + handle + '\n')
    else:
        fiobj.write('\n')

def convertTag():
    fiobj = open(home_dir + 'people-daily.txt', 'r')
    trainobj = open(home_dir + 'train.data', 'w')
    testobj = open(home_dir + 'test.data', 'w')

    i = 0
    for a in fiobj:  # iterate over the corpus file line by line
        i += 1
        a = a.strip('\r\n\t ')
        if a == "":
            continue
        words = a.split(" ")
        test = (i % 10 == 0)  # every tenth line goes to the test set
        for word in words[1:]:  # the first field is skipped (line ID in this corpus)
            print "---->", word
            word = word.strip('\t ')
            if len(word) > 0:
                # Strip compound-entity brackets: "[w1/t1 ... wn/tn]TAG"
                i1 = word.find('[')
                if i1 >= 0:
                    word = word[i1+1:]
                i2 = word.find(']')
                if i2 > 0:
                    word = word[:i2]
                print "----", word
                w, h = word.split('/')
                if h == 'nr':  # person name
                    if w.find('·') >= 0:
                        # Split names on the middle dot; each part keeps the nr tag.
                        for tmp in w.split('·'):
                            saveDataFile(trainobj, testobj, test, tmp, h)
                        continue
                saveDataFile(trainobj, testobj, test, w, h)
        # A blank line marks the end of the sentence.
        saveDataFile(trainobj, testobj, test, "", "")

    fiobj.close()
    trainobj.close()
    testobj.close()

if __name__ == '__main__':
    convertTag()
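To make the conversion step easier to follow, here is a minimal Python 3 sketch of what happens to a single corpus line. It is not the author's script: the function name `corpus_line_to_pairs` and the sample line are made up for illustration, and it assumes the corpus format described above (a leading ID field, `word/tag` tokens, and compound entities wrapped as `[w1/t1 w2/t2]TAG`). It leaves out the sentence-splitting on "。" and "，" and the splitting of person names.

```python
def corpus_line_to_pairs(line):
    """Convert one corpus line into a list of (word, tag) pairs.

    Assumption: the first field is a line ID and is skipped, and
    compound entities look like [w1/t1 w2/t2]TAG.
    """
    pairs = []
    for token in line.split()[1:]:   # skip the leading ID field
        token = token.lstrip('[')    # drop an opening compound bracket
        close = token.find(']')
        if close > 0:
            token = token[:close]    # drop the "]TAG" suffix of a compound
        if '/' not in token:
            continue                 # ignore anything that is not word/tag
        word, tag = token.rsplit('/', 1)
        pairs.append((word, tag))
    return pairs


# Illustrative line in the corpus format (not taken from the real data files).
sample = "19980101-01-001-002/m  [中央/n  人民/n  广播/vn  电台/n]nt  播出/v  。/w"
for word, tag in corpus_line_to_pairs(sample):
    print(word + "\t" + tag)
```

Each printed line has the two-column `word<TAB>tag` shape that CRF++ expects in train.data and test.data.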

Running training and testing

Set the feature template to:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
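Each `%x[row,col]` macro in the template refers to the token `row` positions away from the current one and to column `col` of the data file (column 0 is the word here). The following Python sketch shows how such a line could expand into a feature string. It only illustrates the idea and is not CRF++'s code: the function `expand_template` is hypothetical, and the `_B-1` / `_B+1` markers for positions beyond the sentence boundary are an assumption about how CRF++ names them.

```python
import re

# Matches macros such as %x[-1,0] and captures the row offset and the column.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand_template(template, rows, pos):
    """Expand one template line at token position `pos`.

    `rows` is a list of token rows, each a list of column values.
    """
    def substitute(match):
        offset, col = int(match.group(1)), int(match.group(2))
        idx = pos + offset
        if idx < 0:
            return "_B%d" % idx                      # assumed marker, e.g. _B-1
        if idx >= len(rows):
            return "_B+%d" % (idx - len(rows) + 1)   # assumed marker, e.g. _B+1
        return rows[idx][col]

    return MACRO.sub(substitute, template)


# Illustrative three-token sentence with a single column (the word).
rows = [["迈向"], ["新"], ["世纪"]]
print(expand_template("U05:%x[-1,0]/%x[0,0]", rows, 1))
```

With this template, every token contributes seven such strings (U00 to U06), each of which is paired with every output tag. That is one reason the feature count reaches the tens of millions on a corpus of this size.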

Set the -p parameter (the number of training threads) according to your own machine.

crf_learn -f 3 -p 4 -c 4.0 template train.data model > train.rst
crf_test -m model test.data > test.rst
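If you prefer to drive the two commands from Python, the sketch below builds the argument lists with the same flags as above: -f is the feature frequency cut-off, -p the thread count, and -c the cost hyperparameter. The helper `build_crf_commands` is hypothetical, and defaulting the thread count to the local CPU count is only one possible heuristic, not a recommendation from the original post.

```python
import os


def build_crf_commands(threads=None, freq=3, cost=4.0):
    """Return argv lists for crf_learn and crf_test.

    File names follow the article; `threads` falls back to the
    number of CPUs reported by the operating system.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    learn = ["crf_learn", "-f", str(freq), "-p", str(threads),
             "-c", str(cost), "template", "train.data", "model"]
    test = ["crf_test", "-m", "model", "test.data"]
    return learn, test


learn_cmd, test_cmd = build_crf_commands(threads=4)
print(" ".join(learn_cmd))
print(" ".join(test_cmd))
```

The lists can be passed to `subprocess.run(cmd, stdout=fh, check=True)` with `fh` opened on train.rst or test.rst, which reproduces the shell redirection.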

Computing accuracy

Run the Python script with the command python clc_f.py test.rst. The contents of clc_f.py are:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

if __name__ == "__main__":
    try:
        file = open(sys.argv[1], "r")
    except (IndexError, IOError):
        print "result file is not specified, or open failed!"
        sys.exit(1)

    wc = 0             # tokens seen in the result file
    wc_of_correct = 0  # tokens whose predicted tag equals the gold tag

    for l in file:
        if l.strip() == "":
            continue   # blank lines separate sentences
        # crf_test output columns: word, gold tag, predicted tag
        _, g, r = l.strip().split()
        wc += 1
        if r == g:
            wc_of_correct += 1
    file.close()

    print "WordCount from result:", wc
    print "WordCount of correct POS:", wc_of_correct

    # accuracy
    P = wc_of_correct / float(wc)
    print "Accuracy: %f" % (P)
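The script reports one overall number. As a possible extension, the Python 3 sketch below also breaks accuracy down by gold tag, which helps show which tags the model confuses. The function `tag_accuracy` and the demo lines are made up for illustration, and it assumes each non-blank line of the crf_test output ends with the gold tag followed by the predicted tag.

```python
from collections import Counter


def tag_accuracy(lines):
    """Return (overall accuracy, per-gold-tag accuracy).

    Assumption: every non-blank line is "word gold predicted",
    separated by whitespace.
    """
    total, correct = Counter(), Counter()
    for line in lines:
        cols = line.split()
        if len(cols) < 3:
            continue                 # blank line: sentence boundary
        gold, pred = cols[-2], cols[-1]
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    per_tag = {tag: correct[tag] / total[tag] for tag in total}
    return overall, per_tag


# Made-up result lines in the crf_test output format.
demo = ["迈向\tv\tv", "新\ta\ta", "世纪\tn\tt", "", "的\tu\tu"]
overall, per_tag = tag_accuracy(demo)
print("accuracy: %f" % overall)
for tag in sorted(per_tag):
    print("%s\t%f" % (tag, per_tag[tag]))
```

To use it on the real output, pass an open file handle, for example `tag_accuracy(open("test.rst", encoding="utf-8"))`.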

Experimental results

posted @ 2017-08-29 14:54 Django's blog
