BERT Usage Notes / KenLM Pitfalls
Generating word vectors with BERT:
##### Run this script
export BERT_BASE_DIR=./chinese_L-12_H-768_A-12  ## path to the pretrained model
export Data_Dir=./data
python bert-master/extract_features.py \
--input_file=$Data_Dir/train_ch.txt \
--output_file=$Data_Dir/output.json \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
The output file contains one JSON object per input line, in this form (vector values abridged here):
{"linex_index": 0, "features":[{"token": "[CLS]", "layers": [{"index": -1, "values":[-0.919886, 0.656876, -0.58464654]}]}]}
Code to parse the output:
import json

src = ''  # path to the output.json produced above
tgt = ''  # path for the extracted vectors

def fun(file1, file2):
    with open(file1, 'r', encoding='utf-8') as fl1:
        with open(file2, 'w', encoding='utf-8') as fl2:
            k = 0
            for line in fl1:
                k += 1
                line = json.loads(line)
                temp = line.get('features')
                temp = temp[1]             # token at position 1, i.e. the first token after [CLS]
                temp = temp.get('layers')
                temp = temp[1]             # layer at position 1, i.e. layer -2 given --layers=-1,-2,-3,-4
                temp = temp.get('values')  # the vector itself
                fl2.write(str(temp) + '\n' + '\n')  ## blank line between vectors, for readability
                if k % 1000 == 0:
                    print('Done ' + str(k))

fun(src, tgt)
Installing KenLM
First, install cmake:
sudo apt-get install cmake
If Boost errors appear when running "cmake ..", run the line below:
sudo apt-get install libboost-all-dev
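For context, "cmake .." belongs to KenLM's standard out-of-source build; assuming the usual kpu/kenlm repository, the full sequence looks like this:

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake ..   ## this is where the Boost error shows up if the dev packages are missing
make -j4   ## produces bin/lmplz, bin/build_binary, etc.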
Training a KenLM model (-o 5 sets the n-gram order to 5):
bin/lmplz -o 5 <train.txt >out.arpa
Convert the model to a binary file (loads much faster than the text ARPA format):
bin/build_binary out.arpa out.bin
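As a quick sanity check that the binary loads, a minimal sketch using the kenlm Python module (the scoring code below relies on it anyway):

import kenlm
model = kenlm.Model('out.bin')  ## kenlm.LanguageModel is an equivalent alias
print(model.order)              ## should print 5 for the model trained above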
Computing sentence perplexity
First define the window:
#### build windows of size 5 and slide them over the sequence
def func_5(seq):
    ## generate (start, end) index pairs for a sliding window; start trails end
    ## by 4, so each window covers at most 5 tokens, matching the order-5 model
    start = -4
    end = 0
    temp = []
    while end < len(seq):
        temp.append([start, end])
        start += 1
        end += 1
    return temp
def func_word(seq):
    ## turn the index pairs into token windows, clipping indices
    ## that fall outside the sequence
    lemp = []
    temp = func_5(seq)
    for unit in temp:
        sub = []
        start, end = unit
        for i in range(start, end + 1):
            if 0 <= i < len(seq):
                sub.append(seq[i])
        lemp.append(sub)
    return lemp
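To make the two helpers concrete, here is what they return for a hypothetical four-character input (character-level tokenization assumed):

tokens = ['我', '喜', '欢', '你']
print(func_5(tokens))    ## [[-4, 0], [-3, 1], [-2, 2], [-1, 3]]
print(func_word(tokens)) ## [['我'], ['我', '喜'], ['我', '喜', '欢'], ['我', '喜', '欢', '你']]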
Then compute the perplexity:
import kenlm

model = kenlm.LanguageModel('out.bin')

### return the per-window ngram scores
def func_1(seq):
    lemp = []
    temp = func_word(seq)
    for ww in temp:
        sc = ' '.join(ww)
        ## score() returns a log10 probability; bos/eos are turned off because
        ## each window is a sentence fragment, not a complete sentence
        num = model.score(sc, bos=False, eos=False)
        lemp.append(num)
    return lemp
### return the perplexity score
def func_2(seq):
    nn = len(seq)
    sum_num = 0
    lemp = func_1(seq)
    for ss in lemp:
        sum_num += ss
    num = (sum_num / nn) * (-1)  ## average negative log10 probability per token
    return num                   ## the conventional perplexity would be 10 ** num
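Usage sketch (hypothetical sentence; tokens must be separated the same way as in train.txt). Note that kenlm also exposes model.perplexity() for scoring a whole space-separated sentence directly:

sentence = '我 喜 欢 你'              ## hypothetical example
tokens = sentence.split()
print(func_2(tokens))                ## window-based score: higher means less fluent
print(model.perplexity(sentence))    ## KenLM's built-in whole-sentence perplexity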