BERT usage notes / KenLM pitfalls

Generating word vectors with BERT:

##### Run this script
export BERT_BASE_DIR=./chinese_L-12_H-768_A-12 ## model directory
export Data_Dir=./data

python bert-master/extract_features.py \
  --input_file=$Data_Dir/train_ch.txt \
  --output_file=$Data_Dir/output.json \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

The output file contains one JSON object per line, in this form (values truncated here):

 {"linex_index": 0, "features":[{"token": "[CLS]", "layers": [{"index": -1, "values":[-0.919886, 0.656876, -0.58464654]}]}]}

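Decoding code:

import json

src = ''  # path to the JSON-lines file written by extract_features.py
tgt = ''  # path of the text file to write the vectors to

def fun(file1, file2):
  with open(file1, 'r', encoding='utf-8') as fl1:
    with open(file2, 'w', encoding='utf-8') as fl2:
      k = 0
      for line in fl1:
        k += 1
        line = json.loads(line)
        temp = line.get('features')
        temp = temp[1]  # second token, i.e. the first token after [CLS]
        temp = temp.get('layers')
        temp = temp[1]  # second requested layer (-2 with --layers=-1,-2,-3,-4)
        temp = temp.get('values')
        fl2.write(str(temp) + '\n' + '\n')  ## blank line between vectors for readability
        if k % 1000 == 0:
          print('Done' + ' ' + str(k))

fun(src, tgt)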
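
The decoding code above keeps only one token's vector per line. If a single vector per sentence is wanted instead, one common option is to average the token vectors of one layer. The following is a minimal sketch under that assumption; the helper name `sentence_vector` and the two-dimensional sample values are made up for illustration, and mean pooling (including the [CLS] token) is an assumed choice, not something the original script does.

```python
import json

# Sample line in the same shape as the output file (values shortened for illustration).
sample = ('{"linex_index": 0, "features": ['
          '{"token": "[CLS]", "layers": [{"index": -1, "values": [1.0, 2.0]}]}, '
          '{"token": "我", "layers": [{"index": -1, "values": [3.0, 4.0]}]}]}')

def sentence_vector(json_line, layer_index=-1):
    """Average the chosen layer's vectors over all tokens in one output line."""
    record = json.loads(json_line)
    vectors = []
    for feature in record["features"]:
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                vectors.append(layer["values"])
    if not vectors:
        return []
    dim = len(vectors[0])
    # Component-wise mean over all collected token vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(sentence_vector(sample))
```

Selecting the layer by its "index" field rather than by list position keeps the result independent of the order given to --layers.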

 Installing KenLM

  First, install cmake:

sudo apt-get install cmake

  If running "cmake .." fails with a Boost error, install the Boost development libraries:

sudo apt-get install libboost-all-dev

  Train the KenLM model (-o 5 sets the n-gram order to 5):

bin/lmplz -o 5 <train.txt >out.arpa
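
lmplz treats each line of train.txt as one sentence and splits it into tokens on whitespace, so Chinese text has to be segmented before training. The sketch below assumes a character-level model, which matches how the scoring code later joins units with spaces; the function names and file paths are hypothetical, and a word segmenter could be swapped in instead.

```python
def to_char_tokens(line):
    """Turn a sentence into space-separated characters, dropping whitespace."""
    return " ".join(ch for ch in line if not ch.isspace())

def prepare_corpus(src_path, dst_path):
    """Write one character-tokenized sentence per line, skipping empty lines."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = to_char_tokens(line)
            if tokens:
                dst.write(tokens + "\n")

print(to_char_tokens("我爱 你\n"))
```

Whatever tokenization is used for training should also be used when scoring sentences, otherwise most tokens will be treated as unknown.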

  Convert the model to a binary file

bin/build_binary out.arpa out.bin

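   Computing the perplexity of a sentence

  First define the window size

#### Slide a window over the sequence, one position at a time.
#### Note: with start = -2 each window covers indices [end-2, end], i.e. at most 3 tokens;
#### start = -4 would give the 5-token window that matches the 5-gram model.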
def func_5(seq):
  start = -2
  end = 0
  temp = []
  while end < len(seq):
    sub = []
    sub.append(start)
    sub.append(end)
    start += 1
    end += 1
    temp.append(sub)
  return temp

def func_word(seq):
  lemp = []
  temp = func_5(seq)
  for unit in temp:
    sub = []
    start = unit[0]
    end = unit[1]
    for i in range(start,end+1):
      if i>=0 and i<len(seq):
        sub.append(seq[i])
    lemp.append(sub)
  return lemp
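
To make the windowing concrete, here is a compact sketch that should produce the same windows as func_word, with the window size exposed as a parameter. The name `windows` and the `size` parameter are illustrative additions, not part of the original code.

```python
def windows(seq, size=3):
    """Return, for each position, the window of up to `size` items ending there."""
    result = []
    for end in range(len(seq)):
        start = max(0, end - size + 1)  # clip the window at the sequence start
        result.append(list(seq[start:end + 1]))
    return result

for window in windows("abcd"):
    print(window)
```

The first windows are shorter than `size` because the negative start indices are clipped, which is the same effect as the `i>=0` check in func_word.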

  Then compute the perplexity

import kenlm

model = kenlm.LanguageModel('out.bin')

### Return the n-gram score (log10 probability) of each window
def func_1(seq):
  lemp = []
  temp = func_word(seq)
  for ww in temp:
    sc = ''
    for unit in ww:
      sc+=unit+' '
    num = model.score(sc)
    lemp.append(num)
  return lemp

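### Return a perplexity-style score: the negated mean of the window log10 scores
### (lower means the sequence is more likely under the model)
def func_2(seq):
  nn = len(seq)
  sum_num = 0
  lemp = func_1(seq)
  for ss in lemp:
    sum_num += ss
  num = (sum_num/nn)*(-1)
  return num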
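
The value returned by func_2 is a negated average of log10 scores rather than perplexity in the usual sense. KenLM scores are log10 probabilities, so the conventional perplexity of a sentence is 10 raised to the negated per-token average. The sketch below shows only that conversion; the helper name is hypothetical, and whether the end-of-sentence token is counted in `n_tokens` is left to the caller. The kenlm Python module also appears to provide a `perplexity` method on `LanguageModel`, which may be worth checking before writing your own.

```python
def perplexity_from_log10(total_log10_score, n_tokens):
    """Convert a summed log10 probability into perplexity."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    # Perplexity is the inverse probability, normalized per token.
    return 10.0 ** (-total_log10_score / n_tokens)

# A sentence of 2 tokens with a total log10 probability of -2.0.
print(perplexity_from_log10(-2.0, 2))
```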

 

  

 

posted @ 2019-11-22 16:01  胡~萝~卜