Moses training and testing
References: http://cache.baiducontent.com/c?m=9d78d513d9991cf00ffa940f47408f711925df252bd6a0502294ca5f92140d1a0771e3ca7c6251428d9a6b6770f4091dacae6965367337b7eddf893a82e8d36e78c83034015dd70149915feedc46549167cb04bfb81897adf04484afa28d804352ba44050d97f1fb1b5a03ca1ee71447f4a7e913025f61eafa3115e859003e9e5301e650f890256e7096f7ad0d10d42aa17611e1b834c07805b562b31f6c3003e012be52176072f74e54e2597841d7fc5d902d791c7df45fb3ce90eaf616df80bf76cbaf9cb82fe33fbb93bda72a1e2545fa53f8f6e0ec643f0315d9bc85568574e2a5fbba3ab24896560fe40325693093378382f904ae344df4912ebe7271783f0aa9ef29b92e2c3a2c&p=8562c54ad5c34bf543f6d52d02148e&newp=9f34c54ad5c34beb2ab1c02d021496231610db2151d4d4103ba6cf1c&user=baidu&fm=sc&query=/home/xdj/mtworkdir/irstlm/irstlm-master/scripts/build-lm.sh+-i+b.sb.cn+-t+./tmp+-p+-s+improved-knes&qid=b51a28c7000049a6&p1=1
http://blog.csdn.net/han_xiaoyang/article/details/10109053
http://www.leexiang.com/how-to-run-moses
http://wenku.baidu.com/link?url=QvfbyTEEdOIrtvnxuh4NZLA8UqMq4stOiq6TUafNNmyC4qBChQJ3CVHL4_23c-GI4tX9wlC85aSfLa1dxHNNTP1DPaLdgzQSXY-mTSU5n3q
First, prepare the test text.
In a terminal, in the directory ~/mtworkdir/mosesdecoder/lixiang1:
/home/xdj/mtworkdir/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < b.en > b.tok.en
/home/xdj/mtworkdir/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < b.cn > b.tok.cn
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/train-truecaser.perl --corpus b.tok.en --model b.model.en
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/train-truecaser.perl --corpus b.tok.cn --model b.model.cn
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/truecase.perl --model b.model.en < b.tok.en > b.true.en
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/truecase.perl --model b.model.cn < b.tok.cn > b.true.cn
/home/xdj/mtworkdir/mosesdecoder/scripts/training/clean-corpus-n.perl b.true cn en b.clean 1 80
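The 80 is the maximum number of tokens allowed per sentence (the 1 before it is the minimum). For this data set, 30 is enough.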
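To make the effect of the two trailing numbers concrete, here is a rough sketch of the kind of length filter that clean-corpus-n.perl applies: a sentence pair is kept only when both sides have between MIN and MAX whitespace-separated tokens. The helper name filter_pairs and the output file names in the usage comment are hypothetical, the sketch assumes the text contains no tab characters, and the real script performs further checks (such as a length-ratio limit) that are not reproduced here.

```shell
# Sketch only: keep a sentence pair when both sides have MIN..MAX tokens.
# Usage: filter_pairs SRC TGT OUT_SRC OUT_TGT MIN MAX   (hypothetical helper)
filter_pairs() {
    # Start from empty output files so a run with no survivors is well defined.
    : > "$3"
    : > "$4"
    # Join the two files line by line with a tab, count the tokens on each
    # side, and write surviving pairs back to two parallel files.
    paste "$1" "$2" | awk -F'\t' -v min="$5" -v max="$6" -v o1="$3" -v o2="$4" '
        {
            n1 = split($1, a, " ")   # " " means: split on runs of whitespace
            n2 = split($2, b, " ")
            if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) {
                print $1 > o1
                print $2 > o2
            }
        }'
}

# Example call (output names are illustrative):
# filter_pairs b.true.cn b.true.en b.len.cn b.len.en 1 30
```

Because both sides are filtered together, the two output files stay aligned line by line, which is what the later training step relies on.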
/home/xdj/mtworkdir/irstlm/irstlm-master/scripts/add-start-end.sh < b.clean.cn > b.sb.cn
/home/xdj/mtworkdir/irstlm/irstlm-master/scripts/add-start-end.sh < b.clean.en > b.sb.en
Running this reports the following error:
Set irstlm
The IRSTLM installation path has to be declared first:
export IRSTLM=/home/xdj/mtworkdir/irstlm
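Since the only symptom of a missing variable is the terse message above, a small guard can make the failure easier to read. The following is a minimal sketch; require_irstlm is a hypothetical helper name and is not part of IRSTLM.

```shell
# Sketch: fail early with a clear message when IRSTLM is not usable.
# require_irstlm is a hypothetical helper, not something IRSTLM ships.
require_irstlm() {
    if [ -z "${IRSTLM:-}" ]; then
        echo "IRSTLM is not set; export it before calling build-lm.sh" >&2
        return 1
    fi
    if [ ! -d "$IRSTLM" ]; then
        echo "IRSTLM points to a missing directory: $IRSTLM" >&2
        return 1
    fi
    return 0
}

# Example: only continue when the check passes.
# require_irstlm && echo "IRSTLM looks usable: $IRSTLM"
```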
/home/xdj/mtworkdir/irstlm/irstlm-master/scripts/build-lm.sh -i b.sb.cn -t ./tmp -p -s improved-kneser-ney -o b.lm.cn
/home/xdj/mtworkdir/irstlm/irstlm-master/scripts/build-lm.sh -i b.sb.en -t ./tmp -p -s improved-kneser-ney -o b.lm.en
/home/xdj/mtworkdir/irstlm/bin/compile-lm --text=yes b.lm.cn.gz b.arpa.cn
/home/xdj/mtworkdir/irstlm/bin/compile-lm --text=yes b.lm.en.gz b.arpa.en
Key reference: https://github.com/irstlm-team/irstlm/issues/2
/home/xdj/mtworkdir/mosesdecoder/bin/build_binary b.arpa.cn b.blm.cn
/home/xdj/mtworkdir/mosesdecoder/bin/build_binary b.arpa.en b.blm.en
Test the trained language model:
echo "我 果断 放弃 了 那幅 图 。" | /home/xdj/mtworkdir/mosesdecoder/bin/query b.blm.en
nohup nice /home/xdj/mtworkdir/mosesdecoder/scripts/training/train-model.perl -cores 1 -parallel -root-dir train -corpus /home/xdj/mtworkdir/mosesdecoder/lixiang1/b.clean -f cn -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/xdj/mtworkdir/mosesdecoder/lixiang1/b.blm.en:8 -external-bin-dir /home/xdj/mtworkdir/giza-pp/GIZA++-v2 >& training.out &
cd mtworkdir/mosesdecoder/lixiang1/working
nohup nice /home/xdj/mtworkdir/mosesdecoder/scripts/training/train-model.perl -cores 1 -root-dir train -corpus /home/xdj/mtworkdir/mosesdecoder/lixiang1/b.clean -f cn -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/xdj/mtworkdir/mosesdecoder/lixiang1/b.blm.en:8 -external-bin-dir /home/xdj/mtworkdir/giza-pp/GIZA++-v2/giza >& training.out &
nohup nice /home/yaoqiang/moses/moses_binary/scripts/training/train-model.perl -cores 8 -root-dir train \
-corpus /data/train_500m_data/all_movie_data_20130422.clean -f zh -e en \
-alignment grow-diag-final-and \
-reordering msd-bidirectional-fe \
-lm 0:3:/data/train_500m_data/all_movie_data_20130422.blm.en:8 \
-external-bin-dir /home/yaoqiang/moses/moses_binary/training-tools/giza >& training_log.out &
nohup nice /home/xdj/mtworkdir/mosesdecoder/scripts/training/train-model.perl \
-scripts-root-dir /home/user/moses/scripts/target/scripts-20100105-1600 \
-root-dir /home/xdj/mtworkdir/mosesdecoder/lixiang1/working \
-corpus /home/xdj/mtworkdir/mosesdecoder/lixiang1/working/train -e eng -f chn \
-max-phrase-length 10 \
-alignment grow-diag-final-and \
-reordering msd-bidirectional-fe \
-lm 0:5:/home/xdj/mtworkdir/mosesdecoder/lixiang1/working/train.chn.gz
nohup nice /home/yaoqiang/moses/moses_binary/scripts/training/train-model.perl -cores 8 -root-dir train -corpus /data/train_500m_data/b.clean -f zh -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/data/train_500m_data/all_movie_data_20130422.blm.en:8 -external-bin-dir /home/yaoqiang/moses/moses_binary/training-tools/giza >& training_log.out &
Here the -cores 8 option puts all 8 CPUs of that server to use.
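Rather than hard-coding the number, the value passed to -cores could be read from the machine. This is a minimal sketch, assuming a system where getconf reports _NPROCESSORS_ONLN (Linux does). The names detect_cores and CORES are illustrative only.

```shell
# Sketch: detect the number of online CPUs, falling back to 1.
detect_cores() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer: be conservative
    esac
    echo "$n"
}

CORES=$(detect_cores)
echo "would pass: -cores $CORES"
```

On a shared server it may be kinder to pass a smaller number than the detected one, since training will otherwise occupy every CPU.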
nohup nice /home/xdj/mtworkdir/mosesdecoder/scripts/training/train-model.perl -cores 1 -root-dir train --corpus /home/xdj/mtworkdir/mosesdecoder/lixiang/b.clean -f cn -e en --alignment grow-diag-final-and -reordering msd-bidirectional-fe --lm 0:3:/home/xdj/mtworkdir/mosesdecoder/lixiang/b.blm.en:8 -external-bin-dir /home/xdj/mtworkdir/external-nal >& training.out &
nohup nice /home/xdj/mtworkdir/mosesdecoder/scripts/training/train-model.perl -cores 1 -root-dir train --corpus /home/xdj/mtworkdir/mosesdecoder/lixiang1/b.clean -f cn -e en --alignment grow-diag-final-and -reordering msd-bidirectional-fe --lm 0:3:/home/xdj/mtworkdir/mosesdecoder/lixiang1/b.blm.en:8 -external-bin-dir /home/xdj/mtworkdir/external-nal &> training.out &
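The training command is long and easy to mistype, so one option is to assemble it from variables and inspect it before running anything. The sketch below only prints the arguments. The variable names and the helper build_train_cmd are hypothetical, and GIZA_DIR in particular is an assumption that must point at wherever the GIZA++ binaries actually live on the machine.

```shell
# Sketch: assemble the train-model.perl call from variables (dry run only).
# Paths are the ones used in these notes; GIZA_DIR is an assumed location.
MOSES=/home/xdj/mtworkdir/mosesdecoder
WORK=$MOSES/lixiang1
GIZA_DIR=/home/xdj/mtworkdir/giza-pp/GIZA++-v2

build_train_cmd() {
    # Prints one argument per line; nothing is executed here.
    printf '%s\n' \
        "$MOSES/scripts/training/train-model.perl" \
        -cores 1 -root-dir train \
        -corpus "$WORK/b.clean" -f cn -e en \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        -lm "0:3:$WORK/b.blm.en:8" \
        -external-bin-dir "$GIZA_DIR"
}

# Show the command on a single line for review.
build_train_cmd | tr '\n' ' '
echo
```

Once the printed line looks right, it can be pasted after nohup nice with the same output redirection as in the commands above.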
echo "我 果断 放弃 了 那幅 图 。" | /home/xdj/mtworkdir/mosesdecoder/bin/moses -f /home/xdj/mtworkdir/mosesdecoder/lixiang1/working/train/model/moses.ini >out
This failed with the error: lm/read_arpa.cc:151 in void lm::PositiveProbWarn::Warn(float) threw FormatLoadException'. Positive log probability 2.40965e-07 in the model. This is a bug in IRSTLM; you can set config.positive_log_probability = SILENT or pass -i to build_binary to substitute 0.0 for the log probability. Error in the 3-gram at byte 195895800 Byte: 195895800 File: 2000.arpa.ar
Fix: rebuild the binary language model with -i:
/home/xdj/mtworkdir/mosesdecoder/bin/build_binary -i b.arpa.en b.blm.en
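Before rebuilding with -i, it can be useful to see how many entries are affected. The sketch below counts lines of a plain-text ARPA file whose first field is a positive number. It assumes the usual ARPA layout in which each n-gram line starts with its log probability, and count_positive_logprob is a hypothetical helper name.

```shell
# Sketch: count n-gram lines whose leading log probability is positive.
# Assumes an uncompressed ARPA file; header lines such as "\data\" or
# "ngram 1=..." do not start with a number and are therefore skipped.
count_positive_logprob() {
    awk '$1 ~ /^[+-]?[0-9.]/ && ($1 + 0) > 0 { n++ } END { print n + 0 }' "$1"
}

# Example call:
# count_positive_logprob b.arpa.en
```

A result of 0 would suggest that the error comes from a different file than the one being checked.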
Tuning:
/home/xdj/mtworkdir/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < btune.en > btune.tok.en
/home/xdj/mtworkdir/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < btune.cn > btune.tok.cn
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/train-truecaser.perl --corpus btune.tok.en --model btune.model.en
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/train-truecaser.perl --corpus btune.tok.cn --model btune.model.cn
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/truecase.perl --model btune.model.en < btune.tok.en > btune.true.en
/home/xdj/mtworkdir/mosesdecoder/scripts/recaser/truecase.perl --model btune.model.cn < btune.tok.cn > btune.true.cn
nohup nice /home/xdj/mtworkdir/mosesdecoder/scripts/training/mert-moses.pl btune.true.cn btune.true.en /home/xdj/mtworkdir/mosesdecoder/lixiang1/working/train/model/moses.ini --mertdir /home/xdj/mtworkdir/mosesdecoder/bin/ &>mert.out&
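Tuning compares each source sentence with the reference on the same line, so the two tuning files need the same number of lines. A quick check before starting the long-running job might look like the following sketch. Both helper names are hypothetical.

```shell
# Sketch: verify that source and reference tuning files are line-aligned.
line_count() {
    awk 'END { print NR }' "$1"
}

same_line_count() {
    # Return 2 when either file cannot be read, 1 on a mismatch, 0 on a match.
    [ -r "$1" ] && [ -r "$2" ] || return 2
    [ "$(line_count "$1")" -eq "$(line_count "$2")" ]
}

# Example call:
# same_line_count btune.true.cn btune.true.en && echo "tuning files are aligned"
```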
Run:
/home/xdj/mtworkdir/mosesdecoder/bin/moses -f /home/xdj/mtworkdir/mosesdecoder/lixiang1/working/train/model/moses.ini </home/xdj/mtworkdir/mosesdecoder/lixiang1/working/in > out