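Training a New Machine Translation Model with Fairseq

This post walks through training a new machine translation model with Fairseq, following the official German-English example, "Fairseq: Training a New Model".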

Data preprocessing

Change into the fairseq/examples/translation directory and run sh prepare-iwslt14.sh. The script performs the following steps.

Downloading the data

echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git

echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git

...

URL="https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz"
GZ=de-en.tgz

...

wget "$URL"

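Three things are downloaded: the German-English parallel corpus and the two tool repositories mosesdecoder and subword-nmt.

mosesdecoder is a widely used machine translation toolkit that ships many useful scripts.
subword-nmt builds a subword vocabulary from the training data and segments the training, test, and validation sets into subwords.

Cleaning the data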


SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl

...

src=de
tgt=en
lang=de-en
prep=iwslt14.tokenized.de-en
tmp=$prep/tmp

...

echo "pre-processing train data..."
for l in $src $tgt; do
    f=train.tags.$lang.$l
    tok=train.tags.$lang.tok.$l

    cat $orig/$lang/$f | \
    grep -v '<url>' | \
    grep -v '<talkid>' | \
    grep -v '<keywords>' | \
    sed -e 's/<title>//g' | \
    sed -e 's/<\/title>//g' | \
    sed -e 's/<description>//g' | \
    sed -e 's/<\/description>//g' | \
    perl $TOKENIZER -threads 8 -l $l > $tmp/$tok
    echo ""
done
perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175
for l in $src $tgt; do
    perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l
done

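In this task, grep drops the metadata lines (<url>, <talkid>, <keywords>) and sed strips the <title> and <description> tags. mosesdecoder's scripts/tokenizer/tokenizer.perl tokenizes the sentences. scripts/training/clean-corpus-n.perl removes sentence pairs that are too long (here, more than 175 tokens) or whose source/target length ratio is too large (here, above 1.5). scripts/tokenizer/lowercase.perl converts all text to lowercase.

Splitting into training, validation, and test sets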
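The length and ratio filter can be pictured with a small Python sketch. It is a simplified illustration, not the Perl script itself: the function name keep_pair and the sample sentences are hypothetical, and only the thresholds (1, 175, 1.5) come from the command above.

```python
def keep_pair(src_line, tgt_line, min_len=1, max_len=175, max_ratio=1.5):
    """Return True if a tokenized sentence pair passes the length and ratio checks."""
    n_src = len(src_line.split())
    n_tgt = len(tgt_line.split())
    # Drop pairs where either side is shorter than min_len or longer than max_len.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Drop pairs whose lengths differ by more than max_ratio, in either direction.
    # (Both counts are at least min_len >= 1 here, so the divisions are safe.)
    return n_src / n_tgt <= max_ratio and n_tgt / n_src <= max_ratio


pairs = [
    ("das ist gut .", "that is good ."),       # 4 vs 4 tokens: kept
    ("ja", "yes , of course , absolutely ."),  # 1 vs 7 tokens: ratio too large
    ("", "empty source"),                      # 0 tokens on one side: too short
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(kept)
```

Filtering both languages with one decision per pair keeps the source and target files aligned line by line.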


echo "creating train, valid, test..."
for l in $src $tgt; do
    awk '{if (NR%23 == 0)  print $0; }' $tmp/train.tags.de-en.$l > $tmp/valid.$l
    awk '{if (NR%23 != 0)  print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l

    cat $tmp/IWSLT14.TED.dev2010.de-en.$l \
        $tmp/IWSLT14.TEDX.dev2012.de-en.$l \
        $tmp/IWSLT14.TED.tst2010.de-en.$l \
        $tmp/IWSLT14.TED.tst2011.de-en.$l \
        $tmp/IWSLT14.TED.tst2012.de-en.$l \
        > $tmp/test.$l
done
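The two awk commands send every 23rd line of the cleaned training data to the validation set and everything else to the training set. A minimal Python equivalent, where the helper name split_train_valid and the dummy sentences are assumptions made for illustration:

```python
def split_train_valid(lines, every=23):
    """Send every `every`-th line (1-based, like awk's NR) to valid, the rest to train."""
    train, valid = [], []
    for line_no, line in enumerate(lines, start=1):
        if line_no % every == 0:
            valid.append(line)
        else:
            train.append(line)
    return train, valid


lines = [f"sentence {i}" for i in range(1, 101)]
train, valid = split_train_valid(lines)
print(len(train), len(valid))  # 96 4
```

Because the split depends only on the line number, applying it to the de and en files separately still yields aligned sentence pairs.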

Learning BPE from the training corpus and applying it to the datasets


TRAIN=$tmp/train.en-de
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
    cat $tmp/train.$l >> $TRAIN
done

echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE

for L in $src $tgt; do
    for f in train.$L valid.$L test.$L; do
        echo "apply_bpe.py to ${f}..."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f
    done
done

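learn_bpe.py learns the BPE merge operations (the subword codes) from the training set.
apply_bpe.py applies the learned codes to segment the training, validation, and test sets into subwords.
For more on BPE in NLP, see: https://zhuanlan.zhihu.com/p/86965595

Data normalization

Note that the steps above may differ from task to task. In this step, the data preprocessed by the shell script is converted into the normalized form that the model finally consumes.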
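To make the idea behind learn_bpe.py concrete, here is a toy sketch of BPE learning: repeatedly count adjacent symbol pairs and merge the most frequent one. It is not the subword-nmt implementation (which differs in details such as tie-breaking and efficiency); the function names and the four-word vocabulary are illustrative assumptions.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    merged = "".join(pair)
    # Match the pair only at symbol boundaries (whitespace or string edges).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the list of learned merges and the re-segmented vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Words are pre-split into characters, with </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, segmented = learn_bpe(vocab, 3)
print(merges)
print(segmented)
```

Applying BPE to new text then amounts to replaying the learned merges in order, which is the role apply_bpe.py plays with the code file.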

Once Fairseq is installed, commands such as fairseq-preprocess are registered as console scripts, as declared in setup.py:


    entry_points={
        'console_scripts': [
            'fairseq-eval-lm = fairseq_cli.eval_lm:cli_main',
            'fairseq-generate = fairseq_cli.generate:cli_main',
            'fairseq-interactive = fairseq_cli.interactive:cli_main',
            'fairseq-preprocess = fairseq_cli.preprocess:cli_main',
            'fairseq-score = fairseq_cli.score:cli_main',
            'fairseq-train = fairseq_cli.train:cli_main',
            'fairseq-validate = fairseq_cli.validate:cli_main',
        ],
    }

Following the official tutorial, run:


TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en

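In practice, however, the console command sometimes runs under the wrong Python version, especially inside a conda environment, so it can be more reliable to run the corresponding Python script directly from the root of the Fairseq repository. In addition, passing --dataset-impl raw writes the training data as plain text, which makes it easier to inspect and debug: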

TEXT=examples/translation/iwslt14.tokenized.de-en
python fairseq_cli/preprocess.py --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --dataset-impl raw

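This step places the training data in the target directory destdir, builds the token-to-index dictionaries, and binarizes the training data (with --dataset-impl raw, the text is kept as plain text instead).

In addition:

Fairseq combines the --trainpref, --validpref, and --testpref prefixes with source-lang and target-lang to locate the corpus files, so source-lang and target-lang must match the language suffixes used earlier. The files it looks for are {pref}.{lang}, where pref is the train/valid/test prefix and lang is the source or target language, for example $TEXT/train.de. source-lang and target-lang also determine the dictionary names: the output dictionaries are named dict.{lang}.txt, for example dict.de.txt and dict.en.txt.
destdir specifies where the processed training data is written.

Training

Creating a directory for the checkpoints


mkdir -p checkpoints/fconv

Starting training

As before, run the corresponding Python script directly: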
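A token-to-index dictionary can be sketched in a few lines of Python. This is a simplified illustration rather than Fairseq's Dictionary class: the function names are hypothetical, and the order of the four special symbols is an assumption meant to mirror Fairseq's defaults.

```python
from collections import Counter

# Assumed to mirror Fairseq's default special symbols and their order.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]


def build_vocab(sentences):
    """Return (token -> index mapping, list of (token, count) sorted by frequency)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Most frequent first; ties are broken alphabetically to stay deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tokens = SPECIALS + [tok for tok, _ in ordered]
    return {tok: i for i, tok in enumerate(tokens)}, ordered


def encode(sentence, index):
    """Map tokens to indices, using <unk> for unseen tokens and appending </s>."""
    unk = index["<unk>"]
    return [index.get(tok, unk) for tok in sentence.split()] + [index["</s>"]]


corpus = ["das ist gut", "das ist schlecht"]
index, ordered = build_vocab(corpus)
for tok, count in ordered:
    print(tok, count)  # one "token count" pair per line
print(encode("das ist neu", index))
```

The unseen token "neu" maps to the <unk> index, which is why applying BPE beforehand matters: subword units keep the number of unknown tokens low.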
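The naming convention can be spelled out as a short Python sketch. The helper names input_path and dict_name are hypothetical; they only reproduce the prefix-plus-language pattern and are not Fairseq functions.

```python
def input_path(pref, lang):
    """Corpus file looked up for one split prefix and one language."""
    return f"{pref}.{lang}"


def dict_name(lang):
    """Name of the dictionary file written for one language."""
    return f"dict.{lang}.txt"


text = "examples/translation/iwslt14.tokenized.de-en"
for split in ("train", "valid", "test"):
    for lang in ("de", "en"):
        print(input_path(f"{text}/{split}", lang))
print(dict_name("de"), dict_name("en"))
```

If a file such as train.de is missing or the suffix does not match source-lang, preprocessing cannot find the corpus, so checking these paths first is a cheap sanity test.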

CUDA_VISIBLE_DEVICES=0 python fairseq_cli/train.py data-bin/iwslt14.tokenized.de-en \
    --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
    --dataset-impl raw \
    --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

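By default, Fairseq uses all GPUs on the machine; in this example, CUDA_VISIBLE_DEVICES=0 restricts it to GPU 0. Because the previous step wrote the dataset in raw format, the dataset format must also be given explicitly as raw here. In addition, max-tokens caps the number of tokens per batch, so Fairseq determines the batch size (number of sentences) itself.

In addition, in the example above:

The first positional argument, data-bin/iwslt14.tokenized.de-en, specifies the directory that holds the preprocessed training data.
lr sets the learning rate; clip-norm sets the maximum gradient norm (see torch.nn.utils.clip_grad_norm_); dropout sets the dropout rate.
arch selects the concrete model architecture; the definitions can be found under fairseq/models. A model defines the abstract model class, while an arch fixes a concrete configuration, such as the embedding dimension and the number of hidden layers.

Generation

This step does not use generate from the official tutorial, because generate decodes the preprocessed dataset rather than an arbitrary input file. Instead, it uses interactive, with --input pointing to the test text.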
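How a token budget turns into variable batch sizes can be illustrated with a small sketch. It is a simplification under stated assumptions, not Fairseq's batching code: the function name batch_by_tokens is hypothetical, and the cost of a batch is taken to be the number of sentences times the longest sentence (the padded size).

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sentence indices so that (batch size x longest sentence) <= max_tokens."""
    # Sort by length so that sentences of similar length share a batch.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, longest = [], [], 0
    for i in order:
        new_longest = max(longest, lengths[i])
        # Padded cost if sentence i joined the current batch.
        if current and (len(current) + 1) * new_longest > max_tokens:
            batches.append(current)
            current, new_longest = [], lengths[i]
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


lengths = [5, 30, 7, 28, 6, 31]
for batch in batch_by_tokens(lengths, max_tokens=64):
    print([lengths[i] for i in batch])
```

Short sentences end up in larger batches and long sentences in smaller ones, so each batch costs roughly the same amount of memory.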
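The effect of clip-norm can be shown without PyTorch. The sketch below follows the general idea behind torch.nn.utils.clip_grad_norm_ (rescale all gradients when their overall L2 norm exceeds the limit), using plain Python lists; the function name clip_grad_norm and the sample values are illustrative assumptions.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so their L2 norm is at most max_norm; return (new grads, old norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor below 1 only when the norm exceeds max_norm.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=0.1)
print(norm)     # norm before clipping
print(clipped)  # same direction, rescaled to the limit

unchanged, small_norm = clip_grad_norm([0.03, 0.04], max_norm=0.1)
print(unchanged)  # already within the limit, left as-is
```

Clipping changes only the length of the gradient vector, not its direction, which limits the size of a single update when gradients spike.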

python fairseq_cli/interactive.py data-bin/iwslt14.tokenized.de-en \
    --input evalution.txt \
    --path checkpoints/fconv/checkpoint_best.pt \
    --buffer-size 128 --batch-size 128 --beam 5

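The first positional argument, data-bin/iwslt14.tokenized.de-en, specifies the directory that holds the preprocessed data.
input specifies the path of the text to translate; this text should be preprocessed in the same way as the training data (tokenized, lowercased, and split with BPE).
path specifies the path of the trained model.
batch-size sets the number of sentences per batch; interactive requires it to be no larger than --buffer-size.
beam specifies the beam size for beam search. See: https://www.jianshu.com/p/c2420ff9744a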
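The role of the beam size can be seen in a toy beam search. This is a bare-bones sketch, not Fairseq's generator (which, among other things, batches hypotheses and normalizes scores by length); the function names and the hand-written probability table are assumptions for illustration.

```python
import math


def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """Return the (tokens, log-prob) hypothesis with the highest score."""
    beams = [([], 0.0)]  # unfinished hypotheses: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len without reaching eos
    return max(finished, key=lambda c: c[1])


# Toy next-token distributions keyed by the prefix generated so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "</s>": 0.1},
}


def toy_step(prefix):
    probs = TABLE.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_tokens, greedy_score = beam_search(toy_step, beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_step, beam_size=2, max_len=5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With a beam of 1 the search commits to "a" because it looks best locally, while a beam of 2 keeps "b" alive long enough to find the more probable full sequence.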

Posted 2020-08-23 18:07 by 冬色


冬色

GitHub: https://github.com/cnlinxi