Dependency Parsing

Dependency analysis of sentence constituents comes in two main kinds: syntactic and semantic.

Syntactic dependency parsing

Semantic dependency parsing

There are two families of methods for dependency parsing: transition-based and graph-based.
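To make the transition-based idea concrete, here is a minimal sketch of the arc-standard transition system, one common choice for transition-based parsers. The class name ArcStandardSketch and the hard-coded action sequence are illustrative assumptions; a real parser predicts each action with a trained classifier instead of reading it from a list.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of arc-standard transitions; not taken from any library.
final class ArcStandardSketch {

    // Applies a sequence of actions to a sentence of n tokens (ids 1..n, 0 = ROOT).
    // Returns heads[i] = id of the head of token i; heads[0] is unused and stays -1.
    static int[] parse(int n, List<String> actions) {
        int[] heads = new int[n + 1];
        Arrays.fill(heads, -1);
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);   // the stack starts with ROOT
        int next = 1;    // id of the first token still in the buffer
        for (String action : actions) {
            switch (action) {
                case "SHIFT":      // move the next buffer token onto the stack
                    stack.push(next++);
                    break;
                case "LEFT-ARC": { // second-from-top becomes a dependent of the top
                    int top = stack.pop();
                    int second = stack.pop();
                    heads[second] = top;
                    stack.push(top);
                    break;
                }
                case "RIGHT-ARC": { // top becomes a dependent of the second-from-top
                    int top = stack.pop();
                    heads[top] = stack.peek();
                    break;
                }
                default:
                    throw new IllegalArgumentException("unknown action: " + action);
            }
        }
        return heads;
    }

    public static void main(String[] args) {
        // "She eats fish": "eats" (2) attaches to ROOT, "She" (1) and "fish" (3) attach to "eats".
        List<String> actions = Arrays.asList(
                "SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC");
        System.out.println(Arrays.toString(parse(3, actions)));
    }
}
```

A graph-based parser, by contrast, scores candidate head-dependent arcs over the whole sentence and searches for the best-scoring tree, rather than building the tree one action at a time.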

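Training the Stanford NLP dependency parser

Stanford's NLP toolkit (Stanford CoreNLP) provides (syntactic) dependency parsing and also lets you train your own dependency parsing model. The Stanford dependency parser (as of Stanford NLP 3.9.2) uses the neural-network method of Chen & Manning (2014). Training needs two key inputs: a word embedding file, which is not strictly required but is best provided, and the dependency training data.

The command for training a Chinese dependency parser has the form:

java -cp <path to the stanford-nlp jars> edu.stanford.nlp.parser.nndep.DependencyParser -tlp edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack -trainFile <training data file> -embedFile <embedding file> -model <output model file name>

The official documentation says to provide the data in CoNLL-X format. The original text reads: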

Training your own parser

You can train a new dependency parser using your own data in the CoNLL-X data format. (Many dependency treebanks are provided in this format by default; even if not, conversion is often trivial.)

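The CoNLL-X format file I used has only 7 columns: ID, FORM, LEMMA, POS, FEAT, HEAD, DEPREL. CoNLL has 12 columns and CoNLL-U has 10. (Note that the CoNLL-X shared task specification itself defines 10 columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL; the 7-column layout above is a reduced variant.) The CoNLL-X format description that the Stanford dependency parser documentation links to (as of 2019-02), http://ilk.uvt.nl/conll/#dataformat, is dead. Descriptions of the CoNLL-X and CoNLL formats can be found at https://stackoverflow.com/questions/27416164/what-is-conll-data-format , and the CoNLL-U format description at https://universaldependencies.org/docs/format.html .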

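However, when the data is supplied in this 7-column CoNLL-X format, training quickly aborts with a NullPointerException (NPE). Reading the source (version 3.9.2) shows roughly what happens. Each line of the training file is split on "\t". If the line has at least 10 columns, the 2nd column is taken as the word, the 4th or 5th column (splits[3] or splits[4], depending on cPOS) as the POS tag, the 7th column as the head, and the 8th column as the dependency type. Lines with fewer than 10 columns (which includes blank lines) are skipped and treated as sentence boundaries. With the 7-column format, the trainer therefore parses no dependency trees at all. At the end, the trainer writes out the labels seen in the training data (a List variable, knownLabels). The NPE is thrown because that list contains a null element when it is iterated for writing. The null comes from the rootLabel string never being assigned: since no dependency tree was parsed, no ROOT was found, so rootLabel keeps its initial value null, is added to knownLabels, and triggers the exception.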

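From the source it follows that the tool effectively requires training data in the 10-column CoNLL-U format rather than the 7-column CoNLL-X format described above, in both the number of columns and the meaning of each column. Although training does not use the last two CoNLL-U columns, they still must not be empty strings. Otherwise Java's String.split("\t") drops them as trailing empty fields, which is the same as the two columns being absent, and the training data again fails to parse. In addition, token IDs within a sentence must start at 1, not 0, and a token attached to ROOT must have HEAD value 0, not -1.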
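The trailing-empty-column behavior is easy to check in isolation. The following sketch (the class and method names are made up for illustration) compares String.split("\t"), which is what the loader calls, with String.split("\t", -1), which keeps trailing empty fields.

```java
// Hypothetical demo class; only String.split from the JDK is used.
final class SplitTrailingDemo {

    // Column count as the loader sees it: trailing empty fields are discarded.
    static int columns(String line) {
        return line.split("\t").length;
    }

    // Column count when trailing empty fields are kept (negative limit).
    static int columnsKeepEmpty(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        String emptyTail = "1\tI\tI\tPN\tPN\t_\t2\tnsubj\t\t";     // last two columns are empty strings
        String filledTail = "1\tI\tI\tPN\tPN\t_\t2\tnsubj\t_\t_";  // last two columns hold "_"
        System.out.println(columns(emptyTail));          // 8, so the loader would skip this line
        System.out.println(columnsKeepEmpty(emptyTail)); // 10
        System.out.println(columns(filledTail));         // 10
    }
}
```

Filling the unused columns with a placeholder such as "_" is enough to keep the line at 10 columns.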

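In summary, three points need attention:

The training data must be in the 10-column CoNLL-U format (not the 7-column CoNLL-X format)
Token IDs in each sentence start at 1, and the token attached to ROOT has HEAD value 0
The last two columns of the training data are not actually used in training, but they must not be empty strings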
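The three points above can be turned into a quick pre-flight check on a training file before starting a long training run. This is only a sketch under stated assumptions: TrainFileCheck and its messages are hypothetical, it is not part of CoreNLP, and it treats every blank line as a sentence boundary and every other line as a token line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stanford CoreNLP.
final class TrainFileCheck {

    // Returns a list of human-readable problems; an empty list means the checks passed.
    static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();
        int expectedId = 1;       // token IDs restart at 1 in every sentence
        boolean sawRoot = false;  // whether the current sentence has a HEAD == 0 token
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int lineNo = i + 1;
            if (line.trim().isEmpty()) {  // blank line closes the current sentence
                if (expectedId > 1 && !sawRoot) {
                    problems.add("line " + lineNo + ": sentence ends without a HEAD=0 token");
                }
                expectedId = 1;
                sawRoot = false;
                continue;
            }
            String[] cols = line.split("\t");  // same split call the loader uses
            int id = expectedId++;
            if (cols.length < 10) {
                problems.add("line " + lineNo + ": only " + cols.length + " columns after split");
                continue;
            }
            if (!cols[0].equals(String.valueOf(id))) {
                problems.add("line " + lineNo + ": ID is " + cols[0] + ", expected " + id);
            }
            try {
                int head = Integer.parseInt(cols[6]);
                if (head == 0) sawRoot = true;
                if (head < 0) problems.add("line " + lineNo + ": negative HEAD " + head);
            } catch (NumberFormatException e) {
                problems.add("line " + lineNo + ": HEAD is not an integer: " + cols[6]);
            }
        }
        if (expectedId > 1 && !sawRoot) {
            problems.add("end of input: last sentence has no HEAD=0 token");
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1\tI\tI\tPN\tPN\t_\t2\tnsubj\t_\t_",
                "2\tcame\tcome\tVV\tVV\t_\t0\troot\t_\t_",
                "");
        System.out.println(check(sample));  // no problems expected for this sample
    }
}
```

In practice the lines would come from something like java.nio.file.Files.readAllLines on the training file.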

Appendix: the part of the source that splits the file lines and builds the dependency trees:

//file: edu.stanford.nlp.parser.nndep.DependencyParser.java(stanford-corenlp:3.9.2 )
public static void loadConllFile(String inFile, List<CoreMap> sents, List<DependencyTree> trees, boolean unlabeled, boolean cPOS)
  {
    CoreLabelTokenFactory tf = new CoreLabelTokenFactory(false);

    try (BufferedReader reader = IOUtils.readerFromString(inFile)) {

      List<CoreLabel> sentenceTokens = new ArrayList<>();
      DependencyTree tree = new DependencyTree();

      for (String line : IOUtils.getLineIterable(reader, false)) {
        String[] splits = line.split("\t");
        if (splits.length < 10) {
          if (sentenceTokens.size() > 0) {
            trees.add(tree);
            CoreMap sentence = new CoreLabel();
            sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
            sents.add(sentence);
            tree = new DependencyTree();
            sentenceTokens = new ArrayList<>();
          }
        } else {
          String word = splits[1],
                  pos = cPOS ? splits[3] : splits[4],
                  depType = splits[7];

          int head = -1;
          try {
            head = Integer.parseInt(splits[6]);
          } catch (NumberFormatException e) {
            continue;
          }

          CoreLabel token = tf.makeToken(word, 0, 0);
          token.setTag(pos);
          token.set(CoreAnnotations.CoNLLDepParentIndexAnnotation.class, head);
          token.set(CoreAnnotations.CoNLLDepTypeAnnotation.class, depType);
          sentenceTokens.add(token);

          if (!unlabeled)
            tree.add(head, depType);
          else
            tree.add(head, Config.UNKNOWN);
        }
      }
    } catch (IOException e) {
      throw new RuntimeIOException(e);
    }
  }

Training the Stanford NLP dependency parser can sometimes be very slow.

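Training the HIT (Harbin Institute of Technology) LTP dependency parser

LTP is written in C++ and does not ship a prebuilt binary for training, so you need to download the source and compile it.
Download the LTP source and build it following the guide. The resulting binary tools/train/nndepparser, run with the learn subcommand, trains the dependency parser. Required parameters: --embedding <FILE> (embedding file), --reference <FILE> (training data file), --model <FILE> (model output path). An embedding file must be supplied, but its contents may be empty. Command:
tools/train/nndepparser learn --model ltp.nndep.model --embedding embed.txt --reference train-data.conllu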

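As of their latest versions at the time of writing, both the Stanford NLP (3.9.2) and LTP (3.4.0) dependency parsers use the method proposed by Chen & Manning (2014).

HanLP dependency parsing

As of the latest HanLP version at the time of writing, 1.7.2, its dependency parser is ported from the LTP code.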

posted @ 2019-03-01 15:47 二球悬铃木