Technical documentation translation: GloVe README (1)

Package Contents
To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by a single space. If your corpus has multiple documents, simply concatenate the documents together with a single space. If your documents are particularly short, it's possible that padding the gap between documents with e.g. 5 "dummy" words will produce better vectors. Once you create your corpus, you can train GloVe vectors using the following 4 tools. An example is included in demo.sh, which you can modify as necessary.

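The corpus preparation described above can be sketched in a few lines of Python. This is only an illustration, not part of the GloVe package: the function name `build_corpus`, the `pad_words` parameter, and the `<dummy>` padding token are all assumptions made for this example.

```python
from typing import Iterable, List


def build_corpus(documents: Iterable[str], pad_words: int = 0,
                 pad_token: str = "<dummy>") -> str:
    """Join documents into one space-separated string.

    pad_words and pad_token are hypothetical knobs for this sketch: when
    pad_words > 0, that many copies of pad_token are placed between
    consecutive documents.
    """
    cleaned: List[str] = []
    for doc in documents:
        # split() with no arguments collapses newlines, tabs and repeated
        # spaces, so rejoining yields exactly one space between tokens.
        tokens = doc.split()
        if tokens:
            cleaned.append(" ".join(tokens))
    separator = " "
    if pad_words > 0:
        separator = " " + " ".join([pad_token] * pad_words) + " "
    return separator.join(cleaned)


docs = ["the cat sat\non the mat", "  dogs   bark "]
print(build_corpus(docs))
print(build_corpus(docs, pad_words=5))
```

Writing the returned string to a single text file gives the kind of input the tools below expect. Note that a padding token, if used, is counted like any other word, so it will also appear in the vocabulary.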
The four main tools in this package are:

1) vocab_count
This tool requires an input corpus that already consists of whitespace-separated tokens. Use something like the Stanford Tokenizer first on raw text. It constructs unigram counts from the corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.

2) cooccur
Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.

3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.

4) glove
Trains the GloVe model on the specified cooccurrence data, which typically will be the output of the shuffle tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running ./build/glove.
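
To show how the four tools chain together, here is a rough Python sketch that builds the command lines rather than running them. The flag names and values follow the demo.sh shipped with the package as best I recall it; treat them as assumptions and confirm them by running each tool without arguments, as the descriptions above suggest. The file names (`vocab.txt`, `cooccurrence.bin`, `cooccurrence.shuf.bin`, `vectors`) and the helper names are placeholders chosen for this example.

```python
import subprocess
from typing import List, NamedTuple, Optional


class Stage(NamedTuple):
    argv: List[str]
    stdin: Optional[str] = None   # file redirected into the tool, if any
    stdout: Optional[str] = None  # file the tool's output is redirected to


def glove_commands(corpus: str, build_dir: str = "build") -> List[Stage]:
    """Describe the four stages; flags are assumed from demo.sh, not verified."""
    vocab, cooc, shuf = "vocab.txt", "cooccurrence.bin", "cooccurrence.shuf.bin"
    return [
        Stage([f"{build_dir}/vocab_count", "-min-count", "5", "-verbose", "2"],
              stdin=corpus, stdout=vocab),
        Stage([f"{build_dir}/cooccur", "-memory", "4.0", "-vocab-file", vocab,
               "-verbose", "2", "-window-size", "15"],
              stdin=corpus, stdout=cooc),
        Stage([f"{build_dir}/shuffle", "-memory", "4.0", "-verbose", "2"],
              stdin=cooc, stdout=shuf),
        Stage([f"{build_dir}/glove", "-save-file", "vectors", "-threads", "8",
               "-input-file", shuf, "-x-max", "10", "-iter", "15",
               "-vector-size", "50", "-binary", "2", "-vocab-file", vocab,
               "-verbose", "2"]),
    ]


def as_shell(stage: Stage) -> str:
    """Render a stage as the equivalent shell line, for inspection."""
    line = " ".join(stage.argv)
    if stage.stdin:
        line += f" < {stage.stdin}"
    if stage.stdout:
        line += f" > {stage.stdout}"
    return line


def run_all(stages: List[Stage]) -> None:
    """Execute each stage in order (needs the compiled tools to be present)."""
    for stage in stages:
        fin = open(stage.stdin, "rb") if stage.stdin else None
        fout = open(stage.stdout, "wb") if stage.stdout else None
        try:
            subprocess.run(stage.argv, stdin=fin, stdout=fout, check=True)
        finally:
            for handle in (fin, fout):
                if handle is not None:
                    handle.close()


for stage in glove_commands("corpus.txt"):
    print(as_shell(stage))
```

Only the printing loop runs here; `run_all` is left uncalled so the sketch works on a machine where the tools have not been built.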
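To train your own GloVe word vectors, you first need to prepare a single file containing your corpus, in which all words are separated by a single space. If your corpus consists of multiple documents, join them one after another with a single space. If your documents are very short, padding the gap between documents with, for example, 5 "dummy" words may produce better word vectors. Once the corpus is ready, you can train GloVe vectors with the following 4 tools. demo.sh contains an example, which you can modify as needed.
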
The four main tools in the package are:
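1) vocab_count
This tool requires that the input corpus already consists of whitespace-separated tokens, so raw text should first be processed with something like the Stanford Tokenizer. The tool counts the unigrams in the corpus and can optionally threshold the resulting vocabulary by total vocabulary size or by minimum frequency count.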
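2) cooccur
Constructs word-word co-occurrence statistics from the corpus. The user should supply a vocabulary file produced by vocab_count and may specify a range of parameters, which are described when running ./build/cooccur.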
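3) shuffle
Shuffles the binary file of co-occurrence statistics produced by cooccur. Large files are automatically split into chunks, each of which is shuffled and stored on disk before the chunks are merged and shuffled together. The user may specify a number of parameters, which are described when running ./build/shuffle.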
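4) glove
Trains the GloVe model on the specified co-occurrence data, which is usually the output of the shuffle tool. The user should supply a vocabulary file produced by vocab_count and may specify a number of other parameters, which are described when running ./build/glove.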
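
To make the first three steps more concrete, the following toy Python sketch mimics what they compute on a tiny in-memory corpus. It is a simplification under stated assumptions and not the package's implementation: the real tools are C programs working on files, and any weighting or symmetry options of cooccur are not modeled. All function and parameter names here (`count_vocab`, `count_cooccurrences`, `chunked_shuffle`, `min_count`, `max_vocab`, `window`, `chunk_size`) are made up for this example.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple


def count_vocab(tokens: List[str], min_count: int = 1,
                max_vocab: Optional[int] = None) -> List[Tuple[str, int]]:
    """Unigram counts, thresholded by minimum count and/or vocabulary size."""
    counts = Counter(tokens)
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically to keep output stable.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    if max_vocab is not None:
        kept = kept[:max_vocab]
    return kept


def count_cooccurrences(tokens: List[str], vocab: List[Tuple[str, int]],
                        window: int = 2) -> Dict[Tuple[int, int], float]:
    """Symmetric, unweighted counts of in-vocabulary pairs within a window."""
    index = {w: i for i, (w, _) in enumerate(vocab)}
    counts: Counter = Counter()
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        # Look back over the previous `window` tokens only; adding both
        # directions below makes the counts symmetric.
        for ctx in tokens[max(0, pos - window):pos]:
            if ctx in index:
                a, b = index[word], index[ctx]
                counts[(a, b)] += 1.0
                counts[(b, a)] += 1.0
    return dict(counts)


def chunked_shuffle(records: list, chunk_size: int, seed: int = 0) -> list:
    """Shuffle each chunk, then merge the chunks and shuffle once more."""
    rng = random.Random(seed)
    chunks = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        rng.shuffle(chunk)
        chunks.append(chunk)  # a file-based tool would write this to disk
    merged = [r for chunk in chunks for r in chunk]
    rng.shuffle(merged)
    return merged


tokens = "the cat sat on the mat the cat".split()
vocab = count_vocab(tokens, min_count=2)
cooc = count_cooccurrences(tokens, vocab, window=2)
records = chunked_shuffle(sorted(cooc.items()), chunk_size=2)
print(vocab)
print(records)
```

The final in-memory shuffle in `chunked_shuffle` only mirrors the order of operations in the description; it does not reproduce the memory savings that motivate chunking large files on disk.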

 

posted @ 2018-02-23 18:28  在路上-UP