Compiling and running TensorFlow word2vec
- The more complete (non-demo) word2vec implementation lives in
tensorflow/models/embedding/
- First, install Bazel, which is needed to build it
Bazel ships binary installers; here we use the 0.1.0 release:
https://github.com/bazelbuild/bazel/releases/download/0.1.0/bazel-0.1.0-installer-linux-x86_64.sh
The installer apparently needs to be run as root:
sh bazel-0.1.0-installer-linux-x86_64.sh
- Build word2vec
Following README.md:
bazel build -c opt tensorflow/models/embedding:all
- Download the training and evaluation data (a quick sanity check on the corpus follows these commands)
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
wget https://word2vec.googlecode.com/svn/trunk/questions-words.txt
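The statistics word2vec prints at startup (total words, unique words, "unique frequent words") come straight from this text8 file, so they can be sanity-checked up front. A minimal sketch, assuming the script's default --min_count of 5; the exact path depends on where you put the file (the run below reads it from ./data/):

from collections import Counter

MIN_COUNT = 5  # assumption: word2vec_optimized's default --min_count

with open("text8") as f:        # adjust the path if you moved it into ./data/
    words = f.read().split()

counts = Counter(words)
frequent = sum(1 for c in counts.values() if c >= MIN_COUNT)

print("words:", len(words))                 # the log below reports 17005207
print("unique words:", len(counts))         # the log below reports 253854
print("unique frequent words:", frequent)   # the log below reports 71290 (+ UNK)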
- Run word2vec
pwd
/home/users/chenghuige/other/tensorflow/bazel-bin/tensorflow/models/embedding
Then run:
./word2vec_optimized --train_data ./data/text8 --eval_data ./data/questions-words.txt --save_path ./data/result/
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 24
I tensorflow/models/embedding/word2vec_kernels.cc:149] Data file: ./data/text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file: ./data/text8
Vocab size: 71290 + UNK
Words per epoch: 17005207
Eval analogy file: ./data/questions-words.txt
Questions: 17827
Skipped: 1717
Epoch 1 Step 151381: lr = 0.023 words/sec = 25300
Eval 1419/17827 accuracy = 8.0%
Epoch 2 Step 302768: lr = 0.022 words/sec = 48503
Eval 2445/17827 accuracy = 13.7%
Epoch 3 Step 454147: lr = 0.020 words/sec = 46666
Eval 3211/17827 accuracy = 18.0%
Epoch 4 Step 605540: lr = 0.018 words/sec = 53928
Eval 3608/17827 accuracy = 20.2%
Epoch 5 Step 756907: lr = 0.017 words/sec = 81255
Eval 4081/17827 accuracy = 22.9%
Epoch 6 Step 908251: lr = 0.015 words/sec = 46954
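Each "Eval n/17827" line reports how many of the 17827 analogy questions (a:b :: c:?, e.g. "france paris russia moscow") the model currently answers correctly; the 1717 skipped questions contain a word outside the 71290-word vocabulary. To poke at the trained embeddings directly, rerun the same command with the --interactive flag, which (in this era's script) drops into an IPython shell after training with the trained model bound to a variable named model. A rough sketch of that session, assuming the analogy() and nearby() helpers defined in word2vec_optimized.py:

# Inside the --interactive IPython shell; `model` is assumed to be the trained Word2Vec object.
model.analogy('france', 'paris', 'russia')   # predicts the fourth analogy word, ideally 'moscow'
model.nearby(['proton', 'elephant'])         # prints each word's nearest neighbors by cosine similarity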