nlp gensim fasttext word2vec

2022-05-09 17:03  brookin

gensim training error: assert vocab_n == len(model.wv.vocab)

https://github.com/RaRe-Technologies/gensim/issues/2853
This is fixed in newer gensim releases, so upgrade:

pip install gensim -U

Fine-tuning a pretrained fastText model with gensim

https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html?highlight=fasttext#module-gensim.models.fasttext

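import json
import gensim

# Load the tokenized corpus: a list of sentences, each a list of string tokens.
with open("train_voc.json", "r", encoding="utf-8") as file:
    sents = json.load(file)

# Load the full pretrained Facebook model (not only the vectors) so training can continue.
model = gensim.models.fasttext.load_facebook_model("cc.de.300.bin")
# Add the words of the new corpus to the existing vocabulary.
model.build_vocab(sents, update=True)
# Continue training on the new sentences (corpus_iterable is the gensim 4.x keyword).
model.train(corpus_iterable=sents, total_examples=len(sents), epochs=2)
# Save the tuned model in Facebook's binary format.
gensim.models.fasttext.save_facebook_model(model, "cc.de.300.tuned.bin")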
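
build_vocab and train expect an iterable of sentences where each sentence is already a list of string tokens. A common slip is to pass plain strings, which gensim then iterates character by character. A minimal sketch of a sanity check to run before training; check_corpus is a hypothetical helper for illustration, not part of gensim:

```python
def check_corpus(sents):
    """Return the number of sentences, or raise ValueError if the shape is wrong.

    Hypothetical helper: expects a non-empty list of non-empty lists of str.
    """
    if not isinstance(sents, list) or not sents:
        raise ValueError("corpus must be a non-empty list of sentences")
    for i, sent in enumerate(sents):
        # A sentence given as a plain string would be split into characters.
        if not isinstance(sent, list) or not sent:
            raise ValueError(f"sentence {i} must be a non-empty list of tokens")
        for tok in sent:
            if not isinstance(tok, str):
                raise ValueError(f"sentence {i} has a non-string token: {tok!r}")
    return len(sents)
```

Calling check_corpus(sents) right after json.load would also give a value that can be passed as total_examples.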

Format of train_voc.json

[
  ["This", "module", "allows", "training", "word", "embeddings", "from", "a", "training", "corpus"],
  ["The", "additional", "ability", "to", "obtain", "word", "vectors", "for", "out-of-vocabulary", "words"]
]
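
A file in this format can be produced from raw sentences with the standard library alone. The following is a rough sketch, assuming a simple regex tokenizer that keeps word characters and hyphens; tokenize and write_corpus are hypothetical names, and a real corpus will probably need a language-aware tokenizer:

```python
import json
import re


def tokenize(text):
    # Assumption: tokens are runs of word characters and hyphens;
    # other punctuation is dropped.
    return re.findall(r"[\w-]+", text)


def write_corpus(raw_sentences, path):
    """Tokenize each sentence and write the list of token lists as JSON."""
    sents = [tokens for tokens in map(tokenize, raw_sentences) if tokens]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps umlauts and other non-ASCII text readable.
        json.dump(sents, f, ensure_ascii=False, indent=2)
    return sents
```

For example, write_corpus(["Ein Satz.", "Noch ein Satz."], "train_voc.json") would write two token lists that the training script can load with json.load.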