References:

http://ir.dlut.edu.cn/NewsShow.aspx?ID=291

http://www.douban.com/note/298095260/

 

word2vec is an important algorithm in NLP. It represents each word as a K-dimensional dense vector; the training data is a raw corpus with punctuation removed and words separated by whitespace. It can therefore be viewed as a feature-engineering method.

Main advantages:

  • Additive compositionality: word vectors support meaningful vector arithmetic (e.g., vec(king) - vec(man) + vec(woman) lands close to vec(queen)).
  • Efficiency: a single machine can process about 20 million words per hour.

Google's open-source release is the most authoritative version ( http://word2vec.googlecode.com/svn/trunk/ ), but I studied the Spark implementation instead.
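For orientation, here is a minimal usage sketch of the standard MLlib API (org.apache.spark.mllib.feature.Word2Vec; the class analyzed in section IV is a renamed copy, Word2VectorEX, with essentially the same interface). The corpus path and the SparkContext sc are assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// Minimal sketch: tokenize a whitespace-separated corpus and train.
def trainExample(sc: SparkContext): Unit = {
  val input: RDD[Seq[String]] = sc.textFile("corpus.txt").map(_.split(" ").toSeq)

  val model = new Word2Vec()
    .setVectorSize(100)    // K, the dimensionality of the dense vectors
    .setNumPartitions(1)   // a small number of partitions favors accuracy
    .fit(input)

  // Nearest neighbors of a word by cosine similarity.
  model.findSynonyms("china", 10).foreach { case (w, sim) => println(s"$w\t$sim") }
}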

 

I. Background

Distributed representation: a way of representing word features. Training maps every word to a K-dimensional real-valued vector (K is usually a hyperparameter of the model); the semantic similarity of two words is then judged by the distance between their vectors (e.g., cosine similarity or Euclidean distance).
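For example, the cosine similarity used by the Spark model code below for two word vectors $u$ and $v$ is simply \[ \cos(u,v)=\frac{u^{\top}v}{\lVert u\rVert\,\lVert v\rVert}, \] with larger values indicating closer meaning.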

Language models: n-gram models, etc.
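As a point of comparison, an n-gram model factorizes the probability of a word sequence using only the previous n-1 words as context: \[ p(w_1,\dots,w_T)\approx\prod_{t=1}^{T} p(w_t\mid w_{t-n+1},\dots,w_{t-1}). \]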

 

II. Model

0. The word window defines the context. A word $i$ is written $u_i$ when it is the center word and $v_i$ when it acts as context for another word (i.e., a word has a separate representation when used as context). Only words inside the word window count as context, and their order is irrelevant.

1. The probability model (objective) is \[ P=\sum_{i} \log p(u_i), \] where $i$ ranges over positions (words) in the corpus; that is, the objective accumulates the log-probability of every word occurrence.

2. Taking skip-gram as the example (CBOW reverses the direction of the conditional probability), the term for the word at position $i$ is

\[ \log p(u_i)=\sum_{-c\leq j\leq c,\,j\neq 0} \log p(v_{i+j}\mid u_{i}), \]

which says that position $i$ interacts only with its context.

3. The conditional probability $p(v_{i+j}\mid u_i)$ is realized by a softmax that maps the K-dimensional vectors to a probability.
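In the original skip-gram formulation this softmax over a vocabulary of size $V$ is \[ p(v_{i+j}\mid u_i)=\frac{\exp\!\left(v_{i+j}^{\top}u_i\right)}{\sum_{w=1}^{V}\exp\!\left(v_{w}^{\top}u_i\right)}, \] which is exactly what makes the naive model expensive: every update touches all $V$ output vectors, motivating the optimizations in section III.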

 

III. Optimization

The earliest formulation used a full neural-network softmax; hierarchical softmax and related techniques were later adopted to lower the time complexity. Many additional tricks are used as well, e.g., the precomputed ExpTable.

a) Remove the hidden layer.

b) Use hierarchical softmax or negative sampling (see the formula after this list).

c) Drop words that occur fewer than minCount times.

d) Precompute the ExpTable (a lookup table for the sigmoid).

e) Subsample frequent words: compute, from the formula below, the probability that each word occurrence is discarded; a discarded occurrence is simply not updated. This saves time and also improves the accuracy of infrequent words (see the sketch after this list).

\[ prob(w)=1-\left(\sqrt{\frac{t}{f(w)}}+\frac{t}{f(w)}\right), \] where $t$ is a preset threshold and $f(w)$ is the frequency of $w$.

f) The window used to pick neighboring words has a non-fixed (randomly shrunk) size, which biases updates toward words close to the center word (see the sketch after this list).

g) Multi-threaded training without mutual exclusion (lock-free updates).
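For reference, hierarchical softmax (item b) replaces the flat softmax with a product of sigmoid decisions along the Huffman-tree path to the target word, as in the original papers: \[ p(w\mid u_I)=\prod_{j=1}^{L(w)-1}\sigma\!\left(\left[\!\left[\,n(w,j+1)=\mathrm{ch}(n(w,j))\,\right]\!\right]\,{v'_{n(w,j)}}^{\top}u_I\right), \] where $n(w,j)$ is the $j$-th node on the path from the root to $w$, $L(w)$ is the path length, $\mathrm{ch}(n)$ is a fixed child of $n$, and $[\![x]\!]$ is $1$ if $x$ is true and $-1$ otherwise. Each update then costs $O(\log V)$ instead of $O(V)$.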
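Below is a minimal Scala sketch of how tricks e) and f) are typically realized (the names here are illustrative; the Spark code in section IV implements the dynamic window via random.nextInt(window) but, as listed, does not perform subsampling):

import scala.util.Random

// Illustrative helpers for subsampling (e) and the dynamic window (f).
object Word2VecTricks {
  // Discard probability from the formula in e); t is the threshold,
  // freq is the relative frequency f(w) of the word.
  def discardProb(freq: Double, t: Double): Double =
    1.0 - (math.sqrt(t / freq) + t / freq)

  // Keep the occurrence with probability 1 - discardProb(freq, t).
  def keepWord(freq: Double, t: Double, rng: Random): Boolean =
    rng.nextDouble() >= discardProb(freq, t)

  // Dynamic window: drawing b uniformly from [0, window) and using an
  // effective window of (window - b) makes nearby words updated more often.
  def effectiveWindow(window: Int, rng: Random): Int =
    window - rng.nextInt(window)
}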

 

IV. Spark source code analysis

  1 /**
  2 * Licensed to the Apache Software Foundation (ASF) under one or more
  3 * contributor license agreements.  See the NOTICE file distributed with
  4 * this work for additional information regarding copyright ownership.
  5 * The ASF licenses this file to You under the Apache License, Version 2.0
  6 * (the "License"); you may not use this file except in compliance with
  7 * the License.  You may obtain a copy of the License at
  8 *
  9 *    http://www.apache.org/licenses/LICENSE-2.0
 10 *
 11 * Unless required by applicable law or agreed to in writing, software
 12 * distributed under the License is distributed on an "AS IS" BASIS,
 13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 * See the License for the specific language governing permissions and
 15 * limitations under the License.
 16 */
 17 
 18 package org.apache.spark.mllib.feature
 19 
 20 import java.lang.{Iterable => JavaIterable}
 21 
 22 import com.github.fommil.netlib.BLAS.{getInstance => blas}
 23 import org.apache.spark.Logging
 24 import org.apache.spark.SparkContext._
 25 import org.apache.spark.annotation.Experimental
 26 import org.apache.spark.api.java.JavaRDD
 27 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 28 import org.apache.spark.rdd.RDD
 29 import org.apache.spark.util.Utils
 30 import org.apache.spark.util.random.XORShiftRandom
 31 import scala.collection.JavaConverters._
 32 import scala.collection.mutable
 33 import scala.collection.mutable.ArrayBuffer
 34 
 35 
 36 /**
 37  *  Entry in vocabulary
 38  */
 39 private case class VocabWord(
 40                               var word: String,
 41                               var cn: Int,
 42                               var point: Array[Int],
 43                               var code: Array[Int],
 44                               var codeLen:Int
 45                               )
 46 
 47 /**
 48  * :: Experimental ::
 49  * Word2Vec creates vector representation of words in a text corpus.
 50  * The algorithm first constructs a vocabulary from the corpus
 51  * and then learns vector representation of words in the vocabulary.
 52  * The vector representation can be used as features in
 53  * natural language processing and machine learning algorithms.
 54  *
 55  * We used skip-gram model in our implementation and hierarchical softmax
 56  * method to train the model. The variable names in the implementation
 57  * matches the original C implementation.
 58  *
 59  * For original C implementation, see https://code.google.com/p/word2vec/
 60  * For research papers, see
 61  * Efficient Estimation of Word Representations in Vector Space
 62  * and
 63  * Distributed Representations of Words and Phrases and their Compositionality.
 64  */
 65 @Experimental
 66 class Word2VectorEX extends Serializable with Logging {
 67 
 68   private var vectorSize = 100
 69   private var startingAlpha = 0.025
 70   private var numPartitions = 1
 71   private var numIterations = 1
 72   private var seed = Utils.random.nextLong()
 73 
 74   /**
 75    * Sets vector size (default: 100).
 76    */
 77   def setVectorSize(vectorSize: Int): this.type = {
 78     this.vectorSize = vectorSize
 79     this
 80   }
 81 
 82   /**
 83    * Sets initial learning rate (default: 0.025).
 84    */
 85   def setLearningRate(learningRate: Double): this.type = {
 86     this.startingAlpha = learningRate
 87     this
 88   }
 89 
 90   /**
 91    * Sets number of partitions (default: 1). Use a small number for accuracy.
 92    */
 93   def setNumPartitions(numPartitions: Int): this.type = {
 94     require(numPartitions > 0, s"numPartitions must be greater than 0 but got $numPartitions")
 95     this.numPartitions = numPartitions
 96     this
 97   }
 98 
 99   /**
100    * Sets number of iterations (default: 1), which should be smaller than or equal to number of
101    * partitions.
102    */
103   def setNumIterations(numIterations: Int): this.type = {
104     this.numIterations = numIterations
105     this
106   }
107 
108   /**
109    * Sets random seed (default: a random long integer).
110    */
111   def setSeed(seed: Long): this.type = {
112     this.seed = seed
113     this
114   }
115 
116   private val EXP_TABLE_SIZE = 1000
117   private val MAX_EXP = 6
118   private val MAX_CODE_LENGTH = 40
119   private val MAX_SENTENCE_LENGTH = 1000
120 
121   /** context words from [-window, window] */
 122   private val window = 5            // limits the context range
123 
124   /** minimum frequency to consider a vocabulary word */
 125   private val minCount = 5           // minimum-count threshold for filtering words
126 
 127   private var trainWordsCount = 0          // total word occurrences in the corpus (counting repeats)
 128   private var vocabSize = 0                 // number of words in the vocabulary
 129   private var vocab: Array[VocabWord] = null      // the vocabulary
 130   private var vocabHash = mutable.HashMap.empty[String, Int]      // reverse index: word -> position in vocab
131 
 132   private def learnVocab(words: RDD[String]): Unit = {           // builds the vocabulary and updates the four fields above
133     vocab = words.map(w => (w, 1))
134       .reduceByKey(_ + _)
135       .map(x => VocabWord(
136       x._1,
137       x._2,
138       new Array[Int](MAX_CODE_LENGTH),
139       new Array[Int](MAX_CODE_LENGTH),
140       0))
141       .filter(_.cn >= minCount)
142       .collect()
143       .sortWith((a, b) => a.cn > b.cn)
144 
145     vocabSize = vocab.length
146     var a = 0
147     while (a < vocabSize) {
148       vocabHash += vocab(a).word -> a
149       trainWordsCount += vocab(a).cn
150       a += 1
151     }
152     logInfo("trainWordsCount = " + trainWordsCount)
153   }
154 
 155   private def createExpTable(): Array[Float] = {        // lookup table for the sigmoid/exponential
156     val expTable = new Array[Float](EXP_TABLE_SIZE)
157     var i = 0
158     while (i < EXP_TABLE_SIZE) {
159       val tmp = math.exp((2.0 * i / EXP_TABLE_SIZE - 1.0) * MAX_EXP)
160       expTable(i) = (tmp / (tmp + 1.0)).toFloat
161       i += 1
162     }
163     expTable
164   }
165 
166   private def createBinaryTree(): Unit = {
167     val count = new Array[Long](vocabSize * 2 + 1)
168     val binary = new Array[Int](vocabSize * 2 + 1)
169     val parentNode = new Array[Int](vocabSize * 2 + 1)
170     val code = new Array[Int](MAX_CODE_LENGTH)
171     val point = new Array[Int](MAX_CODE_LENGTH)
172     var a = 0
173     while (a < vocabSize) {
174       count(a) = vocab(a).cn
175       a += 1
176     }
177     while (a < 2 * vocabSize) {
178       count(a) = 1e9.toInt
179       a += 1
180     }
181     var pos1 = vocabSize - 1
182     var pos2 = vocabSize
183 
184     var min1i = 0
185     var min2i = 0
186 
187     a = 0
188     while (a < vocabSize - 1) {
189       if (pos1 >= 0) {
190         if (count(pos1) < count(pos2)) {
191           min1i = pos1
192           pos1 -= 1
193         } else {
194           min1i = pos2
195           pos2 += 1
196         }
197       } else {
198         min1i = pos2
199         pos2 += 1
200       }
201       if (pos1 >= 0) {
202         if (count(pos1) < count(pos2)) {
203           min2i = pos1
204           pos1 -= 1
205         } else {
206           min2i = pos2
207           pos2 += 1
208         }
209       } else {
210         min2i = pos2
211         pos2 += 1
212       }
213       count(vocabSize + a) = count(min1i) + count(min2i)
214       parentNode(min1i) = vocabSize + a
215       parentNode(min2i) = vocabSize + a
216       binary(min2i) = 1
217       a += 1
218     }
219     // Now assign binary code to each vocabulary word
220     var i = 0
221     a = 0
222     while (a < vocabSize) {
223       var b = a
224       i = 0
225       while (b != vocabSize * 2 - 2) {
226         code(i) = binary(b)
227         point(i) = b
228         i += 1
229         b = parentNode(b)
230       }
231       vocab(a).codeLen = i
232       vocab(a).point(0) = vocabSize - 2
233       b = 0
234       while (b < i) {
235         vocab(a).code(i - b - 1) = code(b)
236         vocab(a).point(i - b) = point(b) - vocabSize
237         b += 1
238       }
239       a += 1
240     }
241   }
242 
243   /**
244    * Computes the vector representation of each word in vocabulary.
245    * @param dataset an RDD of words
246    * @return a Word2VecModel
247    */
248   def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VectorModel = {
249 
 250     val words = dataset.flatMap(x => x)       // flatten into a word sequence; sentence boundaries are carried by each Iterable
251 
 252     learnVocab(words)        // learn the vocabulary
253 
254     createBinaryTree()
255 
256     val sc = dataset.context
257 
258     val expTable = sc.broadcast(createExpTable())
259     val bcVocab = sc.broadcast(vocab)
260     val bcVocabHash = sc.broadcast(vocabHash)
261 
 262     val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>            // split into sentences; words represented as Int indices
263       new Iterator[Array[Int]] {
264         def hasNext: Boolean = iter.hasNext
265 
266         def next(): Array[Int] = {
267           var sentence = new ArrayBuffer[Int]
268           var sentenceLength = 0
269           while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
270             val word = bcVocabHash.value.get(iter.next())
271             word match {
272               case Some(w) =>
273                 sentence += w
274                 sentenceLength += 1
275               case None =>
276             }
277           }
278           sentence.toArray
279         }
280       }
281     }
282 
283     //Hierarchical Softmax
284     val newSentences = sentences.repartition(numPartitions).cache()
285     val initRandom = new XORShiftRandom(seed)
286     val syn0Global =
287       Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
288     val syn1Global = new Array[Float](vocabSize * vectorSize)
289     var alpha = startingAlpha
290     for (k <- 1 to numIterations) {
291       val partial = newSentences.mapPartitionsWithIndex { case (idx, iter) =>
 292         val random = new XORShiftRandom(seed ^ ((idx + 1) << 16) ^ ((-k - 1) << 8))    // per-partition RNG for stochastic gradient descent
293         val syn0Modify = new Array[Int](vocabSize)
294         val syn1Modify = new Array[Int](vocabSize)
295         val model = iter.foldLeft((syn0Global, syn1Global, 0, 0)) {
296           case ((syn0, syn1, lastWordCount, wordCount), sentence) =>
297             var lwc = lastWordCount
298             var wc = wordCount
299             if (wordCount - lastWordCount > 10000) {
300               lwc = wordCount
301               // TODO: discount by iteration?
302               alpha =
303                 startingAlpha * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
304               if (alpha < startingAlpha * 0.0001) alpha = startingAlpha * 0.0001
305               logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
306             }
307             wc += sentence.size
308             var pos = 0
309             while (pos < sentence.size) {
310               val word = sentence(pos)
311               val b = random.nextInt(window)
312               // Train Skip-gram
313               var a = b
314               while (a < window * 2 + 1 - b) {
315                 if (a != window) {
316                   val c = pos - window + a
317                   if (c >= 0 && c < sentence.size) {
318                     val lastWord = sentence(c)
319                     val l1 = lastWord * vectorSize
320                     val neu1e = new Array[Float](vectorSize)
321                     // Hierarchical softmax
322                     var d = 0
323                     while (d < bcVocab.value(word).codeLen) {
324                       val inner = bcVocab.value(word).point(d)
325                       val l2 = inner * vectorSize
326                       // Propagate hidden -> output
327                       var f = blas.sdot(vectorSize, syn0, l1, 1, syn1, l2, 1)
328                       if (f > -MAX_EXP && f < MAX_EXP) {
329                         val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toInt
330                         f = expTable.value(ind)
331                         val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat
332                         blas.saxpy(vectorSize, g, syn1, l2, 1, neu1e, 0, 1)
333                         blas.saxpy(vectorSize, g, syn0, l1, 1, syn1, l2, 1)
334                         syn1Modify(inner) += 1
335                       }
336                       d += 1
337                     }
338                     blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn0, l1, 1)
339                     syn0Modify(lastWord) += 1
340                   }
341                 }
342                 a += 1
343               }
344               pos += 1
345             }
346             (syn0, syn1, lwc, wc)
347         }
348         val syn0Local = model._1
349         val syn1Local = model._2
350         // Only output modified vectors.
351         Iterator.tabulate(vocabSize) { index =>
352           if (syn0Modify(index) > 0) {
353             Some((index, syn0Local.slice(index * vectorSize, (index + 1) * vectorSize)))
354           } else {
355             None
356           }
357         }.flatten ++ Iterator.tabulate(vocabSize) { index =>
358           if (syn1Modify(index) > 0) {
359             Some((index + vocabSize, syn1Local.slice(index * vectorSize, (index + 1) * vectorSize)))
360           } else {
361             None
362           }
363         }.flatten
364       }
365       val synAgg = partial.reduceByKey { case (v1, v2) =>
366         blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1)
367         v1
368       }.collect()
369       var i = 0
370       while (i < synAgg.length) {
371         val index = synAgg(i)._1
372         if (index < vocabSize) {
373           Array.copy(synAgg(i)._2, 0, syn0Global, index * vectorSize, vectorSize)
374         } else {
375           Array.copy(synAgg(i)._2, 0, syn1Global, (index - vocabSize) * vectorSize, vectorSize)
376         }
377         i += 1
378       }
379     }
380     newSentences.unpersist()
381 
382     val word2VecMap = mutable.HashMap.empty[String, Array[Float]]
383     var i = 0
384     while (i < vocabSize) {
385       val word = bcVocab.value(i).word
386       val vector = new Array[Float](vectorSize)
387       Array.copy(syn0Global, i * vectorSize, vector, 0, vectorSize)
388       word2VecMap += word -> vector
389       i += 1
390     }
391 
392     new Word2VectorModel(word2VecMap.toMap)
393   }
394 
395   /**
396    * Computes the vector representation of each word in vocabulary (Java version).
397    * @param dataset a JavaRDD of words
398    * @return a Word2VecModel
399    */
400   def fit[S <: JavaIterable[String]](dataset: JavaRDD[S]): Word2VectorModel = {
401     fit(dataset.rdd.map(_.asScala))
402   }
403 
404 }
405 
406 /**
407  * :: Experimental ::
408  * Word2Vec model
409  */
410 @Experimental
411 class Word2VectorModel private[mllib] (
412                                      private  val model: Map[String, Array[Float]]) extends Serializable {
413 
414   private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
415     require(v1.length == v2.length, "Vectors should have the same length")
416     val n = v1.length
417     val norm1 = blas.snrm2(n, v1, 1)
418     val norm2 = blas.snrm2(n, v2, 1)
419     if (norm1 == 0 || norm2 == 0) return 0.0
420     blas.sdot(n, v1, 1, v2,1) / norm1 / norm2
421   }
422 
423   /**
424    * Transforms a word to its vector representation
425    * @param word a word
426    * @return vector representation of word
427    */
428   def transform(word: String): Vector = {
429     model.get(word) match {
430       case Some(vec) =>
431         Vectors.dense(vec.map(_.toDouble))
432       case None =>
433         throw new IllegalStateException(s"$word not in vocabulary")
434     }
435   }
436 
437   /**
438    * Find synonyms of a word
439    * @param word a word
440    * @param num number of synonyms to find
441    * @return array of (word, similarity)
442    */
443   def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
444     val vector = transform(word)
445     findSynonyms(vector,num)
446   }
447 
448   /**
449    * Find synonyms of the vector representation of a word
450    * @param vector vector representation of a word
451    * @param num number of synonyms to find
452    * @return array of (word, cosineSimilarity)
453    */
454   def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
455     require(num > 0, "Number of similar words should > 0")
456     // TODO: optimize top-k
457     val fVector = vector.toArray.map(_.toFloat)
458     model.mapValues(vec => cosineSimilarity(fVector, vec))
459       .toSeq
460       .sortBy(- _._2)
461       .take(num + 1)
462       .tail
463       .toArray
464   }
465 
466 
467   def getModel(): Map[String, Array[Float]] = {
468      model
469   }
470 
471 
472 }
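To close, a short sketch of how the class above could be driven (sc, the corpus path, and the query word are placeholders; this assumes the file above is compiled into the project):

// Hypothetical driver for Word2VectorEX / Word2VectorModel.
val sentences = sc.textFile("corpus.txt").map(_.split(" ").toSeq)   // RDD[Seq[String]]

val model = new Word2VectorEX()
  .setVectorSize(100)
  .setLearningRate(0.025)
  .setNumPartitions(4)
  .setNumIterations(1)
  .setSeed(42L)
  .fit(sentences)

val vec = model.transform("china")             // dense vector of one word
val nearest = model.findSynonyms("china", 10)  // top-10 (word, cosine similarity)
nearest.foreach { case (w, sim) => println(s"$w\t$sim") }
val raw: Map[String, Array[Float]] = model.getModel()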