word2vec Study Notes

Preface

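The past month has been busy and draining, and in the last few days before the Lunar New Year I slipped into a state of not wanting to do anything. The best way out of that state is to get back to reading and writing, give myself some positive feedback, and break the negative loop. After the holiday I will be working on NLP-related tasks, so I need a rough understanding of the field. The first topic I plan to study in depth is word2vec, a word embedding model whose main purpose is to find a reasonable vector representation for words: on the one hand it should preserve semantic properties of words (such as similarity), and on the other hand it should keep the vector dimension reasonably small. word2vec is a word embedding that has both properties. Of course there is no free lunch: the representation has to be learned through training. NLP and CV treat features very differently, which was my first impression on entering NLP.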

word2vec theory

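Writing this part in full detail would be painful, and there are plenty of similar tutorials online, so I will not go into detail and only give an outline. Most of the following comes from Stanford CS224d lecture 1. NLP first tokenizes a document and then encodes the tokens. The simplest encoding is the one-hot vector, where each word occupies its own slot. This has two problems: the dimension per word is far too high, and the vectors cannot express relationships between words. word2vec has a front end and a back end. The front end is one of two models, CBOW or skip-gram; the back end is one of two training methods, negative sampling or hierarchical softmax (built on a Huffman tree). Front end and back end can be combined freely, but the commonly used efficient implementations are skip-gram + negative sampling.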
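
To make the limitation of one-hot vectors concrete, here is a minimal NumPy sketch. The three-word vocabulary and the `cosine` helper are made up for illustration and are not part of the TensorFlow example discussed later. It shows that any two distinct one-hot vectors are orthogonal, so the encoding says nothing about which words are related.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "puddle"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, puddle = one_hot
# Distinct one-hot vectors are orthogonal, so "cat" is exactly as
# (dis)similar to "dog" as it is to "puddle".
sim_cat_dog = cosine(cat, dog)
sim_cat_puddle = cosine(cat, puddle)
print(sim_cat_dog, sim_cat_puddle)
```

The vector length also equals the vocabulary size, so with the 50000-word vocabulary used later each word would need a 50000-dimensional vector.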

Skip-gram

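Skip-gram predicts the context of an input word. For example, given the sentence {"The", "cat", "jumped", "over", "the", "puddle"}, the skip-gram model takes the center word "jumped" as input and predicts its context "The", "cat", "over", "the", "puddle", which sounds rather magical. At its core, skip-gram is a logistic regression.

[Figure: the skip-gram model in operation]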
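
The sentence example above can be turned into training pairs with a few lines of Python. This is only a sketch: `skipgram_pairs` is a hypothetical helper, and the window size of 3 is an assumption chosen so that the context of "jumped" covers the rest of the six-word sentence, as in the example.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for c, center in enumerate(tokens):
        lo = max(0, c - window)
        hi = min(len(tokens), c + window + 1)
        for j in range(lo, hi):
            if j != c:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["The", "cat", "jumped", "over", "the", "puddle"]
pairs = skipgram_pairs(sentence, window=3)
jumped_context = [ctx for center, ctx in pairs if center == "jumped"]
print(jumped_context)
```

Each pair is one training example: the center word is the input and the context word is the label to predict.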

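Skip-gram runs in the following steps:

  1. Generate the one-hot input vector \(x\) for the center word.
  2. Look up the embedded vector of the center word, \(v_c = Vx\).
  3. Generate the 2m score vectors \(u_{c-m},...,u_{c-1},u_{c+1},...,u_{c+m}\) using \(u = Uv_c\).
  4. Turn the scores into a probability distribution, \(y=softmax(u)\).
  5. Finally, match the generated probabilities against the true probability distribution.

The skip-gram objective/loss function is: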

\[\begin{eqnarray} minimize\ L &=& -\log P(w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m}|w_c) \\ &=& -\log\prod_{j=0,j\not=m}^{2m}P(w_{c-m+j}|w_c)\\ &=& -\log\prod_{j=0,j\not=m}^{2m}P(u_{c-m+j}|v_c)\\ &=& -\log\prod_{j=0,j\not=m}^{2m}\frac{\exp(u^T_{c-m+j}v_c)}{\sum^{|V|}_{k=1}\exp(u_k^Tv_c)}\\ &=& -\sum_{j=0,j\not=m}^{2m}u^T_{c-m+j}v_c + 2m\log\sum_{k=1}^{|V|}\exp(u_k^Tv_c) \end{eqnarray} \]
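
The steps and the loss above can be checked numerically with a small NumPy sketch. The sizes, the random matrices, and the chosen word indices are assumptions for illustration only. The sketch computes the loss once as the negative log of the product of context probabilities and once in the expanded form of the last line, and the two should agree.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, m = 6, 4, 2            # toy sizes, assumed for this sketch
V = rng.normal(size=(dim, vocab_size))  # input embeddings, one column per word
U = rng.normal(size=(vocab_size, dim))  # output embeddings, one row per word

center = 2               # index of the center word w_c
context = [0, 1, 3, 4]   # indices of the 2m context words

x = np.zeros(vocab_size)
x[center] = 1.0          # step 1: one-hot input vector
v_c = V @ x              # step 2: embedding of the center word
u = U @ v_c              # step 3: one score per vocabulary word
y = np.exp(u - u.max())
y /= y.sum()             # step 4: softmax over the vocabulary

# Loss as the negative log of the product of the context probabilities.
loss_product = -np.log(np.prod(y[context]))
# Loss in the expanded form: -sum of context scores + 2m * log-sum-exp.
loss_expanded = -u[context].sum() + 2 * m * np.log(np.exp(u).sum())
print(loss_product, loss_expanded)
```

Note that the log-sum-exp term runs over every word in the vocabulary, which is the cost that the next section tries to avoid.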

Negative sampling

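The objective/loss function above has to sum over the entire vocabulary of size \(|V|\), which is very expensive, so negative sampling was introduced. The idea is that instead of looping over the whole vocabulary, it is enough to sample a few negative examples from a distribution that matches the word frequencies in the vocabulary. Consider a word-context pair \((w,c)\). Let \(P(D=1|w,c)\) be the probability that \((w,c)\) came from the corpus, and \(P(D=0|w,c)\) the probability that it did not. We have: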

\[P(D=1|w,c,\theta)=\frac{1}{1+\exp(-u^T_wv_c)} \]

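We need to build a new objective function. It should maximize \(P(D=1|w,c)\) when \((w,c)\) really comes from the corpus, and \(P(D=0|w,c)\) when it does not. We can use maximum likelihood estimation to obtain the model parameters.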

\[\begin{eqnarray} \theta &=&\mathop{argmax}_{\theta}\prod_{(w,c)\in D}P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}}P(D=0|w,c,\theta)\\ &=&\mathop{argmax}_{\theta}\prod_{(w,c)\in D}P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}}(1-P(D=1|w,c,\theta))\\ &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u^T_wv_c)}+\sum_{(w,c)\in \tilde{D}}\log(1-\frac{1}{1+\exp(-u^T_wv_c)}) \\ &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u^T_wv_c)}+\sum_{(w,c)\in \tilde{D}}\log\frac{1}{1+\exp(u^T_wv_c)} \\ &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\sigma(u^T_wv_c)+\sum_{(w,c)\in \tilde{D}}\log\sigma(-u^T_wv_c)\\ \end{eqnarray} \]

Here \(\theta\) can be regarded as the \(U,V\) above, and \(\tilde{D}\) denotes the negative corpus. We can further write the objective to maximize as:

\[\begin{eqnarray} \log\sigma(u_{c-m+j}^Tv_c) + \sum^{K}_{k=1}\log\sigma(-\tilde{u}^T_kv_c) \end{eqnarray} \]

Here the \(\tilde{u}_k\) are obtained by negative sampling.
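
As a rough sketch of what this buys us, the following NumPy code evaluates the negative-sampling objective in its maximized form, \(\log\sigma(u^Tv_c)\) for the true context word plus \(\sum_k\log\sigma(-\tilde{u}_k^Tv_c)\) for the K sampled words. The function name, the sizes, and the random vectors are assumptions for illustration. It also checks the identity \(1-\sigma(x)=\sigma(-x)\) that the derivation relies on.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_pos, u_neg):
    """log sigma(u_pos . v_c) + sum_k log sigma(-u_neg[k] . v_c)."""
    positive = np.log(sigmoid(u_pos @ v_c))
    negative = np.log(sigmoid(-(u_neg @ v_c))).sum()
    return positive + negative

rng = np.random.default_rng(0)
dim, K = 4, 5                       # toy sizes, assumed for this sketch
v_c = rng.normal(size=dim)          # center word vector
u_pos = rng.normal(size=dim)        # output vector of a true context word
u_neg = rng.normal(size=(K, dim))   # output vectors of K sampled negative words

objective = neg_sampling_objective(v_c, u_pos, u_neg)

# Identity used when simplifying the log-likelihood: 1 - sigma(x) = sigma(-x).
xs = np.linspace(-3, 3, 7)
identity_holds = np.allclose(1 - sigmoid(xs), sigmoid(-xs))
print(objective, identity_holds)
```

Only K + 1 dot products are needed per training pair, instead of one per vocabulary word.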

A TensorFlow implementation of word2vec

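The above is a rough and very brief introduction to how word2vec works. For the details, read the online article 《word2vec的数学原理》 (The Mathematics of word2vec). Below I walk through the word2vec example that ships with TensorFlow.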

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
import seaborn as sbn
from matplotlib import pylab
%config InlineBackend.figure_format = 'svg'
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
  
words = read_data(filename)
print('Data size %d' % len(words))

The code above downloads the dataset and reads it. What ends up in memory is one very long sequence of words.

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

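The code above encodes the dataset. Because it uses most_common, words are encoded according to how often they occur in the corpus: the more frequent a word, the smaller its id. This matters later for negative sampling.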
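
To see the encoding on something small, here is a condensed sketch of the same idea applied to a made-up nine-word corpus. The `encode` helper and the toy text are assumptions for illustration; the logic mirrors build_dataset with id 0 reserved for 'UNK'.

```python
import collections

def encode(words, vocabulary_size):
    """Map words to ids: id 0 is 'UNK', then ids follow descending frequency."""
    count = [("UNK", -1)]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]  # unknown words map to 0
    return data, dictionary

toy_words = "the cat sat on the mat the cat ran".split()
toy_data, toy_dictionary = encode(toy_words, vocabulary_size=3)
print(toy_dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
print(toy_data)        # [1, 2, 0, 0, 1, 0, 1, 2, 0]
```

The most frequent word "the" gets id 1, the next most frequent "cat" gets id 2, and everything outside the vocabulary collapses to id 0.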

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span) # sliding deque window of size 2*skip_window + 1
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):  # two nested loops: a batch covers batch_size // num_skips center words, each with num_skips labels
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print(batch)
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

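For data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first'], the code above produces output like the following. batch stores word ids. Suppose we take num_skips = 4 and skip_window = 2; then the word as has 4 context words, so as appears 4 times in batch, and the corresponding entries in labels are its context words: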
12 as -> 195 term
12 as -> 5239 anarchism
12 as -> 6 a
12 as -> 3084 originated
6 a -> 12 as
6 a -> 3084 originated
6 a -> 2 of
6 a -> 195 term
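
The sliding window behind this output can be sketched on its own. The code below is a simplified illustration, not the generate_batch function itself: it works on the words rather than the ids, and it lists every context word in order instead of drawing num_skips of them at random.

```python
import collections

tokens = ["anarchism", "originated", "as", "a", "term", "of", "abuse", "first"]
skip_window = 2
span = 2 * skip_window + 1  # [ skip_window | center | skip_window ]

window = collections.deque(maxlen=span)
centers = []
for token in tokens:
    window.append(token)  # once full, the oldest token is dropped automatically
    if len(window) == span:
        center = window[skip_window]
        context = [w for i, w in enumerate(window) if i != skip_window]
        centers.append((center, context))
        print(center, "->", context)
```

The first two windows are centered on as and a, and their context words are the same ones that appear in the listing above.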

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset) # returns the rows of embeddings indexed by train_dataset, in the same order
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.nce_loss(softmax_weights, softmax_biases, embed,
                               train_labels, num_sampled, vocabulary_size)) # speeds up the loss computation when there are too many classes; see the docs for details

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

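The code above is TensorFlow's implementation of the word2vec skip-gram model. It is essentially a logistic regression, and it differs somewhat from the theory above. It uses nce_loss, which includes negative sampling and is described in detail later.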
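
The similarity part of the graph can be reproduced in plain NumPy to see what it computes. This is a sketch under assumptions: the random 10 x 4 matrix stands in for the trained embeddings, and `valid_ids` is a made-up validation set. Rows are scaled to unit length, so the matrix product yields cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))  # toy stand-in for the trained embeddings

# Scale each row to unit length, as the graph does with `norm`.
norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
normalized = embeddings / norm

valid_ids = np.array([0, 3])                        # hypothetical validation words
similarity = normalized[valid_ids] @ normalized.T   # shape [2, 10]

top_k = 3
# After sorting by descending similarity, position 0 is the word itself,
# so it is skipped, just like the [1:top_k+1] slice in the training loop.
nearest = np.argsort(-similarity, axis=1)[:, 1:top_k + 1]
print(nearest)
```

With real embeddings, each row of `nearest` would hold the ids of the words closest to one validation word.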

num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()


num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15,15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.savefig('softmax_loss.svg', format='svg')
  pylab.show()
  

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)

The final result is as follows:

[Figure: t-SNE plot of the learned embeddings]

nce_loss

The source code of nce_loss is as follows:

def nce_loss(weights, # [num_classes, dim]; dim is embedding_size
             biases,  # [num_classes]; num_classes is the number of distinct words
             inputs,  # [batch_size, dim]
             labels,  # [batch_size, num_true]; here num_true is 1, i.e. one output per input
             num_sampled, # number of negative samples to draw (per batch)
             num_classes, # number of classes (here the number of distinct words)
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss"):
  logits, labels = _compute_sampled_logits(
      weights,
      biases,
      inputs,
      labels,
      num_sampled,
      num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = sigmoid_cross_entropy_with_logits(
      logits, labels, name="sampled_losses") 
      # This function returns a tensor with the same shape as logits. After _sum_rows we have the cross entropy of each example.
  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)
  # word2vec calls reduce_mean() on the return value of this function to get the average cross entropy

# Source code of _compute_sampled_logits:
def _compute_sampled_logits(weights,
                            biases,
                            inputs,
                            labels,
                            num_sampled,
                            num_classes,
                            num_true=1,
                            sampled_values=None,
                            subtract_log_q=True,
                            remove_accidental_hits=False,
                            partition_strategy="mod",
                            name=None):
  if not isinstance(weights, list):
    weights = [weights]

  with ops.op_scope(weights + [biases, inputs, labels], name,
                    "compute_sampled_logits"):
    if labels.dtype != dtypes.int64:
      labels = math_ops.cast(labels, dtypes.int64)
    labels_flat = array_ops.reshape(labels, [-1])

    # Sample the negative labels.
    #   sampled shape: [num_sampled] tensor
    #   true_expected_count shape = [batch_size, 1] tensor
    #   sampled_expected_count shape = [num_sampled] tensor
    if sampled_values is None:
      sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
          true_classes=labels,
          num_true=num_true,
          num_sampled=num_sampled,
          unique=True,
          range_max=num_classes)

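NOTE: this function samples from a log-uniform distribution, \(P(class)=\frac{\log(class+2)-\log(class+1)}{\log(range\_max+1)}\), over the range [0, range_max). Sampling this way requires the words to be sorted by frequency from high to low. The earlier preprocessing does exactly that, so the smaller the class id, the higher its probability of being sampled.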
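
The formula in the note is easy to evaluate directly. The sketch below only computes the probabilities in NumPy and does not call the TensorFlow sampler; `log_uniform_probs` is a hypothetical helper, and 50000 is the vocabulary size used earlier.

```python
import numpy as np

def log_uniform_probs(range_max):
    """P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)."""
    classes = np.arange(range_max)  # class ids 0 .. range_max - 1
    return (np.log(classes + 2) - np.log(classes + 1)) / np.log(range_max + 1)

probs = log_uniform_probs(50000)
print(probs[:3], probs[-1])
```

The numerators telescope to \(\log(range\_max+1)\), so the probabilities sum to 1, and they decrease as the class id grows, which is why small ids (frequent words) are drawn as negatives more often.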

sampled_softmax_loss

Some versions of TensorFlow's word2vec example use sampled_softmax_loss as the loss function. It is very similar to nce_loss and takes the same parameters.

def sampled_softmax_loss(weights,
                         biases,
                         labels,
                         inputs,
                         num_sampled,
                         num_classes,
                         num_true=1,
                         sampled_values=None,
                         remove_accidental_hits=True,
                         partition_strategy="mod",
                         name="sampled_softmax_loss"):
  logits, labels = _compute_sampled_logits(
      weights=weights,
      biases=biases,
      labels=labels,
      inputs=inputs,
      num_sampled=num_sampled,
      num_classes=num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = nn_ops.softmax_cross_entropy_with_logits(labels=labels,
                                                            logits=logits)
  # sampled_losses is a [batch_size] tensor.
  return sampled_losses

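The main difference is sigmoid_cross_entropy_with_logits versus softmax_cross_entropy_with_logits: the former does not require the classes to be mutually exclusive, while the latter does. The results from nce_loss are somewhat smoother. The result obtained with sampled_softmax_loss is shown below.

[Figure: t-SNE plot of the embeddings trained with sampled_softmax_loss]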
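
To illustrate the difference between the two cross-entropy functions, here is a NumPy sketch of both applied to one row of sampled logits. These are hand-written stand-ins, not the TensorFlow functions, and the logit values are made up. Column 0 plays the role of the true class and the other columns play the sampled negatives.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Element-wise binary cross entropy; each class is an independent yes/no."""
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def softmax_cross_entropy(logits, labels):
    """One loss per row; classes compete through a shared normalizer."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

# One example: column 0 is the true class, columns 1..3 are sampled negatives.
logits = np.array([[2.0, -1.0, 0.5, -0.3]])
labels = np.array([[1.0, 0.0, 0.0, 0.0]])

# nce_loss style: one binary loss per column, summed over the row.
nce_style = sigmoid_cross_entropy(logits, labels).sum(axis=1)
# sampled_softmax_loss style: a single multi-class loss for the row.
softmax_style = softmax_cross_entropy(logits, labels)
print(nce_style, softmax_style)
```

In the sigmoid version every sampled word contributes its own loss term, while in the softmax version the sampled words only enter through the normalizer of the true class.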

References

Omitted for now.

posted @ 2017-01-26 13:28  liujshi