[Andrew Ng Team NLP Course 2_3] Word Embeddings and Neural Networks

Overview

Main applications

  • Semantic analogies and similarity
  • Sentiment analysis
  • Classification of customer feedback

Advanced applications

  • Machine translation
  • Information extraction
  • Question answering

Learning objectives

  • Identify the key concepts of word representations
  • Generate word embeddings
  • Prepare text for machine learning
  • Implement the continuous bag-of-words model

Basic word representation

Integers

Assign each word in the vocabulary a unique integer.

image-20220222224818873
  • Simple

  • Numbers are assigned in alphabetical order and carry no semantic meaning, e.g. there is no reason why "happy" should map to a larger value than "hand"

    image-20220222225004162
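As a minimal sketch of this representation (the toy vocabulary below is made up for illustration):

```python
# Map each word of a sorted toy vocabulary to a unique integer.
vocab = sorted({'hand', 'happy', 'zebra', 'a', 'able', 'about'})
word2int = {word: i for i, word in enumerate(vocab)}
print(word2int['hand'], word2int['happy'])  # 3 4 -- alphabetical order, no semantic meaning
```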

One-hot vectors

image-20220222230059872 image-20220222230158138
  • Simple

  • No implied ordering: a one-hot vector does not imply any relationship between two words

  • Huge vectors: the dimension equals the vocabulary size, which can exceed one million

    image-20220222230547237
  • No embedded meaning: the vectors carry no information about word meaning, e.g. similarity cannot be computed, even though "happy" should be closer to "excited"

image-20220222230723520
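A small numpy sketch of building a one-hot vector (the vocabulary is a made-up example):

```python
import numpy as np

vocab = ['ambitious', 'excited', 'hand', 'happy', 'paper']   # toy vocabulary
word2ind = {word: i for i, word in enumerate(vocab)}

def word_to_one_hot(word, word2ind, V):
    # Length-V vector of zeros with a single 1 at the word's index.
    vec = np.zeros(V)
    vec[word2ind[word]] = 1
    return vec

print(word_to_one_hot('happy', word2ind, len(vocab)))  # [0. 0. 0. 1. 0.]
```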

Word embeddings

Meaning as vectors

image-20220222231543561

Notice that "happy" is closer to "excited" than to "paper".

image-20220222231727885

We gain some semantic meaning but give up some precision compared with one-hot vectors: in this 2-D plot, two words such as "spider" and "snake" may end up at the same point.

Word embeddings

  • Low dimension

  • Embed meaning

    • e.g. semantic distance

      forest ≈ tree, forest ≉ ticket

    • e.g. analogies (a short sketch of both ideas follows this list)

      Paris:France::Rome:?
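Here is a short sketch of both ideas using made-up 2-D embeddings (the vectors are purely illustrative, not learned by any model):

```python
import numpy as np

# Toy 2-D word embeddings, invented for illustration only.
emb = {
    'forest': np.array([4.0, 3.0]),
    'tree':   np.array([3.5, 2.8]),
    'ticket': np.array([-2.0, 1.0]),
    'Paris':  np.array([1.0, 5.0]),
    'France': np.array([1.2, 6.0]),
    'Rome':   np.array([3.0, 4.8]),
    'Italy':  np.array([3.2, 5.8]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantic distance: forest is closer to tree than to ticket.
print(cosine_similarity(emb['forest'], emb['tree']))    # high (close to 1)
print(cosine_similarity(emb['forest'], emb['ticket']))  # low (negative here)

# Analogy Paris:France :: Rome:?  ->  find the word closest to France - Paris + Rome.
target = emb['France'] - emb['Paris'] + emb['Rome']
candidates = {w: cosine_similarity(target, v)
              for w, v in emb.items() if w not in ('France', 'Paris', 'Rome')}
print(max(candidates, key=candidates.get))  # 'Italy' with these toy vectors
```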

Terminology

image-20220222232549076

Summary

  • Words as integers
  • Words as vectors.
    • One-hot vectors
    • Word embedding vectors
  • Benefits of word embeddings for NLP

How to create word embedding

image-20220223174950711

Methods

Basic word embedding methods

  • word2vec (Google, 2013)
    • Continuous bag-of-words (CBOW): predicts a missing center word from the surrounding context words
    • Continuous skip-gram / Skip-gram with negative sampling (SGNS): predicts the surrounding context words from a given center word
  • Global Vectors (GloVe) (Stanford, 2014)
  • fastText (Facebook, 2016)
    • Supports out-of-vocabulary (OOV) words

Advanced word embedding methods

Deep learning, contextual embeddings: a word gets a different embedding depending on its context, which helps with polysemous and similar words.

  • BERT (Google, 2018)
  • ELMo (Allen Institute for AI, 2018)
  • GPT-2 (OpenAI, 2018)

Tunable pre-trained models are available for all of the above.

Continuous bag-of-words (CBOW) model

image-20220223180041483

Predict a missing center word based on the surrounding words.

If two words tend to appear surrounded by similar words, they are likely to be semantically related.

image-20220223180311454

Create a training example

image-20220223180423738

C, the context half-size, is a model hyperparameter that can be changed. The center word together with its context words is called the window; here the window size is 1 + 2 + 2 = 5.

image-20220223180605876

From corpus to training

e.g. the window first covers "I am happy because I"; it then slides one word to the right to "am happy because I am", and so on.

Repeating this process over the corpus produces the training examples.

image-20220223180949447

CBOW in a nutshell

image-20220223181115987

Cleaning and tokenization

image-20220223181447627

Example in Python: corpus

image-20220223181544942
# pip install nltk
# pip install emoji

import nltk
from nltk.tokenize import word_tokenize
import emoji
import re

nltk.download('punkt')  # download pre-trained Punkt tokenizer for English
corpus = 'Who ♥ "word embeddings" in 2020? I do!!!'
data = re.sub(r'[,!?;-]+', '.', corpus)  # replace interrupting punctuation with '.'
data = word_tokenize(data)               # tokenize string to words
data = [ch.lower() for ch in data
        if ch.isalpha()
        or ch == '.'
        or emoji.get_emoji_regexp().search(ch)  # keep emoji (requires emoji < 2.0)
        ]
print(data)
# ['who', '♥', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']
# Sliding window of words
def get_windows(words, C):
    """Yield (context_words, center_word) pairs using a half-window of size C."""
    i = C
    while i < len(words) - C:
        center_word = words[i]
        context_words = words[(i - C):i] + words[(i + 1):(i + C + 1)]
        yield context_words, center_word
        i += 1
image-20220223183158009 image-20220223183227830
for x, y in get_windows(['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(f'{x}\t{y}')
# ['i', 'am', 'because', 'i']	happy
# ['am', 'happy', 'i', 'am']	because
# ['happy', 'because', 'am', 'learning']	i

Transforming words into vectors

image-20220223184401213

Average of individual one-hot vectors

image-20220223184510702

Final prepared training set

image-20220223184615652

The vectors are written as row vectors here just for readability.
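Putting the pieces together, here is a sketch of how each window becomes a training pair: x is the average of the context words' one-hot vectors and y is the center word's one-hot vector. It reuses the get_windows function defined earlier; the helper names are my own.

```python
import numpy as np

def word_to_one_hot_vector(word, word2ind, V):
    vec = np.zeros(V)
    vec[word2ind[word]] = 1
    return vec

def context_words_to_vector(context_words, word2ind, V):
    # Average of the context words' one-hot vectors.
    return np.mean([word_to_one_hot_vector(w, word2ind, V) for w in context_words], axis=0)

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
vocab = sorted(set(words))                      # ['am', 'because', 'happy', 'i', 'learning']
word2ind = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

for context_words, center_word in get_windows(words, 2):
    x = context_words_to_vector(context_words, word2ind, V)
    y = word_to_one_hot_vector(center_word, word2ind, V)
    print(x, y)
# First pair: x = [0.25 0.25 0.   0.5  0.  ], y = [0. 0. 1. 0. 0.]
```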

Architecture of the CBOW model

image-20220223190435424

Dimensions (single input)

image-20220223191044606

Row and column vectors are converted into one another by transposition.

image-20220223191515925

Dimensions (batch input)

m: batch size (a hyperparameter)

The m input column vectors are assembled into a V × m matrix called X.

image-20220223192018015

The output matrix Ŷ is split back into m column vectors ŷ, one per example.

image-20220223192436437

Rectified Linear Unit (ReLU)

image-20220223211831238 image-20220223212148972
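A minimal sketch of ReLU as used in the hidden layer (the input vector z is just an example):

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z), applied elementwise.
    return np.maximum(0, z)

z = np.array([2.71, -0.03, 4.50, -1.20])
print(relu(z))  # [2.71 0.   4.5  0.  ] -- negative entries are set to 0
```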

Softmax

image-20220223211831238 image-20220223212148972

e.g.

image-20220223212329501
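A sketch of the softmax that turns the output layer's scores z into probabilities ŷ (the values of z are made up):

```python
import numpy as np

def softmax(z):
    # softmax(z)_k = exp(z_k) / sum_j exp(z_j); the outputs are positive and sum to 1.
    e_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e_z / e_z.sum()

z = np.array([9.0, 8.0, 11.0, 8.0, 8.5])
y_hat = softmax(z)
print(y_hat.round(3))  # [0.103 0.038 0.759 0.038 0.062]
print(y_hat.sum())     # sums to 1 (up to floating-point error)
```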

Training a CBOW model

Loss

image-20220223212809888

Cross-entropy loss

Commonly used for classification.

\[J=-\sum_{k=1}^V y_k\log \hat y_k \]

image-20220223213108780 image-20220223213125223

image-20220223213239875 image-20220223213401536

\[J=-\log \hat y_{actual\ word} \]

image-20220223213506088
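A sketch of the loss for a single example: y is the one-hot vector of the actual center word and ŷ is the predicted probability vector (reusing the softmax output above), so the loss reduces to −log ŷ of the actual word.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    # J = -sum_k y_k * log(y_hat_k); with a one-hot y this is -log(y_hat[actual word])
    return -np.sum(y * np.log(y_hat))

y     = np.array([0, 0, 1, 0, 0])                        # actual center word is word 2
y_hat = np.array([0.103, 0.038, 0.759, 0.038, 0.062])    # predicted probabilities (example)
print(cross_entropy_loss(y_hat, y))  # 0.2757... = -log(0.759)
```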

Forward propagation

Training process

  • Forward propagation
  • Cost
  • Backpropagation and gradient descent

Forward propagation

image-20220223214230885
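A hedged sketch of the forward pass for a batch: X has one averaged context vector per column, W1, W2, b1, b2 are the trainable parameters, and the relu/softmax helpers are repeated so the sketch is self-contained. The sizes and the random initialization are only for illustration.

```python
import numpy as np

np.random.seed(0)
V, N, m = 5, 3, 4            # vocabulary size, embedding size, batch size (toy values)

W1 = np.random.rand(N, V)    # (N, V)
W2 = np.random.rand(V, N)    # (V, N)
b1 = np.random.rand(N, 1)    # (N, 1)
b2 = np.random.rand(V, 1)    # (V, 1)
X  = np.random.rand(V, m)    # batch of m input vectors, one per column

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e_z / e_z.sum(axis=0, keepdims=True)

def forward_prop(X, W1, W2, b1, b2):
    Z1 = W1 @ X + b1         # (N, m); b1 broadcasts across the batch
    H  = relu(Z1)            # (N, m) hidden layer
    Z2 = W2 @ H + b2         # (V, m)
    Y_hat = softmax(Z2)      # (V, m); each column sums to 1
    return Y_hat, H

Y_hat, H = forward_prop(X, W1, W2, b1, b2)
print(Y_hat.shape, H.shape)  # (5, 4) (3, 4)
```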

Cost

Cost: mean of losses

\[J_{batch}=-\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^V y_j^{(i)}\log \hat y_j^{(i)}=\frac{1}{m}\sum_{i=1}^m J^{(i)} \]

image-20220223214703258
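Continuing the batch sketch above, the cost is the mean of the per-example cross-entropy losses:

```python
import numpy as np

def compute_cost(Y_hat, Y):
    # J_batch = -(1/m) * sum_i sum_j Y[j, i] * log(Y_hat[j, i])
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

# One one-hot column per example (V = 5, m = 4), matching Y_hat from the forward pass above.
Y = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
print(compute_cost(Y_hat, Y))  # a single scalar cost for the batch
```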

Minimizing the cost

  • Backpropagation:calculate partial derivatives of cost with respect to
    weights and biases

    \[\frac{\partial J_{batch}}{\partial W_1}, \frac{\partial J_{batch}}{\partial W_2}, \frac{\partial J_{batch}}{\partial b_1}, \frac{\partial J_{batch}}{\partial b_2} \]

    \[J_{batch}=f(W_1,W_2,b_1,b_2) \]

  • Gradient descent:update weights and biases

Backpropagation

image-20220223215651541

Gradient descent

image-20220223215821200
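A hedged sketch of the gradients for this architecture and of the gradient-descent update, following the batch shapes above. With softmax plus cross-entropy, the gradient at the output layer is simply Ŷ − Y; propagating it through W2 and the ReLU (which zeroes the gradient wherever the hidden unit was inactive) gives the first layer's gradients.

```python
import numpy as np

def back_prop(X, Y, Y_hat, H, W2, m):
    dZ2 = Y_hat - Y                                      # (V, m), softmax + cross-entropy
    dZ1 = W2.T @ dZ2                                     # (N, m)
    dZ1[H <= 0] = 0                                      # ReLU derivative: inactive units get 0
    grad_W2 = dZ2 @ H.T / m                              # (V, N)
    grad_b2 = np.sum(dZ2, axis=1, keepdims=True) / m     # (V, 1)
    grad_W1 = dZ1 @ X.T / m                              # (N, V)
    grad_b1 = np.sum(dZ1, axis=1, keepdims=True) / m     # (N, 1)
    return grad_W1, grad_W2, grad_b1, grad_b2

def gradient_descent_step(params, grads, alpha=0.03):
    # params and grads are matching tuples (W1, W2, b1, b2); alpha is the learning rate.
    return tuple(p - alpha * g for p, g in zip(params, grads))

# Example, continuing the batch sketch above:
# grads = back_prop(X, Y, Y_hat, H, W2, m=X.shape[1])
# W1, W2, b1, b2 = gradient_descent_step((W1, W2, b1, b2), grads, alpha=0.03)
```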

Extracting word embedding vectors

Option 1

image-20220223220117584

Option 2

image-20220223220211873

Option 3

Take the average of the first two options.

image-20220223220303294
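A sketch of the three options, assuming W1 has shape (N, V) with column i belonging to vocabulary word i, and W2 has shape (V, N) with row i belonging to word i (random values stand in for trained weights):

```python
import numpy as np

V, N = 5, 3
W1 = np.random.rand(N, V)   # trained first-layer weights (here random for illustration)
W2 = np.random.rand(V, N)   # trained second-layer weights

embeddings_option1 = W1.T               # option 1: columns of W1, one row per word (V, N)
embeddings_option2 = W2                 # option 2: rows of W2 (V, N)
embeddings_option3 = (W1.T + W2) / 2    # option 3: average of the two (V, N)

word2ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
print(embeddings_option3[word2ind['happy']])  # embedding vector for 'happy'
```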

Evaluating word embeddings

Intrinsic evaluation

Test relationships between words

  • Analogies

    • Semantic analogies
      “France"is to"Paris"as"Italy"is to<?>

    • Syntactic analogies
      “seen"is to"saw"as"been"is to<?>

    • Ambiguity
      "wolf" is to "pack" as "bee" is to <?> → swarm? colony?

    • It does poorly on chemical-element analogies; word2vec gives:

      image-20220223220934339
  • Clustering

    image-20220223221114537
  • Visualization

    image-20220223221212841

Extrinsic evaluation

Test word embeddings on external task

e.g. named entity recognition, parts-of-speech tagging

  • Evaluates actual usefulness of embeddings
  • More time-consuming than intrinsic evaluation
  • More difficult to troubleshoot

Conclusion

Recap and assignment

  • Data preparation
  • Word representations
  • Continuous bag-of-words model
  • Evaluation

Going further

  • Advanced language modelling and word embeddings
  • NLP and machine learning libraries
image-20220223221943171