Sequence Model - Natural Language Processing & Word Embeddings
Word Embeddings
Word Representation
- 1-hot representation: the inner product of any two different one-hot vectors is \(0\), so it cannot capture similarity between words
- Featurized representation: word embedding
Visualizing word embeddings
t-SNE algorithm: \(300 \mathrm D \to 2 \mathrm D\)
it learns to place concepts that feel like they should be related close to each other
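As a rough illustration, here is a minimal t-SNE sketch, assuming a dict `word_to_vec_map` from word to 300-D vector (filled with random placeholders here so the snippet runs on its own):

```python
# Minimal t-SNE visualization sketch; word_to_vec_map is a stand-in for real embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "apple", "orange", "grape", "dog", "cat"]
rng = np.random.default_rng(0)
word_to_vec_map = {w: rng.normal(size=300) for w in words}   # placeholder for GloVe/word2vec vectors

X = np.stack([word_to_vec_map[w] for w in words])            # shape (9, 300)
X_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)  # 300-D -> 2-D

plt.scatter(X_2d[:, 0], X_2d[:, 1])
for (x, y), w in zip(X_2d, words):
    plt.annotate(w, (x, y))
plt.show()
```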
Using word embeddings
Named entity recognition example
the training set for the new task may be much smaller, so transferring word embeddings learned from a large corpus lets you carry out transfer learning
Transfer learning and word embeddings
- Learn word embeddings from a large text corpus (\(1 \sim 100\ \mathrm B\) words), or download pre-trained embeddings online (see the loading sketch below).
- Transfer the embeddings to a new task with a smaller training set (say, 100k words).
- Optional: continue to fine-tune the word embeddings with the new data.
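A sketch of steps 1 and 2 in practice: loading pre-trained GloVe vectors from their standard whitespace-separated text format into a dict (the file path and function name are illustrative):

```python
# Sketch: read pre-trained GloVe vectors ("word v1 v2 ... vd" per line) into a dict.
import numpy as np

def read_glove_vecs(glove_file):
    word_to_vec_map = {}
    with open(glove_file, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            word_to_vec_map[word] = np.asarray(values, dtype=np.float64)
    return word_to_vec_map

# word_to_vec_map = read_glove_vecs("glove.6B.50d.txt")   # path is an assumption
```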
Properties of Word Embeddings
Analogies
\(\text{Man} \to \text{Woman}\ \text{ as }\ \text{King} \to\ ?\)
\(e_{\text{man}} - e_{\text{woman}} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} \approx e_{\text{king}} - e_{\text{queen}}\)
\(e_? \approx e_\text{king} - e_\text{man} + e_\text{woman} \approx e_{\text{queen}}\)
find the word \(w\) given by \(\argmax_w \text{sim}(e_w, e_\text{king} - e_\text{man} + e_\text{woman})\)
- Cosine similarity\[\text{sim}(u, v) = \frac{u^{T}v}{||u||_2 ||v||_2} \]
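A small sketch of the analogy search using this cosine similarity (`word_to_vec_map` is again a placeholder dict of word to vector):

```python
import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = u.v / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    # a is to b as c is to ?  ->  argmax_w sim(e_w, e_b - e_a + e_c)
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    target = e_b - e_a + e_c
    best_word, best_sim = None, -float("inf")
    for w, e_w in word_to_vec_map.items():
        if w in (word_a, word_b, word_c):
            continue                          # skip the input words themselves
        sim = cosine_similarity(e_w, target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# complete_analogy("man", "woman", "king", word_to_vec_map) should come out near "queen"
```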
Embedding Matrix
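In the lecture's running example \(E\) is a \(300 \times 10000\) matrix, and multiplying it by the one-hot vector \(o_j\) just selects column \(j\): \(E\, o_j = e_j\). A quick NumPy check of that relationship (the sizes only follow the lecture's convention):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(emb_dim, vocab_size))   # embedding matrix, shape (300, 10000)

j = 6527                                     # e.g. the index of "orange"
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                 # one-hot vector for word j

assert np.allclose(E @ o_j, E[:, j])         # E o_j is exactly column j of E
```

In practice frameworks use a lookup (e.g. Keras's `Embedding` layer, as in the homework below) rather than this full matrix-vector product, which is wasteful.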
Learning Word Embeddings: Word2vec & GloVe
Learning Word Embeddings
- Neural language model: mask out a word and build a network to predict it from its context; the learned parameters include the embedding matrix \(E\)
- Other context/target pairs.
Context: last 4 words / 4 words on the left & right / last 1 word / a nearby 1 word (skip-gram), e.g.
\(\text{a glass of orange } \underline{?} \text{ to go along with}\)
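A toy sketch of sampling skip-gram-style (context, target) pairs from a sentence; the window size and tokenization are arbitrary choices for illustration:

```python
import random

sentence = "i want a glass of orange juice to go along with my cereal".split()

def sample_context_target(tokens, window=4):
    # Pick a random context position, then a random target within +-window of it.
    c = random.randrange(len(tokens))
    lo, hi = max(0, c - window), min(len(tokens), c + window + 1)
    t = random.choice([k for k in range(lo, hi) if k != c])
    return tokens[c], tokens[t]

pairs = [sample_context_target(sentence) for _ in range(5)]
print(pairs)   # e.g. [('orange', 'juice'), ('glass', 'of'), ...]
```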
Word2Vec
Skip-grams
come up with a few context-to-target pairs to create our supervised learning problem
- Model
\(\text{Vocab size} = 10000\)
\(\text{Context } c \text{ "orange" (6527)} \to \text{Target } t \text{ "juice" (4834)}\)
\(O_c \to E \to e_c\ (= E \, O_c) \to \text{softmax} \to \hat y\)
\[\text{softmax}: P(t | c) = \frac{e^{\theta_t^T e_c}}{\sum_{j = 1}^{10000} e^{\theta_j^T e_c}} \]\(\theta_t\) is a parameter vector associated with output \(t\)
\[\text{Loss}: \mathcal L(\hat y, y) = - \sum_{i = 1}^{10000} y_i \log \hat y_i \]
- Problem with softmax classification: the denominator sums over the entire 10,000-word vocabulary, so the computation cost is too high
- Solution: hierarchical softmax classifier (a NumPy sketch of the plain softmax model follows below)
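For reference, a NumPy sketch of the full softmax model and loss above, with random parameters standing in for \(E\) and \(\theta\); the sum over all 10,000 words in the denominator is exactly the expensive part:

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(emb_dim, vocab_size)) * 0.01       # embedding matrix
Theta = rng.normal(size=(emb_dim, vocab_size)) * 0.01   # softmax parameters; column t is theta_t

def skipgram_softmax_loss(c, t):
    e_c = E[:, c]                                 # embedding of context word c
    logits = Theta.T @ e_c                        # theta_j^T e_c for every j in the vocabulary
    logits -= logits.max()                        # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()     # P(t | c): softmax over all 10,000 words
    return -np.log(p[t])                          # cross-entropy loss for the true target t

loss = skipgram_softmax_loss(c=6527, t=4834)      # "orange" (6527) -> "juice" (4834)
```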
Negative Sampling
context | word | target? |
---|---|---|
orange | juice | 1 |
orange | king | 0 |
orange | book | 0 |
orange | the | 0 |
orange | of | 0 |
Defining a new learning problem & Model
- Pick a context word and a target word from the text to get a positive example.
- Pick \(k\) random words from the dictionary and pair each with the same context word to get \(k\) negative examples.
\[k = \begin{cases} 5 \sim 20 & (\text{small dataset})\\ 2 \sim 5 & (\text{large dataset}) \end{cases} \]
- Train 10,000 binary classifiers, updating only the \(k+1\) of them touched by each example, instead of one 10,000-way softmax classifier; the computation cost is much lower (see the sketch below).
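A sketch of generating one positive plus \(k\) negative examples for a context word; the uniform random choice here stands in for the frequency-based heuristic described next:

```python
import random

def negative_sampling_examples(context, positive_target, vocab, k):
    # One positive (label 1) plus k negatives (label 0) sharing the same context word.
    examples = [(context, positive_target, 1)]
    for _ in range(k):
        examples.append((context, random.choice(vocab), 0))   # uniform stand-in for the 3/4-power heuristic
    return examples

vocab = ["juice", "king", "book", "the", "of", "a", "orange"]
print(negative_sampling_examples("orange", "juice", vocab, k=4))
# e.g. [('orange', 'juice', 1), ('orange', 'the', 0), ('orange', 'book', 0), ...]
```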
Selecting negative examples
Sample \(w_i\) with probability \(\displaystyle P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{10000} f(w_j)^{3/4}}\), where \(f(w_i)\) is the observed frequency of \(w_i\) (a compromise between sampling by frequency and sampling uniformly).
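A small sketch of this sampling heuristic, assuming raw word counts are available (the counts here are toy numbers):

```python
import numpy as np

counts = {"the": 5000, "of": 3000, "orange": 40, "juice": 25, "durian": 2}   # toy counts
words = list(counts)
f = np.array([counts[w] for w in words], dtype=np.float64)

p = f ** 0.75          # f(w_i)^(3/4)
p /= p.sum()           # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

negatives = np.random.default_rng(0).choice(words, size=5, p=p)
print(dict(zip(words, p.round(3))), negatives)
```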
GloVe Word Vectors
GloVe (global vectors for word representation)
\(X_{ct} = X_{ij} = \) number of times word \(i\) (the target) appears in the context of word \(j\)
\(X_{ij} = X_{ji}\) for a symmetric context window, so \(X_{ij}\) captures how often \(i\) and \(j\) appear close to each other
Minimize \(\displaystyle \sum_{i=1}^{10000} \sum_{j=1}^{10000} f(X_{ij}) \left(\theta_i^T e_j + b_i + b_j' - \log X_{ij}\right)^2\)
\(f(X_{ij})\) is a weighting term: \(f(X_{ij}) = 0\) when \(X_{ij} = 0\) (using the convention \(0 \log 0 = 0\)), and it avoids over-weighting very frequent words or under-weighting rare ones
\(\theta_i\) and \(e_j\) play symmetric roles in this objective, so you can take
\(\displaystyle e_w^{\text{final}} = \frac{e_w + \theta_w}{2}\) .
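A NumPy sketch that just evaluates the GloVe objective above on a toy co-occurrence matrix; the particular weighting function \(f\) used here (a \((x/x_{\max})^{3/4}\) form capped at 1) is one common choice and should be treated as an assumption:

```python
import numpy as np

vocab_size, emb_dim = 50, 10
rng = np.random.default_rng(0)

X = rng.integers(0, 20, size=(vocab_size, vocab_size)).astype(float)   # toy co-occurrence counts
theta = rng.normal(size=(vocab_size, emb_dim)) * 0.1                   # theta_i vectors (rows)
e = rng.normal(size=(vocab_size, emb_dim)) * 0.1                       # e_j vectors (rows)
b = np.zeros(vocab_size)                                               # b_i
b_prime = np.zeros(vocab_size)                                         # b'_j

def weight(x, x_max=100.0, alpha=0.75):
    # Weighting term f(X_ij); f(0) = 0 enforces the 0 * log 0 = 0 convention.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

log_X = np.log(np.maximum(X, 1e-12))                     # log X_ij (the X_ij = 0 terms get weight 0 anyway)
diff = theta @ e.T + b[:, None] + b_prime[None, :] - log_X
loss = np.sum(weight(X) * diff ** 2)                     # the GloVe objective to minimize
```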
Applications Using Word Embeddings
Sentiment Classification
Average the word embeddings of the sentence and use a softmax to predict
But it ignores word order, so it makes some mistakes, e.g. "Completely lacking in good taste, good service, and good ambience." looks positive because "good" appears three times.
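A bare-bones sketch of this averaging baseline (the classifier weights `W`, `b` and the embedding dict are placeholders):

```python
import numpy as np

def sentence_to_avg(sentence, word_to_vec_map):
    # Average the embeddings of the words in the sentence; note this throws away word order.
    words = sentence.lower().split()
    return np.mean([word_to_vec_map[w] for w in words], axis=0)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def predict_sentiment(sentence, word_to_vec_map, W, b):
    # W: (num_classes, emb_dim), b: (num_classes,) -- a plain softmax classifier on the average.
    avg = sentence_to_avg(sentence, word_to_vec_map)
    return softmax(W @ avg + b)
```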
RNN for sentiment classification
Using a many-to-one RNN (feeding in the word embeddings) solves this problem, since the RNN takes word order into account.
Debiasing word embeddings
Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model.
Addressing bias in word embeddings
- Identify the bias direction: average \( \begin{cases} e_{\text{he}} - e_{\text{she}}\\ e_{\text{male}} - e_{\text{female}}\\ \dots \end{cases} \) to get the bias direction (\(1\)-D); the remaining non-bias direction is (\(n-1\))-D. SVD (singular value decomposition, similar to PCA) can be used instead of simple averaging.
- Neutralize: for every word that is not definitional, project it onto the non-bias direction to get rid of the bias component (you need to figure out which words should be neutralized; a classifier such as an SVM can be trained to decide). See the sketch after this list.
- Equalize pairs: e.g. grandmother and grandfather should end up with the same similarity and distance to gender-neutral words. These pairs can be handpicked, since there are not that many of them.
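A sketch of the neutralize step from the list above, using the projection formula from the Bolukbasi et al. debiasing method (the equalize step for the handpicked pairs works similarly but is omitted here):

```python
import numpy as np

def neutralize(word, g, word_to_vec_map):
    # Remove the component of e_word along the bias direction g:
    #   e_debiased = e - (e . g / ||g||^2) * g
    # The result is orthogonal to g, i.e. it carries no bias-direction component.
    e = word_to_vec_map[word]
    e_bias_component = (np.dot(e, g) / np.dot(g, g)) * g
    return e - e_bias_component

# g could be e_he - e_she, or an average / SVD direction over several such difference vectors.
```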
Homework - Emojify
Building the Emojifier-V2
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: Emojify_V2
def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
"""
Function creating the Emojify-v2 model's graph.
Arguments:
input_shape -- shape of the input, usually (max_len,)
word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)
Returns:
model -- a model instance in Keras
"""
### START CODE HERE ###
# Define sentence_indices as the input of the graph.
# It should be of shape input_shape and dtype 'int32' (as it contains indices, which are integers).
sentence_indices = Input(input_shape, dtype = 'int32')
# Create the embedding layer pretrained with GloVe Vectors (≈1 line)
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
# Propagate sentence_indices through your embedding layer
# (See additional hints in the instructions).
embeddings = embedding_layer(sentence_indices)
# Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
# The returned output should be a batch of sequences.
X = LSTM(128, return_sequences = True)(embeddings)
# Add dropout with a probability of 0.5
X = Dropout(0.5)(X)
# Propagate X through another LSTM layer with 128-dimensional hidden state
# The returned output should be a single hidden state, not a batch of sequences.
X = LSTM(128, return_sequences = False)(X)
# Add dropout with a probability of 0.5
X = Dropout(0.5)(X)
# Propagate X through a Dense layer with 5 units
X = Dense(5)(X)
# Add a softmax activation
X = Activation('softmax')(X)
# Create Model instance which converts sentence_indices into X.
model = Model(inputs = sentence_indices, outputs = X)
### END CODE HERE ###
return model
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()
Model: "functional_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 10)] 0
_________________________________________________________________
embedding_3 (Embedding) (None, 10, 50) 20000050
_________________________________________________________________
lstm_2 (LSTM) (None, 10, 128) 91648
_________________________________________________________________
dropout_2 (Dropout) (None, 10, 128) 0
_________________________________________________________________
lstm_3 (LSTM) (None, 128) 131584
_________________________________________________________________
dropout_3 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 5) 645
_________________________________________________________________
activation_1 (Activation) (None, 5) 0
=================================================================
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050
_________________________________________________________________
Compile it
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
Train it
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)