Deep Learning for Natural Language Processing: vector space models

Three classes of vector space model:

term–document

word–context

pair–pattern

semantics: the meaning of a word, a phrase, a sentence, or any text in human language, and the study of such meaning

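Characteristics:

Information is acquired automatically from a corpus, which saves manual labor

They measure the similarity of words, phrases, and documents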

The Term–Document Matrix: row vectors correspond to terms, column vectors correspond to documents

bag: a set that may contain duplicate elements (a multiset). A collection of bags is represented as a matrix X in which each column x:j corresponds to a bag, each row xi: corresponds to a unique member, and an element xij is the frequency of the i-th member in the j-th bag
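The construction above can be sketched in a few lines. This is a minimal illustration, assuming documents are already tokenized; the function name `term_document_matrix` and the two toy documents are made up for the example.

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a term-document frequency matrix from tokenized documents.

    Returns (terms, X) where X[i][j] is the frequency of terms[i]
    in documents[j].
    """
    bags = [Counter(doc) for doc in documents]   # one bag per document (column)
    terms = sorted(set().union(*bags))           # unique members (rows)
    X = [[bag[t] for bag in bags] for t in terms]
    return terms, X

# Toy corpus, invented for illustration.
docs = [
    "the mason cuts the stone".split(),
    "the carpenter cuts wood".split(),
]
terms, X = term_document_matrix(docs)
for t, row in zip(terms, X):
    print(f"{t:10s} {row}")
```

Each printed row is the row vector of one term; reading down a column gives the bag for one document.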

word–context

The distributional hypothesis in linguistics is that words that occur in similar contexts
tend to have similar meanings
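A rough sketch of how word–context counts might be collected. The notes do not say what counts as a context, so a symmetric window of neighboring words is assumed here; `word_context_counts`, the `window` parameter, and the toy sentence are all illustrative choices.

```python
from collections import Counter, defaultdict

def word_context_counts(tokens, window=2):
    """Count how often each context word appears within `window`
    positions of each target word. Returns {word: Counter(context)}."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts

# Toy token sequence, invented for illustration.
tokens = "the cat drinks milk the dog drinks water".split()
rows = word_context_counts(tokens, window=1)
print(rows["cat"])
print(rows["dog"])
```

In this toy input "cat" and "dog" end up with identical context counts, which is the situation the distributional hypothesis treats as evidence of similar meaning.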

pair–pattern

mason : stone
carpenter : wood

X cuts Y

X works with Y

extended distributional hypothesis: patterns that co-occur with similar pairs tend to have similar meanings

latent relation hypothesis: pairs of words that co-occur in similar patterns tend to have similar semantic relations
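To make the pair–pattern idea concrete, here is a small sketch that counts patterns such as "X cuts Y" for given word pairs. It assumes the pattern is simply the words between X and Y in a sentence; the helper `pair_pattern_counts` and the four sentences are invented for illustration.

```python
from collections import Counter, defaultdict

def pair_pattern_counts(sentences, pairs):
    """For each (x, y) pair, count the patterns "X <words between> Y"
    found in sentences where x occurs before y."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for x, y in pairs:
            if x in tokens and y in tokens:
                i, j = tokens.index(x), tokens.index(y)   # first occurrences
                if i < j:
                    pattern = " ".join(["X"] + tokens[i + 1:j] + ["Y"])
                    counts[(x, y)][pattern] += 1
    return counts

# Toy sentences, invented for illustration.
sentences = [
    "the mason cuts stone",
    "the mason works with stone",
    "the carpenter cuts wood",
    "the carpenter works with wood",
]
pairs = [("mason", "stone"), ("carpenter", "wood")]
rows = pair_pattern_counts(sentences, pairs)
print(dict(rows[("mason", "stone")]))
print(dict(rows[("carpenter", "wood")]))
```

Both pairs get the same row vector over patterns, so under the latent relation hypothesis mason : stone and carpenter : wood would be judged to share a semantic relation.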

attributional similarity (word–context): sim_a(a, b) ∈ R

relational similarity (pair–pattern): sim_r(a : b, c : d) ∈ R
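The notes leave the similarity function unspecified. Cosine of the angle between two row vectors is one widely used choice, and it is the one assumed in this sketch; the same function serves for sim_a (rows of a word–context matrix) and sim_r (rows of a pair–pattern matrix). The vectors below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical rows of a word-context matrix.
cat = {"drinks": 3, "purrs": 4}
dog = {"drinks": 3, "barks": 4}
car = {"drives": 5}
print(cosine(cat, dog))   # shares one context with "cat"
print(cosine(cat, car))   # shares no context with "cat"
```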

A token is a single instance of a symbol, whereas a type is a general class of tokens

Statistical semantics hypothesis

If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings

Bag of words hypothesis

If documents and pseudodocuments (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings

Distributional hypothesis

If words have similar row vectors in a word–context matrix, then they tend to have similar meanings

Extended distributional hypothesis

If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations

Latent relation hypothesis

If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations

Linguistic Processing for Vector Space Models

1. tokenize the raw text: decide what constitutes a term and how to extract terms from raw text

A tokenizer must handle punctuation (e.g., don’t, Jane’s, and/or) and hyphenation (e.g., state-of-the-art versus state of the art), and recognize multi-word terms (e.g., Barack Obama and ice hockey)

2. normalize the raw text: convert superficially different strings of characters to the same form

e.g., case folding

3. annotate the raw text: mark identical strings of characters as being different
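The first two steps can be sketched with the standard library alone. This is only an illustration: the regular expression, the `MULTIWORD_TERMS` list, and the underscore used to join multi-word terms are assumptions of the example. The annotation step is left out because it usually relies on an external tool such as a part-of-speech tagger.

```python
import re

# Hypothetical list of multi-word terms to keep as single tokens.
MULTIWORD_TERMS = {("ice", "hockey"), ("barack", "obama")}

def tokenize(text):
    """Split raw text into word tokens, keeping internal apostrophes
    and hyphens (don't, state-of-the-art) inside a single token."""
    return re.findall(r"[A-Za-z0-9]+(?:['’\-][A-Za-z0-9]+)*", text)

def normalize(tokens):
    """Case folding: map superficially different strings to one form."""
    return [t.lower() for t in tokens]

def merge_multiword(tokens):
    """Join adjacent tokens that form a known two-word term."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORD_TERMS:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

text = "Barack Obama doesn't play ice hockey; it's state-of-the-art."
tokens = merge_multiword(normalize(tokenize(text)))
print(tokens)
```

Keeping "state-of-the-art" as one token is a design choice; splitting on hyphens would be equally defensible and would make it match "state of the art".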

 

Mathematical Processing for Vector Space Models

1. generate a matrix of frequencies

First, scan sequentially through the corpus, recording events and their frequencies in a hash table, a database, or a search engine index. Second, use the resulting data structure to generate the frequency matrix, with a sparse matrix representation
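A minimal sketch of the two passes, assuming the events are (term, document) occurrences and using a plain dict as the hash table. The nested-dict sparse layout and the function names are illustrative; a real system might use a database or a sparse-matrix library instead.

```python
from collections import defaultdict

def count_events(documents):
    """Pass 1: scan the corpus once, recording (term, doc) frequencies
    in a hash table."""
    freq = defaultdict(int)
    for j, doc in enumerate(documents):
        for term in doc:
            freq[(term, j)] += 1
    return freq

def to_sparse_rows(freq):
    """Pass 2: turn the event table into a sparse matrix stored as
    {row_index: {col_index: count}}, so zero cells take no memory."""
    vocab = {t: i for i, t in enumerate(sorted({t for t, _ in freq}))}
    rows = defaultdict(dict)
    for (term, j), n in freq.items():
        rows[vocab[term]][j] = n
    return vocab, dict(rows)

# Toy corpus, invented for illustration.
docs = [["ice", "hockey", "ice"], ["hockey", "stick"]]
vocab, X = to_sparse_rows(count_events(docs))
print(vocab)
print(X)
```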

2. adjust the weights of the elements in the matrix

tf-idf (term frequency × inverse document frequency) family of weighting functions

length normalization
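As an illustration, the sketch below applies one member of the tf-idf family, tf × log(N / df), and then scales each document vector to unit length. Both the specific formula and the choice of Euclidean length are assumptions of this example; other variants exist.

```python
import math

def tf_idf(X):
    """Weight a dense term-document count matrix X (rows = terms,
    columns = documents) by tf * log(N / df)."""
    n_docs = len(X[0])
    weighted = []
    for row in X:
        df = sum(1 for f in row if f > 0)        # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([f * idf for f in row])
    return weighted

def length_normalize(X):
    """Scale every column (document vector) to unit Euclidean length.
    All-zero columns are left as zeros."""
    n_docs = len(X[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_docs)]
    return [[row[j] / norms[j] if norms[j] else 0.0 for j in range(n_docs)]
            for row in X]

# Hypothetical counts.
X = [
    [2, 1, 1],   # "the": appears in every document
    [1, 0, 0],   # "mason"
    [0, 3, 0],   # "carpenter"
]
W = length_normalize(tf_idf(X))
for row in W:
    print(row)
```

A term that occurs in every document gets idf = log(1) = 0 and so carries no weight, while length normalization stops documents with large raw counts from dominating similarity scores.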

Term weighting

Pointwise Mutual Information (PMI); a known problem is that PMI is biased towards infrequent events
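The sketch below computes PMI as log(p(w, c) / (p(w) p(c))) and clamps negative values to zero (the variant usually called positive PMI, assumed here). The counts are invented so that the rare-event effect is easy to see.

```python
import math

def ppmi(F):
    """Positive PMI for a dense word-context count matrix F.
    pmi_ij = log(p_ij / (p_i * p_j)); cells with zero count or
    negative PMI are set to zero."""
    total = sum(sum(row) for row in F)
    row_sums = [sum(row) for row in F]
    col_sums = [sum(col) for col in zip(*F)]
    out = []
    for i, row in enumerate(F):
        out_row = []
        for j, f in enumerate(row):
            if f == 0:
                out_row.append(0.0)
                continue
            pmi = math.log(f * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(pmi, 0.0))
        out.append(out_row)
    return out

# Hypothetical counts: two frequent words and one word seen only once.
F = [
    [1000, 1000, 0],
    [1000, 1000, 0],
    [0,    0,    1],
]
P = ppmi(F)
print(P[0][0])   # frequent co-occurrence: close to zero
print(P[2][2])   # single rare co-occurrence: large
```

The single co-occurrence in the last row receives a far higher score than the cells backed by a thousand observations.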

3. smooth the matrix to reduce the amount of random noise and to fill in some of the zero elements in a sparse matrix

Singular Value Decomposition (SVD)

Truncated SVD can be viewed in four ways: latent meaning, noise reduction, high-order co-occurrence, and sparsity reduction
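A small sketch of smoothing by truncated SVD, assuming NumPy is available. The matrix is a made-up example and `truncated_svd` is an illustrative helper; the rank k would normally be tuned on the task at hand.

```python
import numpy as np

def truncated_svd(X, k):
    """Return the rank-k approximation of X obtained by keeping only the
    k largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical term-document counts (rows = terms, columns = documents).
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])
X_hat = truncated_svd(X, k=2)
print(np.round(X_hat, 3))
```

In the smoothed matrix the cell at row 1, column 1 is no longer zero: the second term is credited with some weight in the second document because it co-occurs with a term that appears there. That is the sparsity-reduction view in miniature.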

 

Optimizations and parallelization for similarity computing

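sparse-matrix multiplication: the similarity computation decomposes into three parts: one that depends only on the nonzero values of X, one that depends only on the nonzero values of Y, and one that depends on the coordinates where both X and Y are nonzero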
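For cosine similarity (assumed here as the measure), the decomposition looks like this: each vector's norm depends only on its own nonzero values and is computed once, and the dot product is accumulated only over coordinates where both vectors are nonzero. The inverted index used to find those shared coordinates is one possible implementation, not the only one.

```python
import math
from collections import defaultdict

def all_pairs_cosine(rows):
    """Cosine between every pair of sparse row vectors ({col: weight})
    that share at least one nonzero column."""
    # Parts 1 and 2: per-row norms, each from that row's nonzero values only.
    norms = {r: math.sqrt(sum(w * w for w in vec.values()))
             for r, vec in rows.items()}

    # Inverted index: column -> [(row, weight), ...], nonzero weights only.
    index = defaultdict(list)
    for r, vec in rows.items():
        for c, w in vec.items():
            if w:
                index[c].append((r, w))

    # Part 3: dot products over shared nonzero columns.
    dots = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (r1, w1), (r2, w2) = postings[a], postings[b]
                dots[(r1, r2)] += w1 * w2

    return {pair: d / (norms[pair[0]] * norms[pair[1]])
            for pair, d in dots.items()}

# Hypothetical sparse rows.
rows = {
    "cat": {"drinks": 3, "purrs": 4},
    "dog": {"drinks": 3, "barks": 4},
    "car": {"drives": 5},
}
sims = all_pairs_cosine(rows)
print(sims)
```

Pairs with no shared column, such as "cat" and "car", are never visited, which is where the savings come from on very sparse matrices.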

distributed processing: MapReduce, Hadoop

randomized algorithm: dimension reduction
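The notes do not name a specific randomized algorithm. As one illustrative possibility, the sketch below uses Gaussian random projection, which maps high-dimensional vectors to k dimensions while roughly preserving distances; the function names, the seed, and the dimensions are assumptions of the example.

```python
import random

def random_projection_matrix(n_dims, k, seed=0):
    """k x n_dims matrix of independent Gaussian entries with
    variance 1/k."""
    rng = random.Random(seed)
    scale = 1.0 / (k ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_dims)]
            for _ in range(k)]

def project(R, vec):
    """Map a sparse vector {dim_index: weight} to a dense k-dim list.
    Only the nonzero coordinates of vec are touched."""
    return [sum(row[i] * w for i, w in vec.items()) for row in R]

R = random_projection_matrix(n_dims=1000, k=50)
u = {3: 1.0, 250: 2.0, 999: 1.0}     # hypothetical sparse vector
low = project(R, u)
print(len(low))
```

Unlike SVD, the projection matrix does not depend on the data, so it can be generated once and applied to each vector independently.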

machine learning

posted on 2016-10-17 21:06  Beginnerpatienceless