Deep Learning for Natural Language Processing: Vector Space Models
Three types of matrices:
term–document
word–context
pair–pattern
Semantics: the meaning of a word, a phrase, a sentence, or any text in human language, and the study of such meaning.
Characteristics:
information is extracted automatically from corpora, saving manual effort
similarity between words, phrases, and documents can be measured
The Term–Document Matrix: row vectors correspond to terms; column vectors correspond to documents.
A bag is a set that may contain duplicate elements. A collection of bags can be represented as a matrix X in which each column x:j corresponds to a bag, each row xi: corresponds to a unique member, and an element xij is the frequency of the i-th member in the j-th bag.
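A minimal sketch of building a term–document matrix in plain Python (the toy corpus is invented for illustration):

```python
from collections import Counter

# Toy corpus: each document is treated as a bag of words.
docs = [
    "the mason cuts the stone",
    "the carpenter cuts the wood",
    "stone and wood are materials",
]

# Rows are unique terms, columns are documents.
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# X[i][j] = frequency of term i in document j.
X = [[0] * len(docs) for _ in vocab]
for j, d in enumerate(docs):
    for w, f in Counter(d.split()).items():
        X[row[w]][j] = f
```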
The Word–Context Matrix
The distributional hypothesis in linguistics states that words that occur in similar contexts tend to have similar meanings.
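A sketch of counting word–context co-occurrences with a symmetric window (the window size and sentence are assumptions for illustration):

```python
from collections import defaultdict

tokens = "the mason cuts the stone and the carpenter cuts the wood".split()
window = 2  # context window on each side

# counts[(word, context_word)] = co-occurrence frequency within the window.
counts = defaultdict(int)
for i, w in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            counts[(w, tokens[j])] += 1
```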
The Pair–Pattern Matrix
Row vectors correspond to pairs of words, such as mason : stone and carpenter : wood; column vectors correspond to the patterns in which the pairs co-occur, such as "X cuts Y" and "X works with Y".
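A toy pair–pattern matrix sketch (the pairs, patterns, and counts are all fabricated to show the layout):

```python
# Rows: word pairs; columns: patterns; entries: co-occurrence frequencies.
pairs = ["mason:stone", "carpenter:wood"]
patterns = ["X cuts Y", "X works with Y"]
counts = {
    ("mason:stone", "X cuts Y"): 5,
    ("mason:stone", "X works with Y"): 2,
    ("carpenter:wood", "X cuts Y"): 4,
    ("carpenter:wood", "X works with Y"): 3,
}
X = [[counts.get((p, q), 0) for q in patterns] for p in pairs]
```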
The extended distributional hypothesis: patterns that co-occur with similar pairs tend to have similar meanings.
The latent relation hypothesis: pairs of words that co-occur in similar patterns tend to have similar semantic relations.
Attributional similarity (word–context): sim_a(a, b) ∈ R
Relational similarity (pair–pattern): sim_r(a : b, c : d) ∈ R
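Both similarity functions are commonly implemented as the cosine between the corresponding row vectors; a minimal sketch (the vectors are made up):

```python
import math

def cosine(x, y):
    # cos(theta) = <x, y> / (|x| * |y|)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Hypothetical row vectors for two words in a word-context matrix.
print(cosine([1, 0, 2, 3], [0, 1, 2, 2]))
```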
A token is a single instance of a symbol, whereas a type is a general class of tokens
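For example (the sentence is illustrative):

```python
tokens = "to be or not to be".split()
print(len(tokens))       # 6 tokens
print(len(set(tokens)))  # 4 types: 'to', 'be', 'or', 'not'
```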
Statistical semantics hypothesis
If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings
Bag of words hypothesis
If documents and pseudodocuments (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings
Distributional hypothesis
If words have similar row vectors in a word–context matrix, then they tend to have similar meanings
Extended distributional hypothesis
If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations
Latent relation hypothesis
If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations
Linguistic Processing for Vector Space Models
1. Tokenize the raw text: decide what constitutes a term and how to extract terms from raw text.
Difficulties include punctuation (e.g., don’t, Jane’s, and/or), hyphenation (e.g., state-of-the-art versus state of the art), and recognizing multi-word terms (e.g., Barack Obama and ice hockey).
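A rough regex-based tokenizer sketch; the pattern is a simplifying assumption and does not solve multi-word term recognition:

```python
import re

# Keep word-internal apostrophes and hyphens attached to the token.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['\-][A-Za-z]+)*")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Don't split state-of-the-art and/or Jane's terms."))
# ["Don't", 'split', 'state-of-the-art', 'and', 'or', "Jane's", 'terms']
```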
2. Normalize the raw text: convert superficially different strings of characters to the same form (e.g., case folding).
3. Annotate the raw text: mark identical strings of characters as being different (e.g., part-of-speech tagging).
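A minimal sketch of both steps; the POS tags are hard-coded stand-ins, not the output of a real tagger:

```python
def normalize(token):
    # Normalization: case folding maps different surface forms together.
    return token.lower()

def annotate(token, pos):
    # Annotation: a POS suffix makes identical strings distinct.
    return f"{normalize(token)}/{pos}"

print(annotate("Fly", "NN"), annotate("fly", "VB"))  # fly/NN fly/VB
```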
Mathematical Processing for Vector Space Models
1. Generate a matrix of frequencies.
First, scan sequentially through the corpus, recording events and their frequencies in a hash table, a database, or a search engine index. Second, use the resulting data structure to generate the frequency matrix, with a sparse matrix representation.
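A sketch of the two passes, using a Python dict as the hash table and a SciPy COO matrix as the sparse representation (the corpus is illustrative):

```python
from collections import defaultdict
from scipy.sparse import coo_matrix

docs = ["the mason cuts the stone", "the carpenter cuts the wood"]

# Pass 1: record (term, document) events and frequencies in a hash table.
freq = defaultdict(int)
for j, d in enumerate(docs):
    for w in d.split():
        freq[(w, j)] += 1

# Pass 2: build the sparse frequency matrix from the recorded events.
vocab = sorted({w for w, _ in freq})
row_id = {w: i for i, w in enumerate(vocab)}
rows = [row_id[w] for w, _ in freq]
cols = [j for _, j in freq]
X = coo_matrix((list(freq.values()), (rows, cols)),
               shape=(len(vocab), len(docs)))
```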
2. Adjust the weights of the elements in the matrix: term weighting.
Common choices include the tf-idf (term frequency × inverse document frequency) family of weighting functions and length normalization.
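A minimal tf-idf sketch with NumPy, using one common variant of the family (idf = log(N / df)) followed by length normalization; the count matrix is made up:

```python
import numpy as np

# Hypothetical term-document count matrix (rows = terms, cols = docs).
X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 1]], dtype=float)

n_docs = X.shape[1]
df = (X > 0).sum(axis=1)       # document frequency of each term
idf = np.log(n_docs / df)      # inverse document frequency
W = X * idf[:, None]           # tf-idf weights

# Length normalization: scale each document (column) to unit length.
W = W / np.linalg.norm(W, axis=0, keepdims=True)
```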
Pointwise Mutual Information (PMI) is another popular weighting scheme; its main problem is a bias toward infrequent events.
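A positive-PMI (PPMI) sketch; clipping at zero is one standard variant, and note that PMI still over-weights rare events unless discounted:

```python
import numpy as np

def ppmi(X):
    # X: word-context count matrix; pmi_ij = log(p_ij / (p_i * p_j)).
    total = X.sum()
    p_ij = X / total
    p_i = X.sum(axis=1, keepdims=True) / total
    p_j = X.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    # Positive PMI: negative, -inf, and NaN entries all become 0.
    return np.where(pmi > 0, pmi, 0.0)

print(ppmi(np.array([[8.0, 2.0], [2.0, 8.0]])))
```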
3. Smooth the matrix to reduce the amount of random noise and to fill in some of the zero elements in a sparse matrix.
Truncated Singular Value Decomposition (SVD) is the standard tool; its benefits can be described as latent meaning, noise reduction, high-order co-occurrence, and sparsity reduction.
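A truncated-SVD smoothing sketch with NumPy; the rank k is a hyperparameter assumed here:

```python
import numpy as np

def smooth(X, k):
    # Keep the k largest singular values: X ~ U_k @ diag(s_k) @ Vt_k.
    # The low-rank reconstruction is dense, filling in zero entries.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

X = np.random.rand(6, 4)
X_smooth = smooth(X, k=2)
```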
Optimizations and Parallelization for Similarity Computation
Sparse-matrix multiplication: the similarity computation is decomposed into three parts, involving the nonzero values of X, the nonzero values of Y, and the entries that are nonzero in both X and Y.
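A sketch of all-pairs cosine similarity through sparse-matrix multiplication; each output entry only involves positions where both row vectors are nonzero (SciPy and scikit-learn assumed available):

```python
from scipy.sparse import random as sparse_random
from sklearn.preprocessing import normalize

# Hypothetical sparse word-context matrix (rows = words).
X = sparse_random(5, 100, density=0.05, format="csr")

# L2-normalize rows so the sparse product X @ X.T yields cosine similarity.
Xn = normalize(X)
S = Xn @ Xn.T  # stays sparse; zero overlaps cost nothing
```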
Distributed processing: MapReduce / Hadoop.
Randomized algorithms: dimensionality reduction (see the sketch below).
Machine learning.
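A random-projection sketch as one example of a randomized dimension-reduction algorithm (the note does not name a specific method; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 5000))   # hypothetical high-dimensional row vectors

k = 100  # target dimensionality
# A Gaussian random projection approximately preserves pairwise
# distances (Johnson-Lindenstrauss lemma).
R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
X_low = X @ R
```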