CS100.1x-lab3_text_analysis_and_entity_resolution_student
This assignment is called Text Analysis and Entity Resolution, and it is considerably harder than the previous ones. The corresponding ipynb file can be found on my GitHub.
Entity resolution is an important and difficult problem in data cleaning and data integration. In this assignment we apply Apache Spark and text-analysis techniques to it. Entity resolution means identifying records from different data sources that refer to the same real-world entity, a step that is required whenever data sources are merged.
The data for this assignment comes from the metric-learning project; the main files are:
- Google.csv, the Google Products dataset
- Amazon.csv, the Amazon dataset
- Google_small.csv, 200 records sampled from the Google data
- Amazon_small.csv, 200 records sampled from the Amazon data
- Amazon_Google_perfectMapping.csv, the "gold standard" mapping
- stopwords.txt, a list of common English words
In addition to the full datasets, the assignment provides small samples for Part 1 and a "gold standard" table that maps every matching Google and Amazon entity; this table is used to evaluate the algorithm's performance.
Part 0 Preliminaries
Next we read the Google and Amazon data and turn them into RDDs. The two datasets have the following formats.
The file format of an Amazon line is:
"id","title","description","manufacturer","price"
The file format of a Google line is:
"id","name","description","manufacturer","price"
In this step we extract the ID column. In the Google dataset the ID is a URL; in the Amazon dataset it is an alphanumeric string. We turn each dataset into a pair RDD where the ID is the key and the concatenation of name/title, description, and manufacturer is the value.
import re
DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'
def removeQuotes(s):
""" Remove quotation marks from an input string
Args:
s (str): input string that might have the quote "" characters
Returns:
str: a string without the quote characters
"""
return ''.join(i for i in s if i!='"')
def parseDatafileLine(datafileLine):
""" Parse a line of the data file using the specified regular expression pattern
Args:
datafileLine (str): input string that is a line from the data file
Returns:
str: a string parsed using the given regular expression and without the quote characters
"""
match = re.search(DATAFILE_PATTERN, datafileLine)
if match is None:
print 'Invalid datafile line: %s' % datafileLine
return (datafileLine, -1)
elif match.group(1) == '"id"':
print 'Header datafile line: %s' % datafileLine
return (datafileLine, 0)
else:
product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
return ((removeQuotes(match.group(1)), product), 1)
import sys
import os
from test_helper import Test
baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab3')
GOOGLE_PATH = 'Google.csv'
GOOGLE_SMALL_PATH = 'Google_small.csv'
AMAZON_PATH = 'Amazon.csv'
AMAZON_SMALL_PATH = 'Amazon_small.csv'
GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
STOPWORDS_PATH = 'stopwords.txt'
def parseData(filename):
""" Parse a data file
Args:
filename (str): input file name of the data file
Returns:
RDD: a RDD of parsed lines
"""
return (sc
.textFile(filename, 4, 0)
.map(parseDatafileLine)
.cache())
def loadData(path):
""" Load a data file
Args:
path (str): input file name of the data file
Returns:
RDD: a RDD of parsed valid lines
"""
filename = os.path.join(baseDir, inputPath, path)
raw = parseData(filename).cache()
failed = (raw
.filter(lambda s: s[1] == -1)
.map(lambda s: s[0]))
for line in failed.take(10):
print '%s - Invalid datafile line: %s' % (path, line)
valid = (raw
.filter(lambda s: s[1] == 1)
.map(lambda s: s[0])
.cache())
print '%s - Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (path,
raw.count(),
valid.count(),
failed.count())
assert failed.count() == 0
assert raw.count() == (valid.count() + 1)
return valid
googleSmall = loadData(GOOGLE_SMALL_PATH)
google = loadData(GOOGLE_PATH)
amazonSmall = loadData(AMAZON_SMALL_PATH)
amazon = loadData(AMAZON_PATH)
Running this code produces the following output.
Google_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Google.csv - Read 3227 lines, successfully parsed 3226 lines, failed to parse 0 lines
Amazon_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Amazon.csv - Read 1364 lines, successfully parsed 1363 lines, failed to parse 0 lines
Let's run the following to see what the data looks like.
for line in googleSmall.take(3):
print 'google: %s: %s\n' % (line[0], line[1])
for line in amazonSmall.take(3):
print 'amazon: %s: %s\n' % (line[0], line[1])
google: http://www.google.com/base/feeds/snippets/11448761432933644608: spanish vocabulary builder "expand your vocabulary! contains fun lessons that both teach and entertain you'll quickly find yourself mastering new terms. includes games and more!"
google: http://www.google.com/base/feeds/snippets/8175198959985911471: topics presents: museums of world "5 cd-rom set. step behind the velvet rope to examine some of the most treasured collections of antiquities art and inventions. includes the following the louvre - virtual visit 25 rooms in full screen interactive video detailed map of the louvre ..."
google: http://www.google.com/base/feeds/snippets/18445827127704822533: sierrahome hse hallmark card studio special edition win 98 me 2000 xp "hallmark card studio special edition (win 98 me 2000 xp)" "sierrahome"
amazon: b000jz4hqo: clickart 950 000 - premier image pack (dvd-rom) "broderbund"
amazon: b0006zf55o: ca international - arcserve lap/desktop oem 30pk "oem arcserve backup v11.1 win 30u for laptops and desktops" "computer associates"
amazon: b00004tkvy: noah's ark activity center (jewel case ages 3-8) "victory multimedia"
Part 1 ER as Text Similarity - Bags of Words
When resolving entities we often treat every record as a single string and compute similarities between the strings. Here we use the bag-of-words model, a simple and effective technique in text analysis. The idea is to treat a document as an unordered collection of words, or tokens: a token is the smallest unit left after splitting up the document, and it can be a word, a number, an acronym, and so on.
To compare two documents for similarity we look at how many tokens they have in common, and to search documents by keyword we simply check whether a converted document contains that key. A nice property of this approach is that it is fairly robust to word order and punctuation.
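As a toy illustration (my own, not part of the lab), two tiny "documents" can be compared just by intersecting their token sets:
doc1 = set('the quick brown fox'.split())
doc2 = set('the lazy brown dog'.split())
print doc1 & doc2       # the two documents share the tokens 'the' and 'brown'
print len(doc1 & doc2)  # 2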
Tokenize a String
The graded part of the assignment starts here: wherever a comment contains TODO, we are expected to fill in the implementation. The first function to implement converts a string into a list of tokens; note that all tokens must be converted to lowercase.
# TODO: Replace <FILL IN> with appropriate code
quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
split_regex = r'\W+'
def simpleTokenize(string):
""" A simple implementation of input string tokenization
Args:
string (str): input string
Returns:
list: a list of tokens
"""
return [x for x in filter(lambda x:len(x) > 0, re.split(split_regex,string.lower()))]
print simpleTokenize(quickbrownfox) # Should give ['a', 'quick', 'brown', ... ]
This one is a bit tricky, so a short explanation. filter(function, sequence) applies function(item) to each item in sequence and returns the items for which the result is True, as a list/string/tuple depending on the type of sequence (in Python 2). Here the function is lambda x: len(x) > 0 and the sequence is re.split(split_regex, string.lower()). Also note that re.split behaves differently from str.split:
>>> 'hello, world'.split(',')
['hello', ' world']
>>> re.split(r'\W+', 'hello, world')
['hello', 'world']
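And, for completeness, the filter step on its own (in Python 2, filter over a list returns a list):
>>> filter(lambda x: len(x) > 0, ['a', '', 'quick', ''])
['a', 'quick']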
Removing stopwords
In English, stopwords are words that add little to the meaning of a sentence, such as "the", "a", "is", and "to"; in a bag-of-words model they are noise. Because they are so common, two unrelated sentences might be judged similar simply because they share many stopwords. The lab provides a stopwords file; we load it, convert it to a set, and then test membership with in.
# TODO: Replace <FILL IN> with appropriate code
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords
def tokenize(string):
""" An implementation of input string tokenization that excludes stopwords
Args:
string (str): input string
Returns:
list: a list of tokens without stopwords
"""
return [x for x in simpleTokenize(string) if x not in stopwords]
print tokenize(quickbrownfox) # Should give ['quick', 'brown', ... ]
Tokenizing the small datasets
This step applies the tokenizer to the small datasets. To count all the tokens in a dataset, we simply add up the number of tokens in each record.
# TODO: Replace <FILL IN> with appropriate code
amazonRecToToken = amazonSmall.map(lambda (a,b): (a,tokenize(b)))
googleRecToToken = googleSmall.map(lambda (a,b): (a,tokenize(b)))
def countTokens(vendorRDD):
""" Count and return the number of tokens
Args:
vendorRDD (RDD of (recordId, tokenizedValue)): Pair tuple of record ID to tokenized output
Returns:
count: count of all tokens
"""
return vendorRDD.map(lambda x: len(x[1])).reduce(lambda x,y: x+y)
totalTokens = countTokens(amazonRecToToken) + countTokens(googleRecToToken)
print 'There are %s tokens in the combined datasets' % totalTokens
Amazon record with the most tokens
Sorting again, only this time we order the records by the length of their token lists, from largest to smallest.
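As an aside, takeOrdered with a negated key is the usual trick for "largest first"; a minimal sketch on made-up data:
print sc.parallelize([[1], [1, 2, 3], [1, 2]]).takeOrdered(1, key=lambda x: -len(x))  # [[1, 2, 3]]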
# TODO: Replace <FILL IN> with appropriate code
def findBiggestRecord(vendorRDD):
""" Find and return the record with the largest number of tokens
Args:
vendorRDD (RDD of (recordId, tokens)): input Pair Tuple of record ID and tokens
Returns:
list: a list of 1 Pair Tuple of record ID and tokens
"""
return vendorRDD.takeOrdered(1, key=lambda x: -1*len(x[1]))
biggestRecordAmazon = findBiggestRecord(amazonRecToToken)
print 'The Amazon record with ID "%s" has the most tokens (%s)' % (biggestRecordAmazon[0][0], len(biggestRecordAmazon[0][1]))
Part 2: ER as Text Similarity - Weighted Bag-of-Words using TF-IDF
A plain bag of words does not work very well in practice, because different words carry different amounts of meaning within a document; mathematically speaking, they deserve different weights, and raw frequency alone is a poor weight. That is what the TF-IDF scheme addresses. For a very accessible introduction (in Chinese), see Ruan Yifeng's article TF-IDF与余弦相似性的应用.
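Before implementing it, here is the arithmetic on a made-up two-record corpus, using the same definitions this lab uses (TF = token count / total tokens in the record; IDF = N / number of records containing the token, with no logarithm):
N = 2.0                         # toy corpus: ['adobe', 'photoshop', 'cs3'] and ['adobe', 'acrobat', 'pro']
tf_adobe = 1.0 / 3              # 'adobe' is 1 of the 3 tokens in the first record
idf_adobe = N / 2               # 'adobe' appears in both records, so IDF = 1.0
idf_photoshop = N / 1           # 'photoshop' appears in only one record, so IDF = 2.0
print tf_adobe * idf_adobe      # ~0.333, TF-IDF weight of the common token
print tf_adobe * idf_photoshop  # ~0.667, the rarer token gets twice the weight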
Implement a TF function
Here we implement the TF function. The input is a list of strings. We count how many times each token occurs and the total number of tokens, then divide each token's count by the total; that is the TF part of TF-IDF.
# TODO: Replace <FILL IN> with appropriate code
def tf(tokens):
""" Compute TF
Args:
tokens (list of str): input list of tokens from tokenize
Returns:
dictionary: a dictionary of tokens to its TF values
"""
dic = {}
count = 0
for word in tokens:
if(word in dic):
dic[word] += 1
else:
dic[word] = 1
count += 1
for key in dic:
dic[key] = float(dic[key])/count
return dic
print tf(tokenize(quickbrownfox)) # Should give { 'quick': 0.1666 ... }
Create a corpus
Here we just merge the two RDDs into one corpus; union() is all we need.
# TODO: Replace <FILL IN> with appropriate code
corpusRDD = amazonRecToToken.union(googleRecToToken)
Implement an IDFs function
IDF needs the whole corpus, which is why the previous step combined the datasets. In this lab IDF(t) = N / n(t), where N is the total number of records and n(t) is the number of records that contain token t (note there is no logarithm); you need this definition clearly in mind to get the implementation right.
# TODO: Replace <FILL IN> with appropriate code
def idfs(corpus):
""" Compute IDF
Args:
corpus (RDD): input corpus
Returns:
RDD: a RDD of (token, IDF value)
"""
N = corpus.count()
uniqueTokens = corpus.map(lambda x:(x[0],list(set(x[1]))))
tokenCountPairTuple = uniqueTokens.flatMap(lambda x:x[1]).map(lambda x: (x,1))
tokenSumPairTuple = tokenCountPairTuple.reduceByKey(lambda a,b : a+b)
return (tokenSumPairTuple.map(lambda x:(x[0],N/float(x[1]))))
idfsSmall = idfs(amazonRecToToken.union(googleRecToToken))
uniqueTokenCount = idfsSmall.count()
print 'There are %s unique tokens in the small datasets.' % uniqueTokenCount
Tokens with the smallest IDF
smallIDFTokens = idfsSmall.takeOrdered(11, lambda s: s[1])
print smallIDFTokens
IDF Histogram
import matplotlib.pyplot as plt
small_idf_values = idfsSmall.map(lambda s: s[1]).collect()
fig = plt.figure(figsize=(8,3))
plt.hist(small_idf_values, 50, log=True)
pass
Implement a TF-IDF function
This step combines the previous ones: multiply each token's TF by its IDF.
# TODO: Replace <FILL IN> with appropriate code
def tfidf(tokens, idfs):
""" Compute TF-IDF
Args:
tokens (list of str): input list of tokens from tokenize
idfs (dictionary): record to IDF value
Returns:
        dictionary: a dictionary of tokens to its TF-IDF values
"""
tfs = tf(tokens)
tfIdfDict = {t:tfs[t]*idfs[t] for t in tfs}
return tfIdfDict
recb000hkgj8k = amazonRecToToken.filter(lambda x: x[0] == 'b000hkgj8k').collect()[0][1]
idfsSmallWeights = idfsSmall.collectAsMap()
rec_b000hkgj8k_weights = tfidf(recb000hkgj8k, idfsSmallWeights)
print 'Amazon record "b000hkgj8k" has tokens and weights:\n%s' % rec_b000hkgj8k_weights
Part 3 ER as Text Similarity - Cosine Similarity
For the background on cosine similarity, again see Ruan Yifeng's article TF-IDF与余弦相似性的应用. In short, cossim(a, b) = dotprod(a, b) / (norm(a) * norm(b)): the dot product of the two weight vectors divided by the product of their norms.
Implement the components of a cosineSimilarity function
The implementation has three steps: compute the dot product of two vectors; compute the norm (length) of a vector; and combine the two.
# TODO: Replace <FILL IN> with appropriate code
import math
def dotprod(a, b):
""" Compute dot product
Args:
a (dictionary): first dictionary of record to value
b (dictionary): second dictionary of record to value
Returns:
dotProd: result of the dot product with the two input dictionaries
"""
return sum(a[k]*b[k]for k in a if k in b)
def norm(a):
""" Compute square root of the dot product
Args:
a (dictionary): a dictionary of record to value
Returns:
        norm: a float, the square root of the dot product of a with itself
"""
count=0
for key in a:
count += a[key]*a[key]
return math.sqrt(count)
def cossim(a, b):
""" Compute cosine similarity
Args:
a (dictionary): first dictionary of record to value
b (dictionary): second dictionary of record to value
Returns:
cossim: dot product of two dictionaries divided by the norm of the first dictionary and
then by the norm of the second dictionary
"""
return dotprod(a,b)/(norm(a)*norm(b))
testVec1 = {'foo': 2, 'bar': 3, 'baz': 5 }
testVec2 = {'foo': 1, 'bar': 0, 'baz': 20 }
dp = dotprod(testVec1, testVec2)
nm = norm(testVec1)
print dp, nm
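As a quick sanity check on the test vectors above (plain arithmetic, not from the autograder):
# dotprod(testVec1, testVec2) = 2*1 + 3*0 + 5*20 = 102
# norm(testVec1) = sqrt(2**2 + 3**2 + 5**2) = sqrt(38) ~= 6.164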
Implement a cosineSimilarity function
# TODO: Replace <FILL IN> with appropriate code
def cosineSimilarity(string1, string2, idfsDictionary):
""" Compute cosine similarity between two strings
Args:
string1 (str): first string
string2 (str): second string
idfsDictionary (dictionary): a dictionary of IDF values
Returns:
cossim: cosine similarity value
"""
w1 = tfidf(tokenize(string1),idfsDictionary)
w2 = tfidf(tokenize(string2),idfsDictionary)
return cossim(w1, w2)
cossimAdobe = cosineSimilarity('Adobe Photoshop',
'Adobe Illustrator',
idfsSmallWeights)
print cossimAdobe
Perform Entity Resolution
Here we compute the similarity between every record in the Google data and every record in the Amazon data, and store the results with (Google URL, Amazon ID) as the key and the cosine similarity as the value. We will do this in two ways; the first does not use a broadcast variable.
There are three steps: 1. build all the candidate pairs, in the format [ ((Google URL1, Google String1), (Amazon ID1, Amazon String1)), ((Google URL1, Google String1), (Amazon ID2, Amazon String2)), ((Google URL2, Google String2), (Amazon ID1, Amazon String1)), ... ]; 2. write a function that computes the cosine similarity for one such pair; 3. apply that function to the RDD.
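The candidate pairs in step 1 come from cartesian(), which pairs every element of one RDD with every element of the other; a minimal sketch on made-up data:
pairs = sc.parallelize([1, 2]).cartesian(sc.parallelize(['a', 'b']))
print pairs.collect()  # four pairs: (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b') (order may vary)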
# TODO: Replace <FILL IN> with appropriate code
crossSmall = (googleSmall
.cartesian(amazonSmall)
.cache())
def computeSimilarity(record):
""" Compute similarity on a combination record
Args:
record: a pair, (google record, amazon record)
Returns:
pair: a pair, (google URL, amazon ID, cosine similarity value)
"""
googleRec = record[0]
amazonRec = record[1]
googleURL = googleRec[0]
amazonID = amazonRec[0]
googleValue = googleRec[1]
amazonValue = amazonRec[1]
cs = cosineSimilarity(googleValue,amazonValue,idfsSmallWeights)
return (googleURL, amazonID, cs)
similarities = (crossSmall
.map(lambda line:computeSimilarity(line))
.cache())
def similar(amazonID, googleURL):
""" Return similarity value
Args:
amazonID: amazon ID
googleURL: google URL
Returns:
similar: cosine similarity value
"""
return (similarities
.filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
.collect()[0][2])
similarityAmazonGoogle = similar('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogle
Perform Entity Resolution with Broadcast Variables
The previous step is fine for the small datasets, but when the data grows, Spark has to ship the precomputed IDF weights to the workers with every task. If we had not cached the similarities, Spark might also recompute them, shipping the IDF weights over the network again and again.
Broadcast variables solve this problem: the value is sent to each worker only once. The code is almost identical to the previous step.
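The broadcast pattern itself is tiny: create the variable once on the driver and read it through .value inside tasks. A minimal sketch with a hypothetical lookup table:
lookup = sc.broadcast({'a': 1.0, 'b': 2.0})  # shipped to each worker only once
print sc.parallelize(['a', 'b', 'c']).map(lambda k: lookup.value.get(k, 0.0)).collect()  # [1.0, 2.0, 0.0]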
# TODO: Replace <FILL IN> with appropriate code
def computeSimilarityBroadcast(record):
""" Compute similarity on a combination record, using Broadcast variable
Args:
record: a pair, (google record, amazon record)
Returns:
pair: a pair, (google URL, amazon ID, cosine similarity value)
"""
googleRec = record[0]
amazonRec = record[1]
googleURL = googleRec[0]
amazonID = amazonRec[0]
googleValue = googleRec[1]
amazonValue = amazonRec[1]
cs = cosineSimilarity(googleValue,amazonValue,idfsSmallBroadcast.value)
return (googleURL, amazonID, cs)
idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)
similaritiesBroadcast = (crossSmall
.map(lambda record:computeSimilarityBroadcast(record))
.cache())
def similarBroadcast(amazonID, googleURL):
""" Return similarity value, computed using Broadcast variable
Args:
amazonID: amazon ID
googleURL: google URL
Returns:
similar: cosine similarity value
"""
return (similaritiesBroadcast
.filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
.collect()[0][2])
similarityAmazonGoogleBroadcast = similarBroadcast('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogleBroadcast
Perform a Gold Standard evaluation
Next we use the gold standard data to answer a few questions. First we load and parse it.
GOLDFILE_PATTERN = '^(.+),(.+)'
# Parse each line of a data file using the specified regular expression pattern
def parse_goldfile_line(goldfile_line):
""" Parse a line from the 'golden standard' data file
Args:
goldfile_line: a line of data
Returns:
pair: ((key, 'gold', 1 if successful or else 0))
"""
match = re.search(GOLDFILE_PATTERN, goldfile_line)
if match is None:
print 'Invalid goldfile line: %s' % goldfile_line
return (goldfile_line, -1)
elif match.group(1) == '"idAmazon"':
print 'Header datafile line: %s' % goldfile_line
return (goldfile_line, 0)
else:
key = '%s %s' % (removeQuotes(match.group(1)), removeQuotes(match.group(2)))
return ((key, 'gold'), 1)
goldfile = os.path.join(baseDir, inputPath, GOLD_STANDARD_PATH)
gsRaw = (sc
.textFile(goldfile)
.map(parse_goldfile_line)
.cache())
gsFailed = (gsRaw
.filter(lambda s: s[1] == -1)
.map(lambda s: s[0]))
for line in gsFailed.take(10):
print 'Invalid goldfile line: %s' % line
goldStandard = (gsRaw
.filter(lambda s: s[1] == 1)
.map(lambda s: s[0])
.cache())
print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (gsRaw.count(),
goldStandard.count(),
gsFailed.count())
assert (gsFailed.count() == 0)
assert (gsRaw.count() == (goldStandard.count() + 1))
Next we join() the similarity RDD computed earlier with the gold standard RDD, count how many pairs appear in both (the true duplicates), and compute their average similarity; then we compute the average similarity of the pairs that are not in the gold standard. The leftOuterJoin used for that second part behaves as in the sketch below.
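A hypothetical toy example of leftOuterJoin: keys with no match on the right-hand side get None, which is exactly what the code filters on:
left = sc.parallelize([('a', 0.9), ('b', 0.1)])
right = sc.parallelize([('a', 'gold')])
print left.leftOuterJoin(right).collect()  # [('a', (0.9, 'gold')), ('b', (0.1, None))] (order may vary)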
# TODO: Replace <FILL IN> with appropriate code
sims = similaritiesBroadcast.map(lambda line:('%s %s' %(line[1],line[0]),line[2]))
trueDupsRDD = (sims.join(goldStandard))
trueDupsCount = trueDupsRDD.count()
avgSimDups = trueDupsRDD.map(lambda (k,v):v[0]).mean()
nonDupsRDD = (sims
.leftOuterJoin(goldStandard).map(lambda (k,v):v[0] if v[1] is None else -1)).filter(lambda v:v!=-1)
avgSimNon = nonDupsRDD.mean()
print 'There are %s true duplicates.' % trueDupsCount
print 'The average similarity of true duplicates is %s.' % avgSimDups
print 'And for non duplicates, it is %s.' % avgSimNon
Part 4 Scalable ER
The approach above is not a truly scalable distributed implementation, so its running time grows quickly with the data. In this part we switch to an algorithm that is better suited to distributed computation.
Most of the cost above goes into comparing tokens between every pair of records. To avoid this quadratically growing number of token comparisons we use a data structure called an inverted index, which maps the datasets from records to tokens: the key is a token and the value is a record that contains that token.
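As a toy example on made-up data, building an inverted index is a single flatMap:
docs = sc.parallelize([('doc1', ['spark', 'rdd']), ('doc2', ['spark'])])
inv = docs.flatMap(lambda (i, toks): [(t, i) for t in toks])
print inv.collect()  # [('spark', 'doc1'), ('rdd', 'doc1'), ('spark', 'doc2')]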
Tokenize the full dataset
# TODO: Replace <FILL IN> with appropriate code
amazonFullRecToToken = amazon.map(lambda (k,v):(k,tokenize(v)))
googleFullRecToToken = google.map(lambda (k,v):(k,tokenize(v)))
print 'Amazon full dataset is %s products, Google full dataset is %s products' % (amazonFullRecToToken.count(),
googleFullRecToToken.count())
Compute IDFs and TF-IDFs for the full datasets
This reuses earlier code. We union the two new RDDs, run the IDF computation on the combined corpus, and turn the result into a broadcast variable.
# TODO: Replace <FILL IN> with appropriate code
fullCorpusRDD = amazonFullRecToToken.union(googleFullRecToToken)
idfsFull = idfs(fullCorpusRDD)
idfsFullCount = idfsFull.count()
print 'There are %s unique tokens in the full datasets.' % idfsFullCount
# Recompute IDFs for full dataset
idfsFullWeights = idfsFull.collectAsMap()
idfsFullBroadcast = sc.broadcast(idfsFullWeights)
# Pre-compute TF-IDF weights. Build mappings from record ID weight vector.
amazonWeightsRDD = amazonFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value)))
googleWeightsRDD = googleFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value)))
print 'There are %s Amazon weights and %s Google weights.' % (amazonWeightsRDD.count(),
googleWeightsRDD.count())
Compute Norms for the weights from the full datasets
# TODO: Replace <FILL IN> with appropriate code
amazonNorms = amazonWeightsRDD.map(lambda (k,d):(k,norm(d)))
amazonNormsBroadcast = sc.broadcast(amazonNorms.collectAsMap())
googleNorms = googleWeightsRDD.map(lambda (k,d):(k,norm(d)))
googleNormsBroadcast = sc.broadcast(googleNorms.collectAsMap())
Create inverted indices from the full datasets
Two steps here: implement an invert function that takes an (ID, weighted token vector) record and returns the list of (token, ID) pairs; then flatMap it over the weight RDDs above to get the mapping from each token to the records that contain it.
# TODO: Replace <FILL IN> with appropriate code
def invert(record):
""" Invert (ID, tokens) to a list of (token, ID)
Args:
record: a pair, (ID, token vector)
Returns:
pairs: a list of pairs of token to ID
"""
ID, tokenvector = record
pairs = [(k,ID) for k in tokenvector]
return (pairs)
amazonInvPairsRDD = (amazonWeightsRDD
.flatMap(invert)
.cache())
googleInvPairsRDD = (googleWeightsRDD
.flatMap(invert)
.cache())
print 'There are %s Amazon inverted pairs and %s Google inverted pairs.' % (amazonInvPairsRDD.count(),
googleInvPairsRDD.count())
Identify common tokens from the full dataset
Here we join the Amazon and Google inverted-pair RDDs on their token keys, swap each result into the form ((Amazon ID, Google URL), token), and group by key, so that every (Amazon ID, Google URL) pair is associated with the tokens the two records share.
# TODO: Replace <FILL IN> with appropriate code
def swap(record):
""" Swap (token, (ID, URL)) to ((ID, URL), token)
Args:
record: a pair, (token, (ID, URL))
Returns:
pair: ((ID, URL), token)
"""
token = record[0]
keys = record[1]
return (keys, token)
commonTokens = (amazonInvPairsRDD
.join(googleInvPairsRDD)
.map(swap)
.groupByKey()
.cache())
print 'Found %d common tokens' % commonTokens.count()
Compute cosine similarities for the full dataset
This is the last step: combine the amazonWeightsRDD and googleWeightsRDD computed earlier with the common-token RDD above to compute the cosine similarity of every candidate pair.
# TODO: Replace <FILL IN> with appropriate code
amazonWeightsBroadcast = sc.broadcast(amazonWeightsRDD.collectAsMap())
googleWeightsBroadcast = sc.broadcast(googleWeightsRDD.collectAsMap())
def fastCosineSimilarity(record):
""" Compute Cosine Similarity using Broadcast variables
Args:
record: ((ID, URL), token)
Returns:
pair: ((ID, URL), cosine similarity value)
"""
amazonRec = record[0][0]
googleRec = record[0][1]
tokens = record[1]
value = sum((amazonWeightsBroadcast.value[amazonRec][t])*(googleWeightsBroadcast.value[googleRec][t])\
for t in tokens if t in amazonWeightsBroadcast.value[amazonRec] and t in googleWeightsBroadcast.value[googleRec])\
/((amazonNormsBroadcast.value[amazonRec])*(googleNormsBroadcast.value[googleRec]))
key = (amazonRec, googleRec)
return (key, value)
similaritiesFullRDD = (commonTokens
.map(fastCosineSimilarity)
.cache())
print similaritiesFullRDD.count()
Part 5 Analysis
That concludes the computation; now we evaluate the results. We need to choose a similarity threshold above which two records are declared to be the same entity. We judge a threshold by its precision and recall, and, as usual, summarize the trade-off with the F-score (F-measure).
Counting True Positives, False Positives, and False Negatives
# Create an RDD of ((Amazon ID, Google URL), similarity score)
simsFullRDD = similaritiesFullRDD.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))
assert (simsFullRDD.count() == 2441100)
# Create an RDD of just the similarity scores
simsFullValuesRDD = (simsFullRDD
.map(lambda x: x[1])
.cache())
assert (simsFullValuesRDD.count() == 2441100)
# Look up all similarity scores for true duplicates
# This helper function will return the similarity score for records that are in the gold standard and the simsFullRDD (True positives), and will return 0 for records that are in the gold standard but not in simsFullRDD (False Negatives).
def gs_value(record):
if (record[1][1] is None):
return 0
else:
return record[1][1]
# Join the gold standard and simsFullRDD, and then extract the similarities scores using the helper function
trueDupSimsRDD = (goldStandard
.leftOuterJoin(simsFullRDD)
.map(gs_value)
.cache())
print 'There are %s true duplicates.' % trueDupSimsRDD.count()
assert(trueDupSimsRDD.count() == 1300)
To pick a good threshold we implement the counting functions with Spark accumulators; this is the first time accumulators appear in these labs.
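Before the custom vector accumulator below, here is the basic pattern with a plain scalar accumulator (a minimal sketch, not part of the lab): workers call add(), and only the driver reads .value.
counter = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: counter.add(x))
print counter.value  # 10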
from pyspark.accumulators import AccumulatorParam
class VectorAccumulatorParam(AccumulatorParam):
# Initialize the VectorAccumulator to 0
def zero(self, value):
return [0] * len(value)
# Add two VectorAccumulator variables
def addInPlace(self, val1, val2):
for i in xrange(len(val1)):
val1[i] += val2[i]
return val1
# Return a list with entry x set to value and all other entries set to 0
def set_bit(x, value, length):
bits = []
for y in xrange(length):
if (x == y):
bits.append(value)
else:
bits.append(0)
return bits
# Pre-bin counts of false positives for different threshold ranges
BINS = 101
nthresholds = 100
def bin(similarity):
return int(similarity * nthresholds)
# fpCounts[i] = number of entries (possible false positives) where bin(similarity) == i
zeros = [0] * BINS
fpCounts = sc.accumulator(zeros, VectorAccumulatorParam())
def add_element(score):
global fpCounts
b = bin(score)
fpCounts += set_bit(b, 1, BINS)
simsFullValuesRDD.foreach(add_element)
# Remove true positives from FP counts
def sub_element(score):
global fpCounts
b = bin(score)
fpCounts += set_bit(b, -1, BINS)
trueDupSimsRDD.foreach(sub_element)
def falsepos(threshold):
fpList = fpCounts.value
return sum([fpList[b] for b in range(0, BINS) if float(b) / nthresholds >= threshold])
def falseneg(threshold):
return trueDupSimsRDD.filter(lambda x: x < threshold).count()
def truepos(threshold):
return trueDupSimsRDD.count() - falsenegDict[threshold]
Precision, Recall, and F-measures
# Precision = true-positives / (true-positives + false-positives)
# Recall = true-positives / (true-positives + false-negatives)
# F-measure = 2 x Recall x Precision / (Recall + Precision)
def precision(threshold):
tp = trueposDict[threshold]
return float(tp) / (tp + falseposDict[threshold])
def recall(threshold):
tp = trueposDict[threshold]
return float(tp) / (tp + falsenegDict[threshold])
def fmeasure(threshold):
r = recall(threshold)
p = precision(threshold)
return 2 * r * p / (r + p)
Line Plots
thresholds = [float(n) / nthresholds for n in range(0, nthresholds)]
falseposDict = dict([(t, falsepos(t)) for t in thresholds])
falsenegDict = dict([(t, falseneg(t)) for t in thresholds])
trueposDict = dict([(t, truepos(t)) for t in thresholds])
precisions = [precision(t) for t in thresholds]
recalls = [recall(t) for t in thresholds]
fmeasures = [fmeasure(t) for t in thresholds]
print precisions[0], fmeasures[0]
assert (abs(precisions[0] - 0.000532546802671) < 0.0000001)
assert (abs(fmeasures[0] - 0.00106452669505) < 0.0000001)
fig = plt.figure()
plt.plot(thresholds, precisions)
plt.plot(thresholds, recalls)
plt.plot(thresholds, fmeasures)
plt.legend(['Precision', 'Recall', 'F-measure'])
pass
With state-of-the-art methods the F-score on this task can reach about 60%, while here we only get about 40%. There are three directions for improvement: 1. use additional features; 2. process the features with better models, such as stemming or n-grams; 3. use a different similarity measure.