[Big Data] Week 2: Frequent Itemsets (Advanced)

Question 1

Suppose we perform the PCY algorithm to find frequent pairs, with market-basket data meeting the following specifications:

s, the support threshold, is 10,000.
There are one million items, which are represented by the integers 0,1,...,999999.
There are 250,000 frequent items, that is, items that occur 10,000 times or more.
There are one million pairs that occur 10,000 times or more.
There are P pairs that occur exactly once and consist of 2 frequent items.
No other pairs occur at all.
Integers are always represented by 4 bytes.
When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.

Suppose there are S bytes of main memory. In order to run the PCY algorithm successfully, the number of buckets must be sufficiently large that most buckets are not frequent. In addition, on the second pass, there must be enough room to count all the candidate pairs. As a function of S, what is the largest value of P for which we can successfully run the PCY algorithm on this data? Demonstrate that you have the correct formula by indicating which of the following is a value for S and a value for P that is approximately (i.e., to within 10%) the largest possible value of P for that S.

Your Answer		Score
S = 1,000,000,000; P = 35,000,000,000
S = 1,000,000,000; P = 20,000,000,000	Correct	1.00
S = 100,000,000; P = 540,000,000
S = 200,000,000; P = 400,000,000
Total		1.00 / 1.00

Expected memory required for pass 2:

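Expected number of frequent buckets: 1,000,000 (one per frequent pair, assuming the frequent pairs land in distinct buckets).
Each bucket receives P/#buckets of the pairs that occur once; a frequent bucket holds one more pair, (1 + P/#buckets), which we'll simplify to P/#buckets pairs.
Probability that a bucket is frequent: 1,000,000 / #buckets
Expected number of the P pairs (both items frequent) that hash to a frequent bucket and so become candidates:
P * (1,000,000 / #buckets)
Total expected memory consumption for pass 2 (12 bytes per pair: two 4-byte item IDs plus a 4-byte count):
P * 12,000,000 / #buckets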
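The estimate above can be written as a small function. This is only a sketch of the arithmetic in this post; the function name and parameter names are made up for illustration, and it assumes (as the analysis does) that the one million frequent pairs are ignored next to P and that each frequent pair sits in its own bucket.

```python
def pass2_candidate_bytes(p, num_buckets, frequent_pairs=1_000_000, bytes_per_pair=12):
    """Expected bytes needed on pass 2 to count the once-occurring pairs
    that hash to a frequent bucket (hypothetical helper, not from the course)."""
    # Chance that a given bucket holds one of the frequent pairs.
    prob_frequent_bucket = frequent_pairs / num_buckets
    # Expected number of the P pairs that survive the bitmap filter.
    expected_candidates = p * prob_frequent_bucket
    # Each candidate is stored as (item, item, count), 4 bytes each.
    return expected_candidates * bytes_per_pair


# Example: P = 20 billion pairs hashed into 250 million buckets.
example_bytes = pass2_candidate_bytes(20_000_000_000, 250_000_000)
print(f"{example_bytes:,.0f} bytes")
```

With these inputs the result is roughly 960 million bytes, which is the scale of memory the rest of the derivation has to fit into S.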


In pass 1, we need some space to count items (~4MB), and we can use the remainder of S as a hash table to help eliminate non-frequent pairs later on. This hash table is one integer (4 bytes) per bucket, so we can have at most (S - 4MB) / 4 ~= S/4 buckets in this hash table. 

Before we do pass 2, we will compress this table to a bitmap, but the number of buckets is still S/4. The buckets will just take less space in pass 2.

That's true, S/4 is an upper bound for #buckets (::facepalm:: how did I miss this simple bound?)
If we use S/4 buckets and compress the table to a bitmap of S/32 bytes (one bit per bucket), we're left with S*31/32 bytes for counting pairs.

According to my analysis, on the 2nd pass we'll need (P * 12,000,000 / #buckets) bytes for counting.
So if #buckets = S/4,
we'll need P * 12,000,000 / (S/4) = 48,000,000*P/S bytes for counting.

Since we have S*31/32 bytes free:
S*31/32 = 48,000,000*P/S
so
S^2 = 49,548,387 * P

and our bound:
P < S^2 / 49,548,387

For S = 1,000,000,000 this gives P of about 20,180,000,000, which matches the answer marked correct.
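To double-check the bound against the four answer choices, here is a minimal sketch that recomputes the largest P for a given S and tests each option against the 10% tolerance. The names `max_p` and `within_10_percent` are hypothetical helpers, and the code assumes exactly the simplifications used above (S/4 buckets, a bitmap of S/32 bytes, 12 bytes per candidate pair, the ~4MB of item counts ignored).

```python
def max_p(s, frequent_pairs=1_000_000, bytes_per_pair=12):
    """Largest P that fits in S bytes under the simplifications in this post."""
    num_buckets = s / 4        # pass 1: one 4-byte counter per bucket
    free_bytes = s * 31 / 32   # pass 2: what is left after the S/32-byte bitmap
    # Solve free_bytes = P * frequent_pairs * bytes_per_pair / num_buckets for P.
    return free_bytes * num_buckets / (frequent_pairs * bytes_per_pair)


def within_10_percent(s, p):
    """True if p is within 10% of the largest P allowed for this s."""
    limit = max_p(s)
    return abs(p - limit) <= 0.10 * limit


# The four (S, P) answer choices from the question.
options = [
    (1_000_000_000, 35_000_000_000),
    (1_000_000_000, 20_000_000_000),
    (100_000_000, 540_000_000),
    (200_000_000, 400_000_000),
]
matches = [(s, p) for s, p in options if within_10_percent(s, p)]
print(matches)
```

Only the S = 1,000,000,000, P = 20,000,000,000 choice passes the check, in agreement with the graded answer.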

Question 2

During a run of Toivonen's Algorithm with set of items {A,B,C,D,E,F,G,H} a sample is found to have the following maximal frequent itemsets: {A,B}, {A,C}, {A,D}, {B,C}, {E}, {F}. Compute the negative border. Then, identify in the list below the set that is NOT in the negative border.

Your Answer		Score	Explanation
{H}
{F,G}	Correct	1.00	Correct! This set is not in the negative border because its immediate proper subset {G} is not frequent.
{B,D}
{B,F}
Total		1.00 / 1.00
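The negative border can also be enumerated by brute force, which is a handy way to check the answer. The sketch below is not part of the quiz; `negative_border` is a hypothetical helper that assumes the usual definition (an itemset that is not frequent in the sample but whose immediate subsets all are) and treats the empty set as frequent, so that infrequent single items such as {G} and {H} land in the border.

```python
from itertools import combinations


def negative_border(items, maximal_frequent):
    """Itemsets that are not frequent but whose immediate subsets all are."""
    # Every subset of a maximal frequent itemset is frequent; so is the empty set.
    frequent = {frozenset()}
    for m in maximal_frequent:
        for k in range(1, len(m) + 1):
            frequent.update(frozenset(c) for c in combinations(sorted(m), k))

    border = set()
    # With 8 items there are only 255 non-empty itemsets, so checking all is cheap.
    for k in range(1, len(items) + 1):
        for c in combinations(sorted(items), k):
            candidate = frozenset(c)
            if candidate in frequent:
                continue
            # Immediate subsets are obtained by dropping exactly one item.
            if all(candidate - {x} in frequent for x in candidate):
                border.add(candidate)
    return border


items = set("ABCDEFGH")
maximal = [set("AB"), set("AC"), set("AD"), set("BC"), {"E"}, {"F"}]
border = negative_border(items, maximal)

for choice in ["H", "FG", "BD", "BF"]:
    print(set(choice), frozenset(choice) in border)
```

For this sample the border contains {G}, {H}, eleven pairs of frequent items, and the triple {A,B,C}; {F,G} is the only one of the four choices that is missing.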

posted @ 2014-10-16 05:22 Zhentiw
