[Big Data] Week 2: Frequent Itemsets
Question 1
- s, the support threshold, is 10,000.
- There are one million items, which are represented by the integers 0,1,...,999999.
- There are N frequent items, that is, items that occur 10,000 times or more.
- There are one million pairs that occur 10,000 times or more.
- There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items, the other M each have at least one nonfrequent item.
- No other pairs occur at all.
- Integers are always represented by 4 bytes.
As a function of N and M, what is the minimum number of bytes of main memory needed to execute the A-Priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minimum number of bytes of main memory, S, needed for the A-Priori algorithm to execute with this data.
Your Answer | | Score | Explanation
---|---|---|---
N = 20,000; M = 60,000,000; S = 1,000,000,000 | | |
N = 10,000; M = 40,000,000; S = 200,000,000 | Correct | 1.00 |
N = 20,000; M = 80,000,000; S = 1,100,000,000 | | |
N = 100,000; M = 50,000,000; S = 5,000,000,000 | | |
Total | | |
Based on the diagram on page 315 of the book (and ignoring the 'item names to integers' portion as it doesn't arise in this question), on the first pass, I will need one data structure to hold the counts of each item. This will be an array of length 1,000,000 (as there are a million items), which at 4 bytes an integer, is 4 million bytes. That's fine.
Only N of those million items are frequent, so for the second pass we only need an array of length N to keep track of the frequent items. This will take up 4N bytes, and will replace the 4 million bytes mentioned above.
The remainder of the memory required will be taken up with our count of the frequent pairs, which we can do using a triangular array or a hash table of triples.
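To make the two options concrete, here is a minimal sketch of both data structures. It assumes the frequent items have already been renumbered 0 to n-1; the function name `triangular_index` and the tiny `baskets` list are made up for illustration and are not from the question.

```python
from itertools import combinations


def triangular_index(i, j, n):
    """Position of the pair (i, j), with 0 <= i < j < n, in a flat array
    of length n*(n-1)//2 that stores one counter per unordered pair."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)


n = 4                                   # hypothetical number of frequent items
baskets = [[0, 1, 3], [1, 3], [0, 1]]   # made-up baskets of frequent items

# Triangular array: one 4-byte counter for every possible pair, present or not.
tri_counts = [0] * (n * (n - 1) // 2)

# Triples method: one (i, j, count) entry only for pairs that actually occur.
triple_counts = {}

for basket in baskets:
    for i, j in combinations(sorted(basket), 2):
        tri_counts[triangular_index(i, j, n)] += 1
        triple_counts[(i, j)] = triple_counts.get((i, j), 0) + 1

print(tri_counts)
print(triple_counts)
```

The triangular array pays for every possible pair whether it occurs or not, while the triples table pays three integers for each pair that does occur, which is why the cheaper one depends on how many distinct pairs show up.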
The triangular array will consist of a one-dimensional array of length $\binom{N}{2} = N(N-1)/2 \approx N^2/2$, which, at 4 bytes per slot, takes up about $2N^2$ bytes. So, for the triangular array method, our total number of bytes, S, will be calculated as follows:
$S = 2N^2 + 4N$
The hash table of triples way makes less sense to me, as I'm confused as to which lines of information I should use. I'm going to take a punt and, as I say above, please let me know exactly where I'm going wrong and explain to me what I should be doing and why.
Right, we'll need a hash table to hold M values. We'll need to record the two items in the pair and the count, so that means 3 integers x 4 bytes = 12 bytes per pair, so the size of this will be 12M bytes.
So, by going the hash table route, S will be:
$S = 4N + 12M$
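To see what this formula gives in practice, here is a small sketch that evaluates it for the (N, M) values in the four answer choices. It takes the formula exactly as written above; the name `triples_bytes` is just for illustration.

```python
BYTES_PER_INT = 4


def triples_bytes(n, m):
    """S = 4N + 12M as written in the text: 4 bytes per frequent item,
    plus three 4-byte integers (item, item, count) per stored pair."""
    return BYTES_PER_INT * n + 3 * BYTES_PER_INT * m


# (N, M) taken from the four answer choices
choices = [
    (20_000, 60_000_000),
    (10_000, 40_000_000),
    (20_000, 80_000_000),
    (100_000, 50_000_000),
]

for n, m in choices:
    print(n, m, triples_bytes(n, m))
```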
Checking these against the multiple-choice options: with N = 10,000 the triangular array formula gives 200,040,000 bytes, which is within 10% of the S = 200,000,000 option, but the hash table formula as I've written it doesn't fit any of the options. I also can't think of how I would determine which method takes up less memory (the triangular array method or the hash table one), given that we are supposed to find the minimum number of bytes of memory needed for the job. I'd appreciate your help on this too, please.
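One possible way to settle the comparison is to evaluate both methods for each answer choice and keep the smaller figure. The sketch below is a guess at that, not a confirmed solution. It makes two assumptions that go beyond the formulas above: the triples table has to hold every pair of two frequent items that occurs at all (the 1,000,000 frequent pairs plus the M pairs of frequent items that occur once), and the overall requirement is the larger of the first-pass and second-pass needs. All function names are hypothetical.

```python
BYTES_PER_INT = 4
NUM_ITEMS = 1_000_000        # from the question: one million items
FREQUENT_PAIRS = 1_000_000   # from the question: pairs occurring 10,000+ times


def first_pass_bytes():
    """One 4-byte counter per item."""
    return BYTES_PER_INT * NUM_ITEMS


def triangular_bytes(n):
    """S = 2N^2 + 4N, the triangular array formula from the text."""
    return 2 * n * n + BYTES_PER_INT * n


def triples_bytes(n, m):
    """ASSUMPTION: the table stores every occurring pair of two frequent
    items, i.e. FREQUENT_PAIRS + m entries of 12 bytes each."""
    return BYTES_PER_INT * n + 3 * BYTES_PER_INT * (FREQUENT_PAIRS + m)


def min_memory(n, m):
    """Cheaper pair-counting method, but never less than the first pass needs."""
    second_pass = min(triangular_bytes(n), triples_bytes(n, m))
    return max(first_pass_bytes(), second_pass)


def within_10_percent(estimate, target):
    return abs(estimate - target) <= 0.10 * target


# (N, M, S) taken from the four answer choices
choices = [
    (20_000, 60_000_000, 1_000_000_000),
    (10_000, 40_000_000, 200_000_000),
    (20_000, 80_000_000, 1_100_000_000),
    (100_000, 50_000_000, 5_000_000_000),
]

matches = [within_10_percent(min_memory(n, m), s) for n, m, s in choices]
for (n, m, s), ok in zip(choices, matches):
    print(n, m, s, min_memory(n, m), ok)
```

Under these assumptions, only the choice with N = 10,000 and M = 40,000,000 comes out within 10% of its stated S, which agrees with the option marked correct in the table above.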
Question 2
Your Answer | | Score | Explanation
---|---|---|---
{8} → 16 | | |
{4,6} → 12 | Correct | 1.00 |
{1,2} → 4 | | |
{2,3,5} → 45 | | |
Total | | 1.00 / 1.00 |
Question 3
Your Answer | | Score | Explanation
---|---|---|---
ABCD can be either frequent or not frequent. | Correct | 1.00 |
BCDEF can be either frequent or not frequent. | | |
AB can be either frequent or not frequent. | | |
ABCDE can be either frequent or not frequent. | | |
Total | | 1.00 / 1.00 |