[Big Data] Week 2: Frequent Itemsets

Question 1

Suppose we have transactions that satisfy the following assumptions: 
  • s, the support threshold, is 10,000.
  • There are one million items, which are represented by the integers 0,1,...,999999.
  • There are N frequent items, that is, items that occur 10,000 times or more.
  • There are one million pairs that occur 10,000 times or more.
  • There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items, the other M each have at least one nonfrequent item.
  • No other pairs occur at all.
  • Integers are always represented by 4 bytes.
Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers. 
As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minimum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.
 

 

Your Answer                                         Score          Explanation
N = 20,000; M = 60,000,000; S = 1,000,000,000
N = 10,000; M = 40,000,000; S = 200,000,000         Correct 1.00
N = 20,000; M = 80,000,000; S = 1,100,000,000
N = 100,000; M = 50,000,000; S = 5,000,000,000
Total

 

Based on the diagram on page 315 of the book (and ignoring the 'item names to integers' portion as it doesn't arise in this question), on the first pass, I will need one data structure to hold the counts of each item. This will be an array of length 1,000,000 (as there are a million items), which at 4 bytes an integer, is 4 million bytes. That's fine.

Only N of those million items are frequent, so for our second pass we only need to keep an array of length N for the frequent items. This will take up 4N bytes, and will replace the 4 million mentioned above.

The remainder of the memory required will be taken up with our count of the frequent pairs, which we can do using a triangular array or a hash table of triples.

The triangular array will consist of a one-dimensional array of length $\binom{N}{2} = N(N-1)/2$, which, at 4 bytes per slot, takes up $2N(N-1) \approx 2N^2$ bytes. So, for the triangular array method, our total number of bytes, S, will be calculated as follows:
$S = 2N^2 + 4N$
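To make the "one-dimensional array" idea concrete, here is a minimal sketch of how a pair (i, j) with i < j could be mapped to a slot in such an array. The function name `pair_index` and the 0-based numbering of the frequent items are assumptions made for this illustration, not something the question specifies.

```python
def pair_index(i, j, n):
    """Slot of the pair (i, j), 0 <= i < j < n, in a flat triangular array.

    Pairs are laid out row by row: (0,1), (0,2), ..., (0,n-1), (1,2), ...
    Hypothetical helper; items are assumed to be renumbered 0..n-1.
    """
    if not 0 <= i < j < n:
        raise ValueError("need 0 <= i < j < n")
    # i * (2n - i - 1) / 2 pairs come before row i; (j - i - 1) is the offset in the row.
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


def triangular_length(n):
    """Number of slots needed: n choose 2."""
    return n * (n - 1) // 2


n = 4
slots = [pair_index(i, j, n) for i in range(n) for j in range(i + 1, n)]
print(slots)                 # every slot from 0 to triangular_length(n) - 1, once each
print(triangular_length(n))
```

With 4 bytes per slot, the array for N frequent items therefore needs 4 * N(N-1)/2 = 2N(N-1) bytes, which is where the roughly 2N^2 figure comes from.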

The hash table of triples way makes less sense to me, as I'm confused as to which lines of information I should use. I'm going to take a punt and, as I say above, please let me know exactly where I'm going wrong and explain to me what I should be doing and why.

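Right, we'll need a hash table to hold one triple for every pair of frequent items that actually occurs. That is the 1,000,000 frequent pairs (both items of a frequent pair must themselves be frequent) plus the M pairs that occur exactly once and consist of two frequent items, so $10^6 + M$ triples. We'll need to record the two items in the pair and the count, so that means 3 integers x 4 bytes = 12 bytes per triple, so the size of this will be $12(10^6 + M)$.

So, by going the hash table route, S will be:
$S = 4N + 12(10^6 + M)$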

Since we are supposed to determine the minimum number of bytes of memory we'll need for the job, the way to choose between the triangular array method and the hash table one is to evaluate both formulas for the given N and M and take the smaller result. The right option in the multiple-choice is the one whose S is within 10% of that value.
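One way to compare the two methods is to compute both byte counts for each option and keep the smaller one. The sketch below does that under stated assumptions: the triangular array is counted exactly as 4 * N(N-1)/2 bytes, the hash table is assumed to hold 1,000,000 + M triples (the frequent pairs plus the once-occurring pairs made of two frequent items), and the 4N term for the frequent items is kept. The function names are made up for this illustration.

```python
INT_BYTES = 4
NUM_ITEMS = 1_000_000        # items 0..999999, from the question
FREQUENT_PAIRS = 1_000_000   # pairs occurring 10,000 times or more


def triangular_bytes(n):
    # One 4-byte count per pair of frequent items: 4 * C(n, 2).
    return INT_BYTES * n * (n - 1) // 2


def triples_bytes(m):
    # Assumption: one (item, item, count) triple per pair of frequent items that occurs.
    return 3 * INT_BYTES * (FREQUENT_PAIRS + m)


def min_memory(n, m):
    first_pass = INT_BYTES * NUM_ITEMS  # one count per item
    second_pass = INT_BYTES * n + min(triangular_bytes(n), triples_bytes(m))
    return max(first_pass, second_pass)


def fits(n, m, claimed, tolerance=0.10):
    actual = min_memory(n, m)
    return abs(claimed - actual) <= tolerance * actual


options = [
    (20_000, 60_000_000, 1_000_000_000),
    (10_000, 40_000_000, 200_000_000),
    (20_000, 80_000_000, 1_100_000_000),
    (100_000, 50_000_000, 5_000_000_000),
]
for n, m, s in options:
    print(n, m, s, min_memory(n, m), fits(n, m, s))
```

Under these assumptions only N = 10,000; M = 40,000,000; S = 200,000,000 comes out within 10%, with the triangular array being the cheaper method for that option.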

 

Question 2

Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?
 
Your Answer          Score          Explanation
{8} → 16
{4,6} → 12           Correct 1.00
{1,2} → 4
{2,3,5} → 45
Total                1.00 / 1.00
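The four rules can be checked by brute force. The following is a small sketch that builds the 100 baskets as described and computes the confidence of each rule as (baskets containing the left side and the right side) / (baskets containing the left side); the helper name `confidence` is an assumption for this illustration.

```python
from fractions import Fraction

# Basket j contains item i exactly when i divides j.
baskets = {j: {i for i in range(1, 101) if j % i == 0} for j in range(1, 101)}


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs over the 100 baskets."""
    lhs = set(lhs)
    with_lhs = [items for items in baskets.values() if lhs <= items]
    with_both = [items for items in with_lhs if rhs in items]
    return Fraction(len(with_both), len(with_lhs))


rules = [({8}, 16), ({4, 6}, 12), ({1, 2}, 4), ({2, 3, 5}, 45)]
for lhs, rhs in rules:
    print(sorted(lhs), "->", rhs, confidence(lhs, rhs))
```

A rule I → j has 100% confidence here exactly when every basket number divisible by all items of I is also divisible by j, that is, when j divides the least common multiple of the items in I. For {4,6} the least common multiple is 12, which is why that rule holds.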

 

Question 3

Suppose ABC is a frequent itemset and BCDE is NOT a frequent itemset. Given this information, we can be sure that certain other itemsets are frequent and sure that certain itemsets are NOT frequent. Other itemsets may be either frequent or not. Which of the following is a correct classification of an itemset?
 
Your Answer                                        Score          Explanation
ABCD can be either frequent or not frequent.       Correct 1.00
BCDEF can be either frequent or not frequent.
AB can be either frequent or not frequent.
ABCDE can be either frequent or not frequent.
Total                                              1.00 / 1.00
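The reasoning behind this question is monotonicity: every subset of a frequent itemset is frequent, and every superset of a non-frequent itemset is not frequent. A minimal sketch of that classification, with the hypothetical helper `classify` and itemsets written as strings of item letters, could look like this:

```python
def classify(itemset, frequent=frozenset("ABC"), infrequent=frozenset("BCDE")):
    """Classify an itemset given one known frequent and one known non-frequent set."""
    s = frozenset(itemset)
    if s <= frequent:
        return "frequent"        # subset of a frequent itemset
    if s >= infrequent:
        return "not frequent"    # superset of a non-frequent itemset
    return "either"              # monotonicity says nothing about it


for itemset in ["ABCD", "BCDEF", "AB", "ABCDE"]:
    print(itemset, classify(itemset))
```

ABCD is neither a subset of ABC nor a superset of BCDE, so it is the only one of the four options that can go either way.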
posted @ 2014-10-15 01:58  Zhentiw