[Big Data] Week 2: Frequent Itemsets

Question 1

Suppose we have transactions that satisfy the following assumptions:

s, the support threshold, is 10,000.
There are one million items, which are represented by the integers 0,1,...,999999.
There are N frequent items, that is, items that occur 10,000 times or more.
There are one million pairs that occur 10,000 times or more.
There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items, the other M each have at least one nonfrequent item.
No other pairs occur at all.
Integers are always represented by 4 bytes.

Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers.
As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minimum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.

Based on the diagram on page 315 of the book (and ignoring the 'item names to integers' portion as it doesn't arise in this question), on the first pass, I will need one data structure to hold the counts of each item. This will be an array of length 1,000,000 (as there are a million items), which at 4 bytes an integer, is 4 million bytes. That's fine.

Only N of those million items are frequent, so for the second pass we only need an array of length N to hold the counts of the frequent items. This takes up 4N bytes and replaces the 4 million bytes mentioned above.

The remainder of the memory required will be taken up by the counts of the candidate pairs (pairs of two frequent items), which we can store using either a triangular array or a hash table of triples.

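The triangular array will consist of a one-dimensional array of length N(N-1)/2, one entry for each pair of frequent items. At 4 bytes per count, that is 2N(N-1) bytes, or roughly 2N^2.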
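To check the answer choices, here is a minimal sketch of the comparison between the two counting methods. It rests on my reading of the problem, not on an official solution: the triangular array needs one 4-byte count for each of the N(N-1)/2 pairs of frequent items, while the triples method needs 12 bytes (two item numbers and a count) for every candidate pair that actually occurs, which is the one million frequent pairs plus the M once-only pairs made of two frequent items. The sketch ignores the first-pass array and the 4N bytes for the frequent-item counts, since both are small next to the pair counts. The function names are my own.

```python
INT_BYTES = 4
FREQUENT_PAIRS = 1_000_000  # pairs occurring 10,000 times or more (from the question)


def triangular_bytes(n):
    """Bytes for one 4-byte count per pair (i, j), i < j, of the n frequent items."""
    return INT_BYTES * n * (n - 1) // 2


def triples_bytes(m):
    """Bytes for (item, item, count) triples, one per candidate pair that occurs.

    Assumption: only pairs of two frequent items are counted on the second
    pass, i.e. the frequent pairs plus the m once-only pairs of frequent items.
    """
    return 3 * INT_BYTES * (FREQUENT_PAIRS + m)


def min_second_pass_bytes(n, m):
    """Memory needed if we pick the cheaper of the two methods."""
    return min(triangular_bytes(n), triples_bytes(m))


def within_10_percent(claimed, actual):
    """True if the claimed value is within 10% of the computed one."""
    return abs(claimed - actual) <= 0.10 * actual


# The four (N, M, S) choices offered in the quiz.
choices = [
    (20_000, 60_000_000, 1_000_000_000),
    (10_000, 40_000_000, 200_000_000),
    (20_000, 80_000_000, 1_100_000_000),
    (100_000, 50_000_000, 5_000_000_000),
]

for n, m, s in choices:
    actual = min_second_pass_bytes(n, m)
    print(f"N={n:,} M={m:,} S={s:,} computed={actual:,} "
          f"match={within_10_percent(s, actual)}")
```

Under these assumptions only the choice N = 10,000, M = 40,000,000 matches: the triangular array needs about 200 million bytes, while the triples would need 492 million.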

Question 2

Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?
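The data set here is small enough to check by brute force. The sketch below builds the 100 baskets from the divisibility rule and computes the confidence of a rule as the fraction of baskets containing the left-hand side that also contain the right-hand item. The helper names `basket` and `confidence` are my own.

```python
def basket(j):
    """Items in basket j: every i in 1..100 that divides j evenly."""
    return {i for i in range(1, 101) if j % i == 0}


BASKETS = [basket(j) for j in range(1, 101)]


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs over the 100 baskets."""
    with_lhs = [b for b in BASKETS if set(lhs) <= b]
    if not with_lhs:
        raise ValueError("no basket contains the left-hand side")
    with_both = [b for b in with_lhs if rhs in b]
    return len(with_both) / len(with_lhs)


# The four rules offered in the quiz.
for lhs, rhs in [({8}, 16), ({4, 6}, 12), ({1, 2}, 4), ({2, 3, 5}, 45)]:
    print(sorted(lhs), "->", rhs, confidence(lhs, rhs))
```

A basket contains every item in the left-hand side exactly when its number is a multiple of their least common multiple, so a rule has 100% confidence when the right-hand item divides that least common multiple. That is why {4,6} → 12 holds (the least common multiple of 4 and 6 is 12), while {8} → 16 fails on basket 8.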

Question 3

Suppose ABC is a frequent itemset and BCDE is NOT a frequent itemset. Given this information, we can be sure that certain other itemsets are frequent and sure that certain itemsets are NOT frequent. Other itemsets may be either frequent or not. Which of the following is a correct classification of an itemset?
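This question rests on monotonicity: every subset of a frequent itemset is frequent, and every superset of a non-frequent itemset is non-frequent. The sketch below applies just those two rules to the known facts; anything that is neither a subset of ABC nor a superset of BCDE is left undetermined. The function `classify` and its return labels are my own.

```python
def classify(itemset, frequent=("ABC",), not_frequent=("BCDE",)):
    """Classify an itemset (a string of item letters) using monotonicity only."""
    items = set(itemset)
    # Any subset of a known frequent itemset must itself be frequent.
    if any(items <= set(f) for f in frequent):
        return "frequent"
    # Any superset of a known non-frequent itemset cannot be frequent.
    if any(items >= set(g) for g in not_frequent):
        return "not frequent"
    return "either"


# The four itemsets offered in the quiz.
for candidate in ["ABCD", "BCDEF", "AB", "ABCDE"]:
    print(candidate, "->", classify(candidate))
```

ABCD is the only one of the four that stays undetermined: it is not contained in ABC (it has D) and does not contain BCDE (it lacks E).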

Question 1 answers

Your Answer		Score
N = 20,000; M = 60,000,000; S = 1,000,000,000
N = 10,000; M = 40,000,000; S = 200,000,000	Correct	1.00
N = 20,000; M = 80,000,000; S = 1,100,000,000
N = 100,000; M = 50,000,000; S = 5,000,000,000
Total

Question 2 answers

Your Answer		Score
{8} → 16
{4,6} → 12	Correct	1.00
{1,2} → 4
{2,3,5} → 45
Total		1.00 / 1.00

Question 3 answers

Your Answer		Score
ABCD can be either frequent or not frequent.	Correct	1.00
BCDEF can be either frequent or not frequent.
AB can be either frequent or not frequent.
ABCDE can be either frequent or not frequent.
Total		1.00 / 1.00

Answer1215
