[Big Data] Week 2: Frequent Itemsets (Advanced)
Question 1
Suppose we perform the PCY algorithm to find frequent pairs, with market-basket data meeting the following specifications:
- s, the support threshold, is 10,000.
- There are one million items, which are represented by the integers 0,1,...,999999.
- There are 250,000 frequent items, that is, items that occur 10,000 times or more.
- There are one million pairs that occur 10,000 times or more.
- There are P pairs that occur exactly once and consist of 2 frequent items.
- No other pairs occur at all.
- Integers are always represented by 4 bytes.
- When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.
Suppose there are S bytes of main memory. In order to run the PCY algorithm successfully, the number of buckets must be sufficiently large that most buckets are not large. In addition, on the second pass, there must be enough room to count all the candidate pairs. As a function of S, what is the largest value of P for which we can successfully run the PCY algorithm on this data? Demonstrate that you have the correct formula by indicating which of the following is a value for S and a value for P that is approximately (i.e., to within 10%) the largest possible value of P for that S.
Your Answer | Score | Explanation | |
---|---|---|---|
S = 1,000,000,000; P = 35,000,000,000 | |||
S = 1,000,000,000; P = 20,000,000,000 | Correct | 1.00 | |
S = 100,000,000; P = 540,000,000 | |||
S = 200,000,000; P = 400,000,000 | |||
Total | 1.00 / 1.00 |
Expected memory required for pass 2:
- Expected number of frequent buckets: 1,000,000.
Each frequent bucket hashes (1 + P/#buckets) pairs, which we'll simplify to P/#buckets pairs. - Probability of a bucket to be frequent: 1,000,000 / #buckets
- Number of pairs that are both frequent, and map to a frequent bucket:
P * (1,000,000 / #buckets) - Total expected memory consumption for pass 2 (12 bytes per pair):
- P * 12,000,000 / #buckets
In pass 1, we need some space to count items (~4MB), and we can use the remainder of S as a hash table to help eliminate non-frequent pairs later on. This hash table is one integer (4 bytes) per bucket, so we can have at most (S - 4MB) / 4 ~= S/4 buckets in this hash table. Before we do pass 2, we will compress this table to a bitmap, but the number of buckets is still S/4. The buckets will just take less space in pass 2. That's true, S/4 is an upper bound for #buckets (::facepalm:: how did I miss this simple bound?) If we'll use S/4 buckets, and compress it to S/32 bytes using bitmapping, we're left with S*31/32 bytes for counting pairs. According to my analysis, on the 2nd pass we'll need (P * 12,000,000 / #buckets) bytes for counting. So if #buckets = S/4, We'll need P * 12,000,000 / (S/4) = 48,000,000*P/S byes for counting. since we have S*31/32 bytes free: S*31/32 = 48,000,000*P/S so S^2 = 49,548,387 * P and our bound: P < S^2 / 49,548,387
Question 2
During a run of Toivonen's Algorithm with set of items {A,B,C,D,E,F,G,H} a sample is found to have the following maximal frequent itemsets: {A,B}, {A,C}, {A,D}, {B,C}, {E}, {F}. Compute the negative border. Then, identify in the list below the set that is NOT in the negative border.
Your Answer | Score | Explanation | |
---|---|---|---|
{H} | |||
{F,G} | Correct | 1.00 | Correct! This set is not in the negative border because immediate proper subset {G} is not frequent. |
{B,D} | |||
{B,F} | |||
Total | 1.00 / 1.00 |
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· SQL Server 2025 AI相关能力初探
· Linux系列:如何用 C#调用 C方法造成内存泄露
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 阿里最新开源QwQ-32B,效果媲美deepseek-r1满血版,部署成本又又又降低了!
· Manus重磅发布:全球首款通用AI代理技术深度解析与实战指南
· 开源Multi-agent AI智能体框架aevatar.ai,欢迎大家贡献代码
· 被坑几百块钱后,我竟然真的恢复了删除的微信聊天记录!
· AI技术革命,工作效率10个最佳AI工具