[Big Data] Week 3 (Basic)

Question 1

Suppose we hash the elements of a set S having 20 members, to a bit array of length 99. The array is initially all-0's, and we set a bit to 1 whenever a member of S hashes to it. The hash function is random and uniform in its distribution. What is the expected fraction of 0's in the array after hashing? What is the expected fraction of 1's? You may assume that 99 is large enough that asymptotic limits are reached.
 
Your Answer                                Score           Explanation
The fraction of 1's is 79/99.
The fraction of 1's is 1 - e^(-20/99).     Correct 1.00
The fraction of 1's is 20/99.
The fraction of 0's is 20/99.
Total                                      1.00 / 1.00
 
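To derive the correct formula, start by observing that the probability of any one element not hashing to a particular bit is 1 - 1/99, so the probability that no element hashes to a particular bit is (1 - 1/99)^20. Also, remember that in the limit as x goes to infinity, (1 - 1/x)^x is 1/e. Writing (1 - 1/99)^20 as ((1 - 1/99)^99)^(20/99) gives an expected fraction of 0's of approximately e^(-20/99), so the expected fraction of 1's is 1 - e^(-20/99).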
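As a quick numerical check of this derivation, here is a minimal Python sketch that compares the exact expression (1 - 1/99)^20 with the asymptotic approximation e^(-20/99). The function names are made up for illustration and are not from the course material.

```python
import math


def exact_zero_fraction(members: int, bits: int) -> float:
    """Probability that a given bit stays 0: (1 - 1/bits)^members."""
    return (1 - 1 / bits) ** members


def approx_zero_fraction(members: int, bits: int) -> float:
    """Asymptotic approximation of the same quantity: e^(-members/bits)."""
    return math.exp(-members / bits)


exact_zeros = exact_zero_fraction(20, 99)
approx_zeros = approx_zero_fraction(20, 99)

# The fraction of 1's is simply the complement of the fraction of 0's.
print(f"0's: exact = {exact_zeros:.4f}, approx = {approx_zeros:.4f}")
print(f"1's: exact = {1 - exact_zeros:.4f}, approx = {1 - approx_zeros:.4f}")
```

The two values should agree to about two or three decimal places, which is why the quiz allows the asymptotic form even though 99 is not especially large.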

Question 2

A certain Web mail service (like gmail, e.g.) has 10^8 users, and wishes to create a sample of data about these users, occupying 10^10 bytes. Activity at the service can be viewed as a stream of elements, each of which is an email. The element contains the ID of the sender, which must be one of the 10^8 users of the service, and other information, e.g., the recipient(s), and contents of the message. The plan is to pick a subset of the users and collect in the 10^10 bytes records of length 100 bytes about every email sent by the users in the selected set (and nothing about other users).

The method of Section 4.2.4 will be used. User ID's will be hashed to a bucket number, from 0 to 999,999. At all times, there will be a threshold t such that the 100-byte records for all the users whose ID's hash to t or less will be retained, and other users' records will not be retained. You may assume that each user generates emails at exactly the same rate as other users. As a function of n, the number of emails in the stream so far, what should the threshold t be in order that the selected records will not exceed the 10^10 bytes available to store records? From the list below, identify the true statement about a value of n and its value of t.

 
Your Answer               Score           Explanation
n = 10^9; t = 999
n = 10^12; t = 999
n = 10^13; t = 9          Correct 1.00
n = 10^11; t = 1000
Total                     1.00 / 1.00

From the problem we know that there are n emails in the stream so far and 10^6 buckets. Assuming user ID's hash evenly across the buckets, and since every user sends email at the same rate, each bucket accounts for about n/10^6 emails.

We also know that each email record needs 100 bytes, hence the space requirement per bucket is (n/10^6) * 100 bytes.

Records are retained for every bucket numbered 0 through t. That is t + 1 buckets, since the bucket count starts from 0.

So (space requirement per bucket) * (number of retained buckets) <= total available space:

(n/10^6) * 100 * (t + 1) <= 10^10

Further simplification gives

t <= (10^14 / n) - 1

For n = 10^13 this gives t <= 9.
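The inequality above can be evaluated for each n offered in the quiz. The following is a minimal Python sketch; `max_threshold` and its parameter names are hypothetical helpers for illustration, with defaults taken from the problem statement (10^6 buckets, 100-byte records, 10^10 bytes of storage). It uses integer arithmetic to avoid floating-point rounding and caps the result at the highest bucket number.

```python
def max_threshold(
    n: int,
    buckets: int = 10**6,
    record_bytes: int = 100,
    budget_bytes: int = 10**10,
) -> int:
    """Largest t with (n / buckets) * record_bytes * (t + 1) <= budget_bytes.

    Returns -1 if not even one bucket's records fit in the budget.
    """
    # Rearranged: t + 1 <= budget_bytes * buckets / (n * record_bytes)
    t = (budget_bytes * buckets) // (n * record_bytes) - 1
    # t can never exceed the highest bucket number.
    return min(t, buckets - 1)


# The (n, t) pairs offered as answers in the quiz.
options = [(10**9, 999), (10**12, 999), (10**13, 9), (10**11, 1000)]
for n, offered_t in options:
    best = max_threshold(n)
    print(f"n = {n:.0e}: largest t = {best}, offered t = {offered_t}, "
          f"match = {best == offered_t}")
```

Only the pair n = 10^13, t = 9 matches the largest threshold that still fits in the available space.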

posted @ Zhentiw