[Big Data] Week 2: LSH (Basic)

Question 1

The edit distance is the minimum number of character insertions and character deletions required to turn one string into another. Compute the edit distance between each pair of the strings he, she, his, and hers. Then, identify which of the following is a true statement about the number of pairs at a certain edit distance.
 
Your Answer                         Score           Explanation
There are 3 pairs at distance 1.
There is 1 pair at distance 4.      Correct 1.00
There are 4 pairs at distance 5.
There is 1 pair at distance 3.
Total                               1.00 / 1.00
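
As a check on the answer above, here is a minimal Python sketch. It assumes the insert/delete-only definition from the question, under which the edit distance equals len(a) + len(b) - 2 * LCS(a, b), where LCS is the length of the longest common subsequence. The function names `lcs_length` and `edit_distance` are illustrative, not from the course.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Insert/delete-only edit distance (no substitutions allowed)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


words = ["he", "she", "his", "hers"]
distances = {(a, b): edit_distance(a, b) for a, b in combinations(words, 2)}
for pair, d in distances.items():
    print(pair, d)
# ('he', 'she') 1
# ('he', 'his') 3
# ('he', 'hers') 2
# ('she', 'his') 4
# ('she', 'hers') 3
# ('his', 'hers') 3
```

Only she/his is at distance 4, while three pairs are at distance 3, which is why "There is 1 pair at distance 3" is false.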

Question 2

Consider the following matrix:

 

     C1  C2  C3  C4
R1    0   1   1   0
R2    1   0   1   1
R3    0   1   0   1
R4    0   0   1   0
R5    1   0   1   0
R6    0   1   0   0

Perform a minhashing of the data, with the order of rows: R4, R6, R1, R3, R5, R2. Which of the following is the correct minhash value of the stated column? Note: we give the minhash value in terms of the original name of the row, rather than the order of the row in the permutation. These two schemes are equivalent, since we only care whether hash values for two columns are equal, not what their actual values are.

 
Your Answer                         Score           Explanation
The minhash value for C1 is R6
The minhash value for C3 is R4      Correct 1.00
The minhash value for C1 is R2
The minhash value for C3 is R5
Total                               1.00 / 1.00
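
The minhash values can be verified with a short Python sketch: for each column, scan the rows in the permuted order and report the first row holding a 1. The dictionary layout and the `minhash` helper are assumptions made for illustration.

```python
# Characteristic matrix from the question: row name -> values for C1..C4.
matrix = {
    "R1": [0, 1, 1, 0],
    "R2": [1, 0, 1, 1],
    "R3": [0, 1, 0, 1],
    "R4": [0, 0, 1, 0],
    "R5": [1, 0, 1, 0],
    "R6": [0, 1, 0, 0],
}
order = ["R4", "R6", "R1", "R3", "R5", "R2"]  # permuted row order
columns = ["C1", "C2", "C3", "C4"]


def minhash(matrix, order, col_index):
    """Return the first row (in permuted order) with a 1 in the given column."""
    for row in order:
        if matrix[row][col_index] == 1:
            return row
    return None  # the column is all zeros


signature = {name: minhash(matrix, order, i) for i, name in enumerate(columns)}
print(signature)
# {'C1': 'R5', 'C2': 'R6', 'C3': 'R4', 'C4': 'R3'}
```

R4 comes first in the permutation and C3 has a 1 in R4, so the minhash value of C3 is R4.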

Question 3

Here is a matrix representing the signatures of seven columns, C1 through C7.

 

C1  C2  C3  C4  C5  C6  C7
 1   2   1   1   2   5   4
 2   3   4   2   3   2   2
 3   1   2   3   1   3   2
 4   1   3   1   2   4   4
 5   2   5   1   1   5   1
 6   1   6   4   1   1   4

Suppose we use locality-sensitive hashing with three bands of two rows each. Assume there are enough buckets available that the hash function for each band can be the identity function (i.e., columns hash to the same bucket if and only if they are identical in the band). Find all the candidate pairs, and then identify one of them in the list below.

 
Your Answer                         Score           Explanation
C2 and C3
C2 and C5                           Correct 1.00
C4 and C5
C2 and C7
Total                               1.00 / 1.00
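
The banding step can be sketched in Python as follows. Since the question lets each band use the identity function as its hash, the sketch uses the band's tuple of values as the bucket key. The name `candidate_pairs` and the data layout are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Signature matrix from the question: one list per row, one entry per column.
signatures = [
    [1, 2, 1, 1, 2, 5, 4],
    [2, 3, 4, 2, 3, 2, 2],
    [3, 1, 2, 3, 1, 3, 2],
    [4, 1, 3, 1, 2, 4, 4],
    [5, 2, 5, 1, 1, 5, 1],
    [6, 1, 6, 4, 1, 1, 4],
]
names = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]


def candidate_pairs(sig_rows, names, rows_per_band):
    """Columns that agree on every row of at least one band become candidates."""
    pairs = set()
    for start in range(0, len(sig_rows), rows_per_band):
        band = sig_rows[start:start + rows_per_band]
        buckets = defaultdict(list)
        for col, name in enumerate(names):
            key = tuple(row[col] for row in band)  # identity "hash" of the band
            buckets[key].append(name)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


pairs = candidate_pairs(signatures, names, rows_per_band=2)
print(sorted(pairs))
# [('C1', 'C3'), ('C1', 'C4'), ('C1', 'C6'), ('C2', 'C5'), ('C4', 'C7')]
```

C2 and C5 agree on the first band (values 2, 3), so that pair is a candidate; none of the other three listed options appears in the result.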

Question 4

Find the set of 2-shingles for the "document":

 

ABRACADABRA

 

and also for the "document":

 

BRICABRAC

 

Answer the following questions:

 

  1. How many 2-shingles does ABRACADABRA have?
  2. How many 2-shingles does BRICABRAC have?
  3. How many 2-shingles do they have in common?
  4. What is the Jaccard similarity between the two documents?

Then, find the true statement in the list below.

 
Your Answer                         Score           Explanation
ABRACADABRA has 10 2-shingles.
ABRACADABRA has 9 2-shingles.
There are 5 shingles in common.     Correct 1.00
There are 4 shingles in common.
Total
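
The four sub-questions can be worked out with a short Python sketch. It treats each document as a set of 2-character substrings, so repeated shingles count once. The `shingles` helper is an illustrative name.

```python
from fractions import Fraction


def shingles(text: str, k: int = 2) -> set:
    """Set of all length-k substrings of text (duplicates collapse)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


a = shingles("ABRACADABRA")
b = shingles("BRICABRAC")
common = a & b
jaccard = Fraction(len(common), len(a | b))

print(len(a), sorted(a))            # 7 ['AB', 'AC', 'AD', 'BR', 'CA', 'DA', 'RA']
print(len(b), sorted(b))            # 7 ['AB', 'AC', 'BR', 'CA', 'IC', 'RA', 'RI']
print(len(common), sorted(common))  # 5 ['AB', 'AC', 'BR', 'CA', 'RA']
print(jaccard)                      # 5/9
```

ABRACADABRA has 10 shingle positions but only 7 distinct shingles, so both statements about its shingle count are false. The union has 9 shingles, giving a Jaccard similarity of 5/9.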

 

Question 5

Suppose we want to assign points to whichever of the points (0,0) or (100,40) is nearer. Depending on whether we use the L1 or L2 norm, a point (x,y) could be clustered with a different one of these two points. For this problem, you should work out the conditions under which a point will be assigned to (0,0) when the L1 norm is used, but assigned to (100,40) when the L2 norm is used. Identify one of those points from the list below.
 
Your Answer                         Score           Explanation
(53,15)                             Correct 1.00
(58,13)
(52,13)
(54,8)
Total                               1.00 / 1.00
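
The condition can be checked directly for each option with a small Python sketch. It compares squared L2 distances, which preserves the ordering and keeps the arithmetic in integers. The helper names `l1` and `l2_squared` are illustrative.

```python
def l1(p, q):
    """Manhattan (L1) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def l2_squared(p, q):
    """Squared Euclidean (L2) distance; enough for comparing which is nearer."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


A = (0, 0)
B = (100, 40)
candidates = [(53, 15), (58, 13), (52, 13), (54, 8)]

# Keep points nearer to A under L1 but nearer to B under L2.
matches = [
    p for p in candidates
    if l1(p, A) < l1(p, B) and l2_squared(p, B) < l2_squared(p, A)
]
print(matches)
# [(53, 15)]
```

For (53,15) the L1 distances are 68 to (0,0) and 72 to (100,40), while the squared L2 distances are 3034 and 2834, so the two norms assign it to different points.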
Posted by Zhentiw