lecture 1
1. not double pass; all homework is submitted through give. All exam questions are short-answer
2. consultation is on Fridays via Zoom, from 1 to 2
3. major characteristics of big data: volume, variety, velocity, value, visibility, variability, veracity
i.e. quantity (size), types, speed, value, visibility, variability, accuracy
4. volume
quantity of data being created from all sources
the problems this brings are insufficient storage space and time complexity
as volume increases, cost increases
5. variety
1) different types:
relational data (tables/transactions): structured, fixed schema
text data (books, reports): unstructured
semi-structured data (JSON, XML): has some organization as well
graph data (social network, RDF)
image/video data (Instagram, YouTube)
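A rough sketch of how the first few types might look in code. All names and values here (Alice, Bob, the columns) are made up for illustration, not from the lecture:

```python
import json

# relational: fixed schema, every row has the same columns
columns = ("id", "name", "city")
rows = [(1, "Alice", "Sydney"), (2, "Bob", "Perth")]

# semi-structured: JSON keys give some organization, but fields can vary per record
doc = json.loads('{"id": 1, "name": "Alice", "tags": ["student"]}')

# unstructured text: no schema at all
text = "Alice moved to Sydney last year."

# graph: entities connected by labelled edges (e.g. a social network)
edges = [("Alice", "follows", "Bob")]

# the fixed schema means every row can be read the same way
relational_as_dicts = [dict(zip(columns, r)) for r in rows]
```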
2) different sources
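in practice, one app will have multiple sources, or be composed of different kinds of information
data integration combines information from different sources into a single unified view
the main problem of data integration is heterogeneity: the schemas of the sources differ from each other
the traditional solution is schema mapping; the difficulty and time complexity depend on the level of heterogeneity and on the data sources
another problem is record linkage in variety data: deciding whether 2 records refer to the same entity or not, which requires using the different information from the different sources as exhaustively as possible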
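A minimal sketch of schema mapping and record linkage, assuming two hypothetical sources with different field names. The sample records, the mapping tables, and the 0.8 similarity threshold are all assumptions made for illustration; real systems use far more careful matching:

```python
from difflib import SequenceMatcher

# Two hypothetical sources describing people with different schemas.
source_a = [{"full_name": "John Smith", "mail": "jsmith@example.com"}]
source_b = [
    {"name": "Jon Smith", "email": "jsmith@example.com"},
    {"name": "Mary Jones", "email": "mj@example.com"},
]

# Schema mapping: source field -> field of the unified view.
MAPPING_A = {"full_name": "name", "mail": "email"}
MAPPING_B = {"name": "name", "email": "email"}


def to_view(record, mapping):
    """Rename a source record's fields into the unified view's schema."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def same_entity(r1, r2, threshold=0.8):
    """Guess whether two view records refer to the same entity.

    Matches on identical email, otherwise on name similarity.
    The threshold is an arbitrary choice for this sketch.
    """
    if r1["email"].lower() == r2["email"].lower():
        return True
    sim = SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()
    return sim >= threshold


view_a = [to_view(r, MAPPING_A) for r in source_a]
view_b = [to_view(r, MAPPING_B) for r in source_b]

# Record linkage: compare every pair across the two sources.
links = [(a["name"], b["name"]) for a in view_a for b in view_b if same_entity(a, b)]
```

Comparing every pair is quadratic in the number of records, which is one reason the cost grows with the number and size of the sources.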
data curation is the organization and integration of data collected from various sources; a possible problem is the long tail of data variety
data curation makes the data more ordered, which can reduce the long tail
6. velocity( speed )
many applications need timely feedback
what needs to be solved: batch processing, real time processing, and transmission
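A small sketch of the difference between batch processing and real time processing, using a mean as the example computation. The readings are made-up values and `RunningMean` is a hypothetical helper, not something from the lecture:

```python
def batch_mean(values):
    """Batch processing: wait for the whole dataset, then compute once."""
    return sum(values) / len(values)


class RunningMean:
    """Real time processing: update the answer as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # incremental update, no need to store earlier values
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


readings = [2.0, 4.0, 6.0, 8.0]  # hypothetical sensor readings

stream = RunningMean()
partial = [stream.update(x) for x in readings]  # an answer after every arrival
```

The batch version gives one answer at the end; the streaming version has an up-to-date answer after every value, which is what applications needing timely feedback rely on.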