semi-structured data (notes)

data management
data model, schema

data model: a collection of concepts for describing data

schema: a description of a particular collection of data, expressed using a given data model

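files, filesystem

A filesystem is essentially a collection of files with a hierarchical namespace (large multimedia collections are content-addressed instead).

It handles the physical layout of bytes, stores metadata, and provides an API.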
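To make the difference between a hierarchical namespace and content addressing concrete, here is a minimal sketch. The in-memory dicts, the path, and the choice of SHA-256 are assumptions for illustration only; the notes do not say which hash or store a real system would use.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hex SHA-256 digest to use as the storage key for `data`."""
    return hashlib.sha256(data).hexdigest()


# Hierarchical namespace: the name says where the file lives, not what it holds.
hierarchical_store = {"/photos/2017/cat.jpg": b"fake image bytes"}

# Content-addressed store: the key is derived from the bytes themselves.
blob = b"fake image bytes"
content_store = {content_address(blob): blob}

# Storing identical bytes again yields the same key, so duplicates collapse.
content_store[content_address(b"fake image bytes")] = b"fake image bytes"
```

In the hierarchical case two copies under different paths are two entries; in the content-addressed case they share one key.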

file format

data model; physical layout; field units and validation; metadata; plain text or binary; delimiters and escaping; compression, encryption; schema

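tabular data: missing values, wrong type inference (2? 2.0?), inconsistent data values/types, no support for anything else, sensor offline... (drawbacks when the data comes from sensors)

¥? $? Wal-Mart or WalMart? ...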
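The problems listed above can be reproduced with a few tiny CSV snippets. This is only a sketch: the column names and values are made up, and it assumes pandas is the tool doing the type inference.

```python
import io

import pandas as pd

# All readings present: the column is inferred as an integer type.
csv_a = "sensor,reading\na,2\nb,3\n"
# One reading missing (e.g. the sensor was offline): the same column
# is now inferred as floating point so that it can hold NaN.
csv_b = "sensor,reading\na,2\nb,\n"
# Inconsistent values: two spellings of one store, two currencies.
csv_c = "store,price\nWal-Mart,$5\nWalMart,¥30\n"

df_a = pd.read_csv(io.StringIO(csv_a))
df_b = pd.read_csv(io.StringIO(csv_b))
df_c = pd.read_csv(io.StringIO(csv_c))

print(df_a["reading"].dtype, df_b["reading"].dtype)
print(df_b["reading"].isna().sum(), "missing reading(s)")

# The price column cannot be parsed as a number at all, and the store
# names only agree after an explicit normalization step.
normalized = df_c["store"].str.replace("-", "", regex=False)
print(df_c["store"].nunique(), "->", normalized.nunique(), "distinct stores")
```

The same logical value (2) ends up as an integer in one file and as 2.0 in the other, purely because of a missing cell elsewhere in the column.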

pandas

DataFrame: a table with named columns; behaves like a Python dict (column_name -> Series)

Series: a column; a labeled array capable of holding any data type
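A short sketch of the DataFrame/Series relationship described above. The column names and values are invented for illustration.

```python
import pandas as pd

# Each column starts life as a Series (a labeled array).
name = pd.Series(["a", "b", "c"])
score = pd.Series([1.5, 2.0, 3.25])

# A DataFrame can be built from a dict mapping column names to Series.
df = pd.DataFrame({"name": name, "score": score})

# Indexing by column name gives back a Series, just like a dict lookup.
col = df["score"]
print(type(col).__name__)

# A Series can hold values of any type and carries its own labels.
mixed = pd.Series([1, "two", 3.0], index=["x", "y", "z"])
print(mixed["y"])
```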

Semi-Structured Data in PySpark

Equivalent to a pandas or R DataFrame, but distributed. Column types are inferred from the values.

A pandas DataFrame can be converted to a PySpark DataFrame and back: context.createDataFrame(pandas_df) / spark_df.toPandas()
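A minimal sketch of the round trip between the two DataFrame types. It assumes a local Spark installation and uses `SparkSession` as the entry point in place of the `context` object mentioned in the notes; the function name `round_trip` and the app name are hypothetical.

```python
import pandas as pd


def round_trip(pandas_df: pd.DataFrame) -> pd.DataFrame:
    """Convert pandas -> Spark -> pandas using a local Spark session."""
    # Imported here so that defining the function does not require Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("round-trip-sketch")
        .getOrCreate()
    )
    try:
        spark_df = spark.createDataFrame(pandas_df)  # pandas -> Spark
        spark_df.printSchema()  # column types were inferred from the values
        return spark_df.toPandas()  # Spark -> pandas
    finally:
        spark.stop()
```

With Spark available, `round_trip(pd.DataFrame({"n": [1, 2]}))` should print the inferred schema and hand back an ordinary pandas DataFrame. Note that `toPandas()` collects all rows onto the driver, so it only suits data that fits in local memory.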

performance: Scala DataFrame or PySpark DataFrame

Semi-Structured Log file

Apache Common Log Format: specifies the format of each log line

identity: user identity from the remote machine, or from local logon

- : a hyphen means the value is not available

request time: date + time + time zone

client request: request method + uniform resource identifier (the URI of the content the client wants to retrieve) + client protocol version, followed by the status code the server sent back + the size of the object returned to the client (- or 0 if nothing was returned)
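The fields above can be pulled out of a log line with a regular expression. This is a sketch, not a complete parser: the pattern and the helper name `parse_clf_line` are my own, and the sample line is the well-known example from the Apache documentation.

```python
import re
from datetime import datetime

# host, identity, user, [time], "method uri protocol", status, size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf_line(line):
    """Parse one Common Log Format line into a dict, or return None."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    rec = match.groupdict()
    # A hyphen means the value is not available.
    for key in ("identity", "user"):
        if rec[key] == "-":
            rec[key] = None
    # Request time is date + time + time zone.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # Size is "-" (treated as 0 here) when no object was returned.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(sample)
print(record["method"], record["uri"], record["status"], record["size"])
```

Lines that do not match come back as None, which makes it easy to count and inspect malformed entries separately instead of failing on them.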

Splunk: dashboard UI

file performance

read/write, plain text/binary format, pandas/Scala, uncompressed/compressed

(pandas has no binary I/O)

LZ4 compression performs about the same as raw I/O, and better than gzip.
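A rough way to run this kind of comparison yourself, as a sketch under stated assumptions: it uses only the standard library, so it times raw I/O against gzip and leaves out LZ4 (which needs a third-party package). The payload is synthetic, and the timings depend on the machine and are not a reproduction of the numbers behind these notes.

```python
import gzip
import os
import tempfile
import time


def timed_write_read(path, opener, payload):
    """Write then read `payload` via `opener`; return timings, size, data."""
    start = time.perf_counter()
    with opener(path, "wb") as f:
        f.write(payload)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    with opener(path, "rb") as f:
        data = f.read()
    read_s = time.perf_counter() - start
    return write_s, read_s, os.path.getsize(path), data


# Synthetic, highly repetitive log data (an assumption for illustration).
line = b'127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326\n'
payload = line * 50_000

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for label, opener in (("raw", open), ("gzip", gzip.open)):
        path = os.path.join(tmp, "log." + label)
        write_s, read_s, size, data = timed_write_read(path, opener, payload)
        assert data == payload  # the round trip must be lossless
        results[label] = {"write_s": write_s, "read_s": read_s, "bytes": size}

for label, r in results.items():
    print(f"{label}: write {r['write_s']:.4f}s, "
          f"read {r['read_s']:.4f}s, {r['bytes']} bytes on disk")
```

The on-disk size shows what gzip buys, and the timings show what it costs in CPU; a faster codec such as LZ4 trades some compression ratio to shrink that cost.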

posted on 2017-09-27 00:41 by satyrs
