Semi-structured data (notes)
- data management
- data model, schema
data model: a collection of concepts for describing data
schema: a description of a particular collection of data, written using a given data model
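- files, filesystem
A filesystem is essentially a collection of files with a hierarchical namespace (large multimedia collections are instead addressed by content).
It defines the physical layout of bytes, stores metadata, and provides an API.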
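A minimal sketch of the two naming schemes, assuming SHA-256 as the content hash (the notes do not say which hash such collections use). The directory names, the file name, and the `content_address` helper are made up for illustration.

```python
import hashlib
import os
import tempfile


def content_address(data: bytes) -> str:
    """Return a name derived only from the bytes (SHA-256 hex digest, an assumption)."""
    return hashlib.sha256(data).hexdigest()


data = b"example image bytes"

with tempfile.TemporaryDirectory() as root:
    # Hierarchical namespace: the name is a path chosen by the user.
    path = os.path.join(root, "photos", "2016", "cat.jpg")
    os.makedirs(os.path.dirname(path))
    with open(path, "wb") as f:
        f.write(data)

    # Metadata stored by the filesystem, reached through its API.
    size_on_disk = os.stat(path).st_size

    # Content addressing: the name is computed from the bytes themselves.
    address = content_address(data)
    store_path = os.path.join(root, address)
    with open(store_path, "wb") as f:
        f.write(data)
    with open(store_path, "rb") as f:
        same_bytes = f.read() == data
```

With content addressing, two copies of the same bytes always get the same name, so duplicates collapse to one entry.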
- file format
data model; physical layout; field units and validation; metadata; plain text or binary; delimiters and escaping; compression, encryption; schema
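To make "delimiters and escaping" concrete, here is a small sketch using Python's standard `csv` module. The sample rows are invented; the point is only that a field containing the delimiter or a quote character has to be escaped, and that naive splitting on commas gets it wrong.

```python
import csv
import io

rows = [
    ["name", "quote"],
    ["Smith, John", 'said "hi"'],  # field contains the delimiter and quote characters
]

# Write with the default CSV dialect: fields are quoted when needed
# and embedded quotes are doubled.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# A real CSV parser undoes the escaping and recovers the original fields.
parsed = list(csv.reader(io.StringIO(text)))

# Naive splitting on the delimiter breaks the first field in two.
naive = text.splitlines()[1].split(",")
```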
- tabular data: missing values, wrong type inference (2? 2.0?), inconsistent data values/types, other things not supported, sensor offline... (problems seen with sensor data)
e.g. ¥ or $? Wal-Mart or WalMart? ...
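A short pandas sketch of two of these problems. The column names and values are invented, and stripping the hyphen is just one possible normalization, not a rule from the notes.

```python
import io

import pandas as pd

# One reading is missing (sensor "b" reported nothing).
with_gap = pd.read_csv(io.StringIO("sensor,reading\na,2\nb,\nc,3\n"))
# The same column without the gap.
complete = pd.read_csv(io.StringIO("sensor,reading\na,2\nc,3\n"))

# The missing value forces the whole column to float: 2 is now stored as 2.0.
gap_dtype = str(with_gap["reading"].dtype)
complete_dtype = str(complete["reading"].dtype)

# Inconsistent spellings of one value look like two different stores.
names = pd.Series(["Wal-Mart", "WalMart", "Wal-Mart"])
distinct_before = names.nunique()
distinct_after = names.str.replace("-", "", regex=False).nunique()
```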
- pandas
DataFrame: a table with named columns, like a Python dict (column_name -> Series)
Series: a column; a labeled array capable of holding any data type
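The "dict of Series" view can be sketched directly. The column names, index labels, and values below are invented.

```python
import pandas as pd

# Two Series sharing the same labels (the index).
price = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"])
label = pd.Series(["x", "y", "z"], index=["a", "b", "c"])

# A DataFrame built from a dict mapping column_name -> Series.
df = pd.DataFrame({"price": price, "label": label})

# Selecting a column by name gives back a Series.
price_column = df["price"]

# Each cell is addressed by row label and column name.
b_price = df.loc["b", "price"]
```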
- Semi-Structured Data in pySpark
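Equivalent to a pandas or R DataFrame, but distributed. Column types are inferred from the values.
A pandas DF and a pySpark DF can be converted into each other: spark_df.toPandas() (pySpark -> pandas) / context.createDataFrame(pandas_df) (pandas -> pySpark)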
- performance: Scala DF or pySpark DF
- Semi-Structured Log file
Apache Common Log Format: specifies the log format
identity: from the remote machine or the local logon
"-" (hyphen) means not available
request time: date + time + time zone
client request: request method + uniform resource identifier (URI, the content the client wants to retrieve) + client protocol version
response: status code the server sent back + size of the object returned to the client (- or 0)
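A minimal parsing sketch for one Common Log Format line, using a regular expression. The pattern, the `parse_clf` helper, and the choice to map a "-" size to 0 are my own assumptions; the sample line is the usual example from the Apache documentation.

```python
import re
from datetime import datetime

# host identity user [time] "method uri protocol" status size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # A hyphen means the value is not available.
    for key in ("identity", "user"):
        if record[key] == "-":
            record[key] = None
    # Request time: date + time + time zone.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Size may be "-" when nothing was returned; treated as 0 here.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_clf(sample)
```

Lines that do not fit the pattern come back as `None`, which is worth counting in real logs, since semi-structured files rarely conform everywhere.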
- Splunk: dashboard UI
- file performance
read/write, plain text/binary format, pandas/Scala, uncompressed/compressed
(no binary I/O case for pandas)
LZ4 compression performs about the same as raw I/O and better than gzip.
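A rough way to repeat part of this comparison with the Python standard library only. This sketch covers raw vs gzip; LZ4 is not in the standard library and would need a third-party binding, so it is left out. The payload is synthetic, and timings will vary by machine, so no numbers are claimed here.

```python
import gzip
import os
import tempfile
import time

# Synthetic plain-text payload (CSV-like rows); invented for this sketch.
payload = b"".join(b"%d,sensor-%d,%d\n" % (i, i % 10, i * 7) for i in range(20000))


def write_bytes(opener, path, data):
    with opener(path, "wb") as f:
        f.write(data)


def read_bytes(opener, path):
    with opener(path, "rb") as f:
        return f.read()


results = {}
with tempfile.TemporaryDirectory() as tmp:
    # The built-in open() is the uncompressed case; gzip.open() compresses on write.
    for name, opener in (("raw", open), ("gzip", gzip.open)):
        path = os.path.join(tmp, "data." + name)
        t0 = time.perf_counter()
        write_bytes(opener, path, payload)
        t1 = time.perf_counter()
        data = read_bytes(opener, path)
        t2 = time.perf_counter()
        results[name] = {
            "write_s": t1 - t0,
            "read_s": t2 - t1,
            "bytes_on_disk": os.path.getsize(path),
            "round_trip_ok": data == payload,
        }
```

Printing `results` shows the trade-off being measured: fewer bytes on disk for the compressed file against extra CPU time to write and read it.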