Semi-structured data (notes)

  • data management
  • data model, schema

data model: a collection of concepts for describing data

schema: a description of a particular collection of data, expressed using a given data model
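The distinction can be sketched in a few lines, assuming a simple record-style data model (records with named, typed fields). `SCHEMA` and `conforms` are hypothetical names used only for illustration.

```python
# Data model (assumed here): a collection is a list of records,
# and every record is a mapping from field names to typed values.
# Schema: a description of one particular collection using that model.
SCHEMA = {"store": str, "price": float, "units": int}  # hypothetical example schema


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[name], typ) for name, typ in schema.items())


good = {"store": "WalMart", "price": 2.0, "units": 3}
bad = {"store": "WalMart", "price": "2.0", "units": 3}  # price is a string, not a float

print(conforms(good, SCHEMA))
print(conforms(bad, SCHEMA))
```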

  • files, filesystem

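A filesystem is essentially a collection of files with a hierarchical namespace (large multimedia collections are instead addressed by content).

It defines the physical layout of bytes, stores metadata, and provides an API.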
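A minimal sketch of these three roles using only the Python standard library: nested paths stand for the hierarchical namespace, `os.stat` reads stored metadata, and `pathlib` is the API. The directory and file names are made up for the example.

```python
import os
import tempfile
from pathlib import Path


def describe_tree(root):
    """Walk a directory tree and return {relative_path: size_in_bytes} for every file."""
    root = Path(root)
    sizes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # os.stat exposes the metadata the filesystem stores about the file.
            sizes[path.relative_to(root).as_posix()] = os.stat(path).st_size
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    # Hierarchical namespace: files are addressed by a path of nested names.
    (Path(tmp) / "logs" / "2017").mkdir(parents=True)
    (Path(tmp) / "logs" / "2017" / "access.log").write_bytes(b"hello\n")
    (Path(tmp) / "readme.txt").write_bytes(b"notes")
    tree = describe_tree(tmp)

print(tree)
```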

  • file format

data model; physical layout; field units and validation; metadata; plain text or binary; delimiters and escaping; compression, encryption; schema
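As one illustration of "delimiters and escaping", here is a small sketch using Python's standard `csv` module. The sample rows are invented; the point is that a field containing the delimiter or the quote character survives a write/read round trip.

```python
import csv
import io

rows = [
    ["store", "note"],
    ["Wal-Mart", 'sells "2.0" items, sometimes'],  # contains the delimiter and quotes
]

# Write: the csv module quotes fields that contain the delimiter
# and escapes embedded quote characters by doubling them.
buffer = io.StringIO()
csv.writer(buffer, delimiter=",", quotechar='"', lineterminator="\n").writerows(rows)
encoded = buffer.getvalue()

# Read: parsing with the same settings recovers the original fields.
decoded = list(csv.reader(io.StringIO(encoded), delimiter=",", quotechar='"'))

print(encoded)
print(decoded == rows)
```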

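  • tabular data: missing values, incorrect type inference (2 or 2.0?), inconsistent data values/types, no support for anything else, sensor offline... (problems with sensor data)

Examples of inconsistency: ¥ or $? Wal-Mart or WalMart? ...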
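A rough sketch of two of these problems in plain Python: inferring a column type from raw strings (is it 2 or 2.0?) and reconciling spelling variants such as Wal-Mart vs WalMart. `infer_type`, `normalize_name`, and the set of missing-value markers are assumptions made for illustration, not a standard.

```python
def infer_type(values, missing=("", "NA", "-")):
    """Infer the narrowest type (int, float, or str) that fits all non-missing strings."""
    present = [v for v in values if v not in missing]
    if not present:
        return str  # nothing to infer from, fall back to text
    for caster in (int, float):
        try:
            for v in present:
                caster(v)
            return caster
        except ValueError:
            continue
    return str


def normalize_name(name):
    """Drop punctuation and case so spelling variants compare equal."""
    return "".join(ch for ch in name if ch.isalnum()).casefold()


print(infer_type(["2", "3", ""]).__name__)
print(infer_type(["2", "2.0", "NA"]).__name__)
print(infer_type(["2", "¥3"]).__name__)
print(normalize_name("Wal-Mart") == normalize_name("WalMart"))
```

Note how a single "2.0" widens the whole column to float, and a single currency symbol forces it to text.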

  • pandas

DataFrame: a table with named columns; in Python terms, a dict (column_name -> Series)

Series: a column; a labeled array capable of holding any data type
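A short sketch of the DataFrame/Series relationship, assuming pandas is installed. The column names and values are invented for the example.

```python
import pandas as pd

# A DataFrame can be built from a dict mapping column names to Series.
df = pd.DataFrame({
    "store": pd.Series(["Wal-Mart", "WalMart", "Target"]),
    "price": pd.Series([2, 2.0, None]),  # mixed int/float with a missing value
})

# Selecting a column returns a Series: a labeled array with its own dtype.
price = df["price"]
print(type(price).__name__)
print(price.dtype)
print(price.isna().sum())
```

The missing value is why the `price` column ends up as a floating-point column rather than an integer one.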

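  • Semi-Structured Data in PySpark

A PySpark DataFrame is the equivalent of a pandas or R DataFrame, but distributed. Column types are inferred from the values.

A pandas DataFrame can be converted to a PySpark DataFrame and back: context.createDataFrame(pandas_df) goes from pandas to Spark, spark_df.toPandas() goes from Spark to pandas.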
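The two conversion calls can be wrapped as below. This is only a sketch: `round_trip` is a hypothetical helper, and `spark` is assumed to be an existing session or context object that provides `createDataFrame` (for example one obtained from `SparkSession.builder.getOrCreate()`).

```python
def round_trip(spark, pandas_df):
    """Convert a pandas DataFrame to a Spark DataFrame and back again."""
    # pandas -> Spark: the data becomes a distributed DataFrame.
    spark_df = spark.createDataFrame(pandas_df)
    # Spark -> pandas: all rows are collected back into local memory.
    return spark_df.toPandas()
```

Because `toPandas()` collects everything locally, it is only sensible for data that fits in the memory of one machine.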

  • performance: Scala DataFrame or PySpark DataFrame
  • Semi-Structured Log file

Apache Common Log Format: specify the log format

identity from remote machine or local logon

-: a hyphen means the value is not available

request time: date+time+time zone

client request: request method + uniform resource identifier (URI, the content the client wants to retrieve) + client protocol version, followed by the status code the server sent back and the size of the object returned to the client (- or 0 when no content is returned)
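The fields above can be pulled out of a log line with a regular expression. This is a minimal sketch, not a complete parser: `CLF_PATTERN` and `parse_clf` are illustrative names, the sample line follows the example in the Apache documentation, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf(line):
    """Parse one Common Log Format line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # A hyphen means the value is not available.
    for key in ("identity", "user"):
        if entry[key] == "-":
            entry[key] = None
    # Request time: date + time + time zone.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_clf(sample)
print(entry["method"], entry["uri"], entry["status"], entry["size"])
```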

  • Splunk: dashboard UI
  • file performance

read/write, plain text/binary format, pandas/Scala, uncompressed/compressed

(pandas: no binary I/O in this comparison)

LZ4 compression performs about the same as raw I/O and better than gzip.
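A rough way to measure the compression cost yourself, using only the standard library. LZ4 is not in the Python standard library (it needs a third-party package), so this sketch times gzip only, and the repetitive synthetic input is an assumption that makes compression look better than it would on real data.

```python
import gzip
import time

# Synthetic, highly repetitive log-like text (real data compresses less well).
raw = b'127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326\n' * 50_000


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


compressed, compress_seconds = timed(gzip.compress, raw)
restored, decompress_seconds = timed(gzip.decompress, compressed)

print(f"raw bytes:        {len(raw)}")
print(f"gzip bytes:       {len(compressed)}")
print(f"compress time:    {compress_seconds:.4f} s")
print(f"decompress time:  {decompress_seconds:.4f} s")
```

Timings vary by machine, so only the relative sizes and the round trip are stable results.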

 

posted on 2017-09-27 00:41  satyrs
