Spark SQL: Loading Data

Load Data
1) RDD → DataFrame/Dataset
2) Local → Cloud (HDFS/S3)


Loading data as RDDs
val masterLog = sc.textFile("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/logs/spark-arthurlance-org.apache.spark.deploy.master.Master-1-ArthurdeMacBook-Pro.local.out")
val workerLog = sc.textFile("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/logs/spark-arthurlance-org.apache.spark.deploy.worker.Worker-1-ArthurdeMacBook-Pro.local.out")
val allLog = sc.textFile("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/logs/*out*")

masterLog.count
workerLog.count
allLog.count

The problem: what if we want to query this data with SQL?

import org.apache.spark.sql.Row
// Wrap each log line in a Row so a schema can be attached
val masterRDD = masterLog.map(x => Row(x))
import org.apache.spark.sql.types._
// A one-column schema: a single nullable string field named "line"
val schemaString = "line"

val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

val masterDF = spark.createDataFrame(masterRDD, schema)
masterDF.show
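
With a DataFrame in hand, the SQL question above is answered by registering it as a temporary view. A minimal sketch — the view name master_log and the LIKE filter are just illustrative:

// Register the DataFrame so it can be referenced from SQL
masterDF.createOrReplaceTempView("master_log")
spark.sql("select * from master_log where line like '%ERROR%'").show

Alternatively, spark.read.textFile returns a Dataset[String] directly, skipping the manual Row/schema step:

val masterDS = spark.read.textFile("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/logs/spark-arthurlance-org.apache.spark.deploy.master.Master-1-ArthurdeMacBook-Pro.local.out")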


JSON/Parquet
val usersDF = spark.read.format("parquet").load("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")
usersDF.show
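
JSON loads the same way; the Spark distribution also ships a sample people.json next to users.parquet. A sketch, assuming the same install layout as above:

val peopleDF = spark.read.format("json").load("file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")
peopleDF.show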


SQL can also be run on the file directly, without creating a DataFrame first:

spark.sql("select * from parquet.`file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet`").show
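
The same file-path SQL works for other formats by swapping the prefix, e.g. json (path assumed as in the sketch above):

spark.sql("select * from json.`file:///Users/arthurlance/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json`").show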

(Apache Drill, another big data processing framework, offers similar SQL-directly-on-files querying.)


Reading data from the cloud: HDFS/S3
val hdfsRDD = sc.textFile("hdfs://path/file")
val s3RDD = sc.textFile("s3a://bucket/object")
(Hadoop offers two S3 filesystem schemes: s3a, the current connector, and s3n, the older deprecated one.)

The same sources load as DataFrames through the reader API:

spark.read.format("text").load("hdfs://path/file")
spark.read.format("text").load("s3a://bucket/object")
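
Accessing S3 typically also requires the hadoop-aws library on the classpath plus credentials. A sketch with placeholder keys — the fs.s3a.* names are standard Hadoop S3A configuration properties:

// Placeholder credentials: substitute your own access/secret key pair
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
val s3DF = spark.read.format("text").load("s3a://bucket/object")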

 
