order_created.txt (tab-separated; columns: order number, order creation time)
10703007267488	2014-05-01 06:01:12.334+01
10101043505096	2014-05-01 07:28:12.342+01
10103043509747	2014-05-01 07:50:12.33+01
10103043501575	2014-05-01 09:27:12.33+01
10104043514061	2014-05-01 09:03:12.324+01
order_picked.txt (tab-separated; columns: order number, order pick time)
10703007267488	2014-05-01 07:02:12.334+01
10101043505096	2014-05-01 08:29:12.342+01
10103043509747	2014-05-01 10:55:12.33+01
Upload the two files above to HDFS:
hadoop fs -put order_created.txt /data/order_created.txt
hadoop fs -put order_picked.txt /data/order_picked.txt
Join and query the two files with Spark SQL (run in spark-shell, where sc is already defined):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Import the context's implicits so the case-class RDDs can be registered as tables
import hiveContext._

case class OrderCreated(order_no: String, create_date: String)
case class OrderPicked(order_no: String, picked_date: String)

// Each line is split on the tab into (order number, timestamp)
val order_created = sc.textFile("/data/order_created.txt").map(_.split("\t")).map(d => OrderCreated(d(0), d(1)))
val order_picked = sc.textFile("/data/order_picked.txt").map(_.split("\t")).map(d => OrderPicked(d(0), d(1)))

order_created.registerTempTable("t_order_created")
order_picked.registerTempTable("t_order_picked")

// Manually set the number of Spark SQL shuffle tasks
hiveContext.setConf("spark.sql.shuffle.partitions", "10")

hiveContext.sql("select a.order_no, a.create_date, b.picked_date from t_order_created a join t_order_picked b on a.order_no = b.order_no").collect.foreach(println)
The result is as follows:
[10101043505096,2014-05-01 07:28:12.342+01,2014-05-01 08:29:12.342+01]
[10703007267488,2014-05-01 06:01:12.334+01,2014-05-01 07:02:12.334+01]
[10103043509747,2014-05-01 07:50:12.33+01,2014-05-01 10:55:12.33+01]
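
Only three of the five created orders appear in the result because the query is an inner join: orders 10103043501575 and 10104043514061 have no row in order_picked.txt, so they are dropped. The following is a minimal plain-Scala sketch of the same join logic, not the Spark implementation. It needs no cluster and can be pasted into the Scala REPL or spark-shell. The helper `fields` and the names `createdLines`, `pickedLines`, `pickedByNo` and `joined` are hypothetical, introduced only for illustration; the sample lines are copied from the two files above.

```scala
// Plain-Scala illustration of the inner join; this does not use Spark.
case class OrderCreated(order_no: String, create_date: String)
case class OrderPicked(order_no: String, picked_date: String)

// Hypothetical helper: split one tab-separated line into its two fields,
// mirroring the split("\t") step in the Spark job.
def fields(line: String): (String, String) = {
  val d = line.split("\t")
  (d(0), d(1))
}

// Sample lines copied from order_created.txt
val createdLines = Seq(
  "10703007267488\t2014-05-01 06:01:12.334+01",
  "10101043505096\t2014-05-01 07:28:12.342+01",
  "10103043509747\t2014-05-01 07:50:12.33+01",
  "10103043501575\t2014-05-01 09:27:12.33+01",
  "10104043514061\t2014-05-01 09:03:12.324+01"
)

// Sample lines copied from order_picked.txt
val pickedLines = Seq(
  "10703007267488\t2014-05-01 07:02:12.334+01",
  "10101043505096\t2014-05-01 08:29:12.342+01",
  "10103043509747\t2014-05-01 10:55:12.33+01"
)

val created = createdLines.map(fields).map { case (no, ts) => OrderCreated(no, ts) }
val picked = pickedLines.map(fields).map { case (no, ts) => OrderPicked(no, ts) }

// Index pick times by order number, then keep only the created orders
// that have a matching pick record (inner-join semantics).
val pickedByNo: Map[String, String] = picked.map(p => p.order_no -> p.picked_date).toMap
val joined: Seq[(String, String, String)] = created.flatMap { c =>
  pickedByNo.get(c.order_no).map(pickedDate => (c.order_no, c.create_date, pickedDate))
}

joined.foreach(println)
```

This sketch keeps the rows in the order of order_created.txt, whereas the Spark result above comes back in whatever order the shuffle produces, so the row order can differ even though the rows are the same.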