Generating tables with many rows or partitions in Spark

Connect to spark-shell

Generate data with a specified number of rows

scala> spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.saveAsTable("t1")
scala> spark.range(40000000L).selectExpr("id % 8000 as c", "id % 8000 as d").write.saveAsTable("t2")

scala> spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain
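To see what the modulo expressions do to the data, here is a minimal scaled-down sketch. The names small1 and small2 and the smaller sizes are made up for illustration, and the data stays in memory instead of being saved as tables. Each id produces one row, but `id % 100` only takes 100 distinct values, so a large table can have very few distinct (a, b) pairs.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("row-gen-sketch").getOrCreate()

// Hypothetical scaled-down stand-ins for t1 and t2 (not the sizes used above).
val small1 = spark.range(50000L).selectExpr("id % 100 as a", "id % 100 as b")
val small2 = spark.range(40000L).selectExpr("id % 80 as c", "id % 80 as d")

val rows1  = small1.count()                    // one row per generated id
val pairs1 = small1.distinct().count()         // distinct (a, b) pairs
val common = small1.intersect(small2).count()  // distinct pairs present in both, matched by column position

println(s"rows=$rows1 distinctPairs=$pairs1 commonPairs=$common")
```

The same reasoning applies to the full-size tables: the row count follows the argument of range, while the number of distinct pairs follows the modulus.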

 

Generate data with a specified number of rows, in Parquet format

scala> spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.format("parquet").mode("overwrite").saveAsTable("t1")
scala> spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.format("parquet").mode("overwrite").saveAsTable("t2")
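A quick way to check which format a table ended up with is sketched below. It assumes it runs in the same spark-shell session and that a table named t1 exists; the DESCRIBE step is skipped otherwise. When format(...) is omitted, saveAsTable falls back to the `spark.sql.sources.default` setting, which is parquet unless the configuration has been changed.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("format-check-sketch").getOrCreate()

// Data source used by write.saveAsTable when no format(...) is given.
val defaultSource = spark.conf.get("spark.sql.sources.default")
println(s"default source: $defaultSource")

// Show the provider recorded in the table metadata, if t1 has been created.
if (spark.catalog.tableExists("t1")) {
  spark.sql("DESCRIBE EXTENDED t1").where("col_name = 'Provider'").show(false)
}
```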

 

Generate data with a specified number of partitions

scala> spark.range(10000).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").saveAsTable("iteblog_tab1")
scala> spark.range(100).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").saveAsTable("iteblog_tab2")

scala> spark.sql("SELECT * FROM iteblog_tab1 t1 JOIN iteblog_tab2 t2 ON t1.k = t2.k AND t2.id < 2").show()
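Because k is a copy of id, partitionBy("k") writes one partition per generated row, so the argument of range is effectively the partition count. A minimal scaled-down sketch of that idea follows. The table name part_sketch_tab and the size 20 are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("partition-gen-sketch").getOrCreate()

// Hypothetical small table: 20 ids, each written into its own partition directory k=<id>.
spark.range(20).select(col("id"), col("id").as("k"))
  .write.partitionBy("k").format("parquet").mode("overwrite").saveAsTable("part_sketch_tab")

// Count the distinct partition values by reading the table back.
val partitionValues = spark.table("part_sketch_tab").select("k").distinct().count()
println(s"partition values: $partitionValues")
```

Appending .explain to the join query instead of .show() prints the physical plan, which is one way to inspect how the partitioned table is scanned.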

 

Reference: https://blog.51cto.com/u_15127589/2678267

posted @ 2021-04-20 15:18  七彩木兰