Generating tables with many rows or partitions in Spark
Connect to spark-shell
Generate data with a specified number of rows
scala> spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.saveAsTable("t1")
scala> spark.range(40000000L).selectExpr("id % 8000 as c", "id % 8000 as d").write.saveAsTable("t2")
scala> spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain
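As a quick sanity check on the two tables above, the following sketch counts rows and distinct keys. It assumes it is pasted into the same spark-shell session (where `spark` is predefined) right after t1 and t2 have been written as shown; the val names are illustrative only.

```scala
// Assumes a spark-shell session where `spark` is predefined and t1/t2 exist
// with the columns written above (a, b in t1; c, d in t2).
val t1Rows = spark.table("t1").count()                         // rows written by spark.range(50000000L)
val t1Keys = spark.table("t1").select("a").distinct().count()  // id % 10000 yields 10000 distinct values
val t2Keys = spark.table("t2").select("c").distinct().count()  // id % 8000 yields 8000 distinct values

// Pairs (i, i) appear in both tables only for i in 0 until 8000.
val common = spark.sql(
  "SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").count()

println(s"t1 rows=$t1Rows, t1 keys=$t1Keys, t2 keys=$t2Keys, common=$common")
```

Since both columns of each table hold the same value, the INTERSECT should return 8000 rows, which is small next to the tens of millions of input rows the plan has to scan.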
Generate data with a specified number of rows, stored in Parquet format
scala> spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.format("parquet").saveAsTable("t1")
scala> spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.format("parquet").saveAsTable("t2")
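If t1 and t2 already exist from the earlier commands, saveAsTable fails under the default save mode (error if the table exists). Below is a sketch of one way around this, assuming that overwriting the existing table is acceptable. It also prints the session's default data source, which is parquet unless it has been configured otherwise; the val name is illustrative only.

```scala
// Assumes a spark-shell session where `spark` is predefined.
// Data source used when .format(...) is omitted (parquet unless overridden).
val defaultSource = spark.conf.get("spark.sql.sources.default")
println(defaultSource)

// mode("overwrite") replaces an existing table instead of failing.
spark.range(50000000L)
  .selectExpr("id % 10000 as a", "id % 10000 as b")
  .write.format("parquet").mode("overwrite").saveAsTable("t1")
```

The same `.mode("overwrite")` call can be added to the t2 command.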
Generate data with a specified number of partitions
scala> spark.range(10000).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").saveAsTable("iteblog_tab1")
scala> spark.range(100).select(col("id"), col("id").as("k")).write.partitionBy("k").format("parquet").mode("overwrite").saveAsTable("iteblog_tab2")
scala> spark.sql("SELECT * FROM iteblog_tab1 t1 JOIN iteblog_tab2 t2 ON t1.k = t2.k AND t2.id < 2").show()
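Because k is a copy of id, partitionBy("k") produces one partition per row's key: 10000 for iteblog_tab1 and 100 for iteblog_tab2. The sketch below checks this by counting distinct partition values and then inspects the join. It assumes the same spark-shell session with both tables already written; the val names are illustrative only.

```scala
// Assumes a spark-shell session where `spark` is predefined.
import org.apache.spark.sql.functions.col // already in scope in spark-shell; harmless to repeat

// One partition per distinct value of k.
val tab1Parts = spark.table("iteblog_tab1").select(col("k")).distinct().count()
val tab2Parts = spark.table("iteblog_tab2").select(col("k")).distinct().count()
println(s"iteblog_tab1 partitions=$tab1Parts, iteblog_tab2 partitions=$tab2Parts")

// Only k = 0 and k = 1 survive the t2.id < 2 filter, so the join returns 2 rows.
val joined = spark.sql(
  "SELECT * FROM iteblog_tab1 t1 JOIN iteblog_tab2 t2 ON t1.k = t2.k AND t2.id < 2")
println(joined.count())

// Print the physical plan to see how the partitioned tables are scanned.
joined.explain()
```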