spark.sql.shuffle.partitions 和 spark.default.parallelism 的区别
在关于spark任务并行度的设置中,有两个参数我们会经常遇到,spark.sql.shuffle.partitions 和 spark.default.parallelism, 那么这两个参数到底有什么区别的?
首先,让我们来看下它们的定义
Property Name | Default | Meaning |
---|---|---|
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager: - Local mode: number of cores on the local machine - Mesos fine grained mode: 8 - Others: total number of cores on all executor nodes or 2, whichever is larger |
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. |
看起来它们的定义似乎也很相似,但在实际测试中,
- spark.default.parallelism只有在处理RDD时才会起作用,对Spark SQL的无效。
- spark.sql.shuffle.partitions则是对sparks SQL专用的设置