|NO.Z.00031|——————————|BigDataEnd|——|Hadoop&Spark.V05|——|Spark.v05|sparkcore|Advanced RDD Programming & RDD Partition Count|

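I. RDD Partitions

### --- RDD partitions

~~~     spark.default.parallelism is the default parallelism; in distributed modes it is never less than 2.
~~~     When it is not explicitly set in spark-defaults.conf, its value is chosen by the following rules: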

II. RDD Partition Examples

### --- Local mode

~~~     # spark-shell --master local[N]  =>  spark.default.parallelism = N
[root@hadoop02 ~]# spark-shell --master local    # spark.default.parallelism = 1

### --- Pseudo-distributed mode (x = number of executors started on this machine, y = cores per executor, z = memory per executor)

~~~     # spark-shell --master local-cluster[x,y,z]
~~~     spark.default.parallelism = x * y

### --- Distributed mode (YARN & standalone)

~~~     spark.default.parallelism = max(total number of cores held by the application's executors, 2)
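The rules for the three modes can be sketched in plain Scala without a running cluster. This is a minimal model rather than Spark's own code: the `Master` hierarchy, the object name, and the `configured` parameter (standing in for an explicit `spark.default.parallelism` setting) are hypothetical names introduced for illustration, and the memory argument `z` of `local-cluster[x,y,z]` is left out because it does not affect the result.

```scala
object DefaultParallelismSketch {
  // Hypothetical model of the deployment modes described above.
  sealed trait Master
  final case class Local(threads: Int) extends Master            // local[N]; plain "local" is Local(1)
  final case class LocalCluster(executors: Int, coresPerExecutor: Int) extends Master // local-cluster[x,y,z]
  final case class Distributed(totalExecutorCores: Int) extends Master // YARN or standalone

  // An explicitly configured value wins; otherwise the mode decides.
  def defaultParallelism(master: Master, configured: Option[Int] = None): Int =
    configured.getOrElse {
      master match {
        case Local(n)            => n
        case LocalCluster(x, y)  => x * y
        case Distributed(cores)  => math.max(cores, 2)
      }
    }

  // Same expression as SparkContext.defaultMinPartitions: capped at 2.
  def defaultMinPartitions(parallelism: Int): Int = math.min(parallelism, 2)

  def main(args: Array[String]): Unit = {
    val p = defaultParallelism(LocalCluster(2, 3))
    println(s"defaultParallelism = $p, defaultMinPartitions = ${defaultMinPartitions(p)}")
  }
}
```

Modeling the configured value as an `Option` keeps the precedence visible: the mode-specific rule is consulted only when nothing was set explicitly.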

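### --- Notes:

~~~     The Spark documentation describes the distributed-mode default as "total number of cores on all executor nodes or 2, whichever is larger".
~~~     The rules above determine the default value of spark.default.parallelism when it is not explicitly set in spark-defaults.conf.
~~~     If it is set, spark.default.parallelism = the configured value.
~~~     When SparkContext is initialized, two further values are derived
~~~     from the resulting spark.default.parallelism:

~~~     # Partition count for RDDs created from a collection
sc.defaultParallelism = spark.default.parallelism

~~~     # Minimum partition count for RDDs created from a file
sc.defaultMinPartitions = min(spark.default.parallelism, 2)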

### --- Source excerpt: once the parameters above are determined, the RDD partition count can be computed.

~~~     # Source excerpt: SparkContext.scala
~~~     # line 2363
  /**
   * Default min number of partitions for Hadoop RDDs when not given by user
   * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
   * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
   */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

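III. Ways to create an RDD:

### --- From a collection
~~~     Note: put simply, the RDD partition count equals the total number of cores.

~~~     # If no partition count is given when the RDD is created, the RDD's partition count = sc.defaultParallelism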
scala> val rdd = sc.parallelize(1 to 100)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24

scala> rdd.getNumPartitions
res29: Int = 3
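To make the three partitions above more concrete, here is a small sketch of how a collection of 100 elements could be cut into 3 contiguous slices. The index arithmetic follows, as far as I recall, what Spark's `ParallelCollectionRDD` does internally, but treat it as an illustration rather than the exact implementation; `SliceSketch` and `slicePositions` are hypothetical names.

```scala
object SliceSketch {
  // Returns the [start, end) index range of each slice.
  def slicePositions(length: Long, numSlices: Int): Seq[(Int, Int)] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // 100 elements, 3 partitions (the defaultParallelism seen in the session above)
    slicePositions(100, 3).foreach { case (start, end) =>
      println(s"[$start, $end) -> ${end - start} elements")
    }
  }
}
```

With integer division the slices differ in size by at most one element, so 100 elements over 3 partitions come out as 33, 33, and 34.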

### --- Via textFile

~~~     # Prepare the input file
[root@hadoop02 ~]# hdfs dfs -ls /user/root/data/start0000.big.log
-rw-r--r--   5 root supergroup          0 2021-10-19 21:58 /user/root/data/start0000.big.log

scala> val rdd = sc.textFile("data/start0000.big.log")
rdd: org.apache.spark.rdd.RDD[String] = data/start0000.big.log MapPartitionsRDD[32] at textFile at <console>:24

scala> rdd.getNumPartitions
res32: Int = 2

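### --- If no partition count is specified:

~~~     Local file: RDD partition count = max(number of local file splits, sc.defaultMinPartitions)
~~~     HDFS file: RDD partition count = max(number of HDFS blocks in the file, sc.defaultMinPartitions)

### --- Notes:

~~~     Number of local file splits = local file size / 32 MB
~~~     When reading an HDFS file, a specified partition count smaller than the file's HDFS block count has no effect.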
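The rules above can be checked with a rough, single-file model of the split computation. The sketch below assumes the behaviour of Hadoop's older `mapred` `FileInputFormat` (split size = min(file size / minPartitions, block size), with a 10% tolerance on the last split) and a minimum split size of 1 byte; the object and method names are hypothetical, and real jobs may differ depending on Hadoop version and configuration.

```scala
object TextFilePartitionSketch {
  val MB: Long = 1024L * 1024L
  private val SplitSlop = 1.1 // the last split may be up to 10% larger than splitSize

  // Estimates the number of splits (and so partitions) for one file.
  def estimateSplits(fileSize: Long, blockSize: Long, minPartitions: Int): Int = {
    require(fileSize >= 0 && blockSize > 0 && minPartitions >= 0)
    if (fileSize == 0) 1 // assumption: an empty file yields one empty split
    else {
      val goalSize = fileSize / math.max(minPartitions, 1)
      val splitSize = math.max(1L, math.min(goalSize, blockSize))
      var remaining = fileSize
      var splits = 0
      while (remaining.toDouble / splitSize > SplitSlop) {
        splits += 1
        remaining -= splitSize
      }
      if (remaining > 0) splits += 1
      splits
    }
  }

  def main(args: Array[String]): Unit = {
    // 100 MB local file, 32 MB local block size, defaultMinPartitions = 2
    println(estimateSplits(100 * MB, 32 * MB, 2))
    // 300 MB HDFS file, 128 MB blocks, requested minimum of 1
    println(estimateSplits(300 * MB, 128 * MB, 1))
  }
}
```

In this model a 300 MB file with 128 MB blocks yields 3 splits whether the requested minimum is 1 or 2, which matches the note that a requested count below the block count has no effect, while a 10 MB file still gets 2 partitions because of `sc.defaultMinPartitions`.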
