Creating RDDs

Several ways to create an RDD

1. parallelize: the number of partitions can be specified

scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> rdd1.collect
res14: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd1.getNumPartitions
res15: Int = 2

scala> val rdd1 = sc.parallelize(1 to 10, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> rdd1.collect
res16: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd1.getNumPartitions
res15: Int = 4
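To see how parallelize actually splits the data across partitions, glom can collect each partition into its own array. A quick spark-shell check (with an explicit partition count of 2 so the split is deterministic):

scala> val rdd1 = sc.parallelize(1 to 10, 2)
scala> rdd1.glom.collect
// Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))

Each inner array is the content of one partition, which is handy when checking whether data is evenly balanced.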

2. range: half-open interval [start, end), with the step defaulting to 1

scala> val rdd1 = sc.range(1,11)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[17] at range at <console>:24

scala> rdd1.collect
res20: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd1 = sc.range(1,11,2)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[1] at range at <console>:24

scala> rdd1.collect
res0: Array[Long] = Array(1, 3, 5, 7, 9)  
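range also accepts a negative step for descending sequences (and an optional numSlices argument for the partition count), for example:

scala> val rdd1 = sc.range(10, 0, -2)
scala> rdd1.collect
// Array(10, 8, 6, 4, 2)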

3. makeRDD: when a partition count is specified it behaves the same as parallelize. Per the official doc comment, makeRDD also has an overload (taking no partition count) that creates one partition per collection item, with optional location preferences for each, which helps with later placement tuning.

复制代码
scala> val lst = List(1,3,4,5,6,7,9)
lst: List[Int] = List(1, 3, 4, 5, 6, 7, 9)

scala> val rdd1 = sc.parallelize(lst)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26

scala> rdd1.collect
res3: Array[Int] = Array(1, 3, 4, 5, 6, 7, 9)

scala> rdd1.getNumPartitions
res4: Int = 2

scala> val rdd1 = sc.makeRDD(lst)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:26

scala> rdd1.collect
res5: Array[Int] = Array(1, 3, 4, 5, 6, 7, 9)

scala> rdd1.getNumPartitions
res6: Int = 2
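The overload mentioned above takes a Seq of (item, preferred hosts) pairs and creates one partition per item. A sketch, with the host names made up purely for illustration:

scala> val rdd2 = sc.makeRDD(Seq((1 to 5, Seq("host1")), (6 to 10, Seq("host2"))))
scala> rdd2.getNumPartitions
// Int = 2  (one partition per (item, locations) pair)
scala> rdd2.preferredLocations(rdd2.partitions(0))
// the location preferences recorded for partition 0, e.g. List(host1)

Note that each element of rdd2 is the whole Range (1 to 5), not the individual numbers; this overload is about placing data near specific nodes, not about flattening collections.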

4. Loading data from the local file system

scala> val rdd1 = sc.textFile("file:///data/hello.txt")
rdd1: org.apache.spark.rdd.RDD[String] = file:///data/hello.txt MapPartitionsRDD[9] at textFile at <console>:24

scala> rdd1.collect
res9: Array[String] = Array(hello spark, this is a local file, hello zhangcong)

5. Loading data from a distributed file system, using HDFS as an example

scala> val rdd1 = sc.textFile("hdfs:///data/hello.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs:///data/hello.txt MapPartitionsRDD[9] at textFile at <console>:24

scala> rdd1.collect
res9: Array[String] = Array(hello spark, this is a hdfs file, hello zhangcong)
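textFile also takes an optional minPartitions argument. It is a lower bound, not an exact count: the actual number of partitions depends on how the input splits (e.g. HDFS blocks) work out, so it may end up higher than requested:

scala> val rdd1 = sc.textFile("hdfs:///data/hello.txt", 4)
scala> rdd1.getNumPartitions
// at least 4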

6. Creating an RDD from another RDD: this is essentially transforming one RDD into a new one; see RDD transformations for details
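A minimal example of deriving new RDDs from an existing one (map and filter themselves are covered under RDD transformations):

scala> val rdd1 = sc.parallelize(1 to 5)
scala> val rdd2 = rdd1.map(_ * 2)       // new RDD: every element doubled
scala> val rdd3 = rdd2.filter(_ > 5)    // new RDD: keep only elements greater than 5
scala> rdd3.collect
// Array(6, 8, 10)

Each transformation returns a new immutable RDD; nothing is computed until an action such as collect is called.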

Posted by NeilCheung514