Creating RDDs
Several ways to create an RDD
1. parallelize — the number of partitions can be specified
scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> rdd1.collect
res14: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd1.getNumPartitions
res15: Int = 2

scala> val rdd1 = sc.parallelize(1 to 10, 4)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24

scala> rdd1.collect
res16: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd1.getNumPartitions
res17: Int = 4
2. range — the interval is closed on the left and open on the right; the step defaults to 1
scala> val rdd1 = sc.range(1, 11)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[17] at range at <console>:24

scala> rdd1.collect
res20: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val rdd1 = sc.range(1, 11, 2)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[1] at range at <console>:24

scala> rdd1.collect
res0: Array[Long] = Array(1, 3, 5, 7, 9)
3. makeRDD — when a partition count is given, the official comment notes it is identical to parallelize; a separate overload instead takes location preferences (preferred hostnames) for each element, creating one partition per element, which can help the scheduler with data locality later on
scala> val lst = List(1,3,4,5,6,7,9)
lst: List[Int] = List(1, 3, 4, 5, 6, 7, 9)

scala> val rdd1 = sc.parallelize(lst)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:26

scala> rdd1.collect
res3: Array[Int] = Array(1, 3, 4, 5, 6, 7, 9)

scala> rdd1.getNumPartitions
res4: Int = 2

scala> val rdd1 = sc.makeRDD(lst)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:26

scala> rdd1.collect
res5: Array[Int] = Array(1, 3, 4, 5, 6, 7, 9)

scala> rdd1.getNumPartitions
res6: Int = 2
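The location-preference variant is the overload `makeRDD(seq: Seq[(T, Seq[String])])`: each element is paired with the hostnames of the nodes that should preferably process it, and Spark creates one partition per element. A minimal sketch for spark-shell — the hostnames host1/host2 are hypothetical placeholders:

```scala
// Hypothetical hostnames; replace with real worker hostnames in your cluster.
val data = Seq(
  (1, Seq("host1")),
  (2, Seq("host2")),
  (3, Seq("host1", "host2"))
)
val rdd = sc.makeRDD(data)                 // one partition per element
rdd.getNumPartitions                       // 3
rdd.preferredLocations(rdd.partitions(0))  // hosts preferred for partition 0
```

Note that the preferences only guide task scheduling (data locality); they do not move or replicate data.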
4. Loading data from the local file system
scala> val rdd1 = sc.textFile("file:///data/hello.txt")
rdd1: org.apache.spark.rdd.RDD[String] = file:///data/hello.txt MapPartitionsRDD[9] at textFile at <console>:24

scala> rdd1.collect
res9: Array[String] = Array(hello spark, this is a local file, hello zhangcong)
5. Loading data from a distributed file system, using HDFS as an example
scala> val rdd1 = sc.textFile("hdfs:///data/hello.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs:///data/hello.txt MapPartitionsRDD[9] at textFile at <console>:24

scala> rdd1.collect
res9: Array[String] = Array(hello spark, this is a hdfs file, hello zhangcong)
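Like parallelize, textFile also lets you influence partitioning: its optional second argument is a *minimum* number of partitions (Spark may create more, depending on the input splits). A sketch assuming the same HDFS file as above:

```scala
// The second argument is a lower bound on partitions, not an exact count.
val rdd1 = sc.textFile("hdfs:///data/hello.txt", 4)
rdd1.getNumPartitions  // at least 4
```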
6. Creating an RDD from an existing RDD — essentially transforming one RDD into another; see the section on RDD transformations for details
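For example, applying a transformation such as map to an existing RDD yields a new RDD (a minimal sketch for spark-shell, where `sc` is already defined):

```scala
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = rdd1.map(_ * 2)  // a new RDD derived from rdd1; nothing runs yet (lazy)
rdd2.collect                // Array(2, 4, 6, 8, 10)
```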