Several ways to read files in Spark
The difference between spark.read.textFile and sc.textFile
val rdd1 = spark.read.textFile("hdfs://han02:9000/words.txt") // returns a Dataset[String] (Spark 2.x API)
val rdd2 = sc.textFile("hdfs://han02:9000/words.txt") // returns an RDD[String]
Word counting with each of them:
rdd2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false) // RDD API
rdd1.flatMap(x => x.split(" ")).groupByKey(x => x).count() // Dataset API
When collected, the RDD version returns Array[(String, Int)], while the Dataset version returns Array[(String, Long)].
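A minimal self-contained sketch of both word counts, assuming a local SparkSession for testing and the same HDFS path as above (spark.read.textFile requires Spark 2.x):

import org.apache.spark.sql.SparkSession

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountDemo")
      .master("local[*]") // assumption: run locally for testing
      .getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._ // encoders required by the Dataset transformations

    val path = "hdfs://han02:9000/words.txt"

    // RDD API: reduceByKey sums Int counts, sortBy orders by count descending
    val rddCounts: Array[(String, Int)] =
      sc.textFile(path)
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .sortBy(_._2, ascending = false)
        .collect()

    // Dataset API: groupByKey + count yields Long counts
    val dsCounts: Array[(String, Long)] =
      spark.read.textFile(path)
        .flatMap(_.split(" "))
        .groupByKey(x => x)
        .count()
        .collect()

    rddCounts.foreach(println)
    dsCounts.foreach(println)
    spark.stop()
  }
}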
textFile(path, num) // num sets the minimum number of partitions; by default Spark creates one partition per HDFS block, so a file larger than one block (128 MB) is split across several partitions
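A quick sketch to see the effect of num (the path is the placeholder from above; Spark may create more partitions than requested, never fewer):

val rdd = sc.textFile("hdfs://han02:9000/words.txt", 4)
println(rdd.getNumPartitions) // at least 4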
1. Read a single file from the current directory:
val path = "Current.txt" // file in the current folder
val rdd1 = sc.textFile(path, 2)
2. Read multiple files from the current directory (comma-separated paths):
val path = "Current1.txt,Current2.txt" // files in the current folder
val rdd1 = sc.textFile(path, 2)
3. Read a single file from the local filesystem:
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/README.md" // local file
val rdd1 = sc.textFile(path, 2)
4. Read the contents of a local folder:
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/" // local folder
val rdd1 = sc.textFile(path, 2)
5. Read multiple local files:
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-scala.txt,file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-spire.txt" // comma-separated local files
val rdd1 = sc.textFile(path, 2)
6. Read the contents of multiple local folders:
val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*" // all files in all subfolders
val rdd1 = sc.textFile(path, 2)
val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt" // only files with the .txt suffix
val rdd1 = sc.textFile(path, 2)
7. Use wildcards to read files with similar names:
for (i <- 1 to 2) {
  val rdd1 = sc.textFile(s"/root/application/temp/people$i*", 2)
}
Note: files stored on Google could not be read this way.
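When a wildcard pattern does not match what you expect, one way to check which files it actually picked up is sc.wholeTextFiles, which returns (path, content) pairs; a sketch with a placeholder glob:

val matched = sc.wholeTextFiles("/root/application/temp/people*")
matched.keys.collect().foreach(println) // prints the full path of every matched file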