Using textFile in Spark to Read External Data
1. The textFile Source Code
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
Parameter analysis:

path: String is a URI. It can point to HDFS, the local file system (the file must be available on all nodes), or any other Hadoop-supported file system. The return value is an RDD of Strings (RDD[String]); that is, the internal form of the RDD is an Iterator[(String)], one element per line.

minPartitions defaults to defaultMinPartitions = math.min(defaultParallelism, 2) and specifies the minimum number of partitions. If you do not pass it, then even when more than 2 cores are available the default is still 2 (see the sketch below).

When the data is larger than 128 MB, Spark creates one split (partition) per block (since Hadoop 2.x a block is 128 MB).
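A minimal sketch of how the minPartitions argument affects the resulting partition count; the HDFS path and the value 4 are only illustrative assumptions:

// Uses the default minPartitions = math.min(defaultParallelism, 2)
val rdd1 = sc.textFile("hdfs://master:9000/data.txt")
println(rdd1.partitions.length)

// Explicitly request at least 4 partitions; for a large file Spark may still
// create more partitions than this, one per HDFS block
val rdd2 = sc.textFile("hdfs://master:9000/data.txt", 4)
println(rdd2.partitions.length)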
2. Code Examples
// 1. A relative path: a file in the current directory
val path = "Current.txt"

// 2. Read multiple files from the current directory (comma-separated)
val path = "Current1.txt,Current2.txt"

// 3. A single file on the local file system
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/README.md"

// 4. A directory on the local file system
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/"

// 5. Read multiple files from the local file system (comma-separated)
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-scala.txt,file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-spire.txt"

// 6. Read files under multiple directories on the local file system
val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*"

// 7. Use a wildcard to read files with the same suffix under a path
val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt"

// 8. Read a file from HDFS
val path = "hdfs://master:9000/examples/examples/src/main/resources/people.txt"

// Pass the path argument to read the corresponding file(s)
val rdd = sc.textFile(path, 2)
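Putting it together, here is a minimal runnable sketch; the application name, the local[2] master URL, and the README.md path are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object TextFileDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TextFileDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Read a local file and request at least 2 partitions
    val rdd = sc.textFile("file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/README.md", 2)

    // Each element of the RDD is one line of the file
    println("lines: " + rdd.count())
    println("partitions: " + rdd.partitions.length)

    sc.stop()
  }
}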