Spark: Data Reading and Saving
Reading a text file in Scala:
val input = sc.textFile("..")
To read multiple files under one directory, use the wholeTextFiles() method, which returns a pair RDD of (fileName, fileContents).
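A minimal sketch of using wholeTextFiles(): the directory path and file contents below are hypothetical. Because each element is a (fileName, fileContents) pair, per-file processing such as averaging the numbers in each file is straightforward:

```scala
// Hypothetical directory of files, each containing space-separated numbers.
// wholeTextFiles() returns a pair RDD of (fileName, fileContents).
val input = sc.wholeTextFiles("hdfs://host/salesFiles")

// Average the numbers in each file, keyed by file name.
val result = input.mapValues { contents =>
  val nums = contents.split(" ").map(_.toDouble)
  nums.sum / nums.size
}
```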
Saving a text file:
result.saveAsTextFile(outputFile)
Reading JSON in Scala (parsing each record with Jackson, skipping malformed records):
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature
...
case class Person(name: String, lovespandas: Boolean)

// Parse each line as JSON; drop records that fail to parse.
val result = input.flatMap(record => {
  try {
    Some(mapper.readValue(record, classOf[Person]))
  } catch {
    case e: Exception => None
  }
})
Reading CSV in Scala with textFile() (assuming no CSV field contains a newline):
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader

val input = sc.textFile(inputFile)
val result = input.map { line =>
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
}
If fields contain embedded newlines, read each file in full and then parse:
import scala.collection.JavaConverters._

case class Person(name: String, fa: String)

val input = sc.wholeTextFiles(inputFile)
val result = input.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  // readAll() returns a java.util.List, so convert it to a Scala collection.
  reader.readAll().asScala.map(x => Person(x(0), x(1)))
}
Reading a SequenceFile:
import org.apache.hadoop.io.{Text, IntWritable}

val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
  map { case (x, y) => (x.toString, y.get()) }
Saving a SequenceFile:
data.saveAsSequenceFile(outFile)
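A minimal end-to-end sketch: saveAsSequenceFile() is available on pair RDDs whose key and value types can be converted to Hadoop Writable types (e.g. String to Text, Int to IntWritable). The sample data below is made up for illustration:

```scala
import org.apache.hadoop.io.{Text, IntWritable}

// Hypothetical pair RDD; String/Int keys and values are converted
// to Text/IntWritable automatically when saving.
val data = sc.parallelize(List(("panda", 3), ("kay", 6), ("snail", 2)))
data.saveAsSequenceFile(outFile)

// Round trip: reading the file back yields the same (String, Int) pairs.
val readBack = sc.sequenceFile(outFile, classOf[Text], classOf[IntWritable])
  .map { case (k, v) => (k.toString, v.get()) }
```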