How to Generate Avro Files in Spark

One of the goals of looking into Spark is to replace MapReduce. A typical MapReduce use case at my company is generating Avro files and then loading them into Hive tables, so working out how to generate Avro files in Spark was an unavoidable step.

Since I am not familiar with Java and know very little about the various Hadoop frameworks, it took me five or six days to figure out how to generate Avro files in Spark. There were two difficulties. First, there is very little material online about generating Avro from Spark; whether on the official site or in the forums, there is either nothing at all or only a few scattered sentences, which was rough going for a beginner like me. Second, the code uses classes from both the Avro and Hadoop frameworks, whose own transitive dependencies can conflict with Spark's. I eventually got things working, but I still have not pinned down the direct cause of those conflicts.

Someone fluent in Java and familiar with the Hadoop ecosystem could probably knock this out in a day. At the risk of embarrassing myself, here is the code that currently runs successfully on my local machine; corrections are welcome. It has not yet been submitted to a cluster.

 

1. Code Snippet

2. pom File

3. Avro Schema and Test Text

 

1. Code Snippet

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.NullWritable
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import example.avro.User // generated by the Avro compiler from the schema in section 3

object TextTest extends App {
  System.setProperty("hadoop.home.dir", "D:\\bd\\software\\winutils")
  val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[4]")
  val sc = new SparkContext(sparkConf)

  // ********** Job 1: generate an Avro file from a built-in type (String) **********
  val li = List("A", "A", "C", "B")
  val lip = sc.parallelize(li, 4)
  // AvroKeyOutputFormat expects (AvroKey, NullWritable) pairs
  val liprdd = lip.map { x => (new AvroKey[String](x), NullWritable.get()) }
  val prdd = new PairRDDFunctions(liprdd)
  val schema = Schema.create(Schema.Type.STRING)
  val job1 = Job.getInstance
  // the output key schema must be set through AvroJob
  AvroJob.setOutputKeySchema(job1, schema)
  prdd.saveAsNewAPIHadoopFile("D:/002", classOf[AvroKey[String]], classOf[NullWritable],
      classOf[AvroKeyOutputFormat[String]], job1.getConfiguration)
  println("job1 done")

  // ********** Job 2: generate an Avro file from a generated Avro type (User) **********
  val av = sc.textFile("D://bdp//NewHuman//Users.txt", 5)
  val job2 = Job.getInstance
  AvroJob.setOutputKeySchema(job2, User.getClassSchema)
  val avArray = av.map(x => x.split(" "))
  // each input line is "name favorite_color favorite_number"
  val userP = avArray.map { x =>
    (new AvroKey[User](User.newBuilder()
        .setName(x(0))
        .setFavoriteColor(x(1))
        .setFavoriteNumber(Integer.parseInt(x(2)))
        .build()),
     NullWritable.get())
  }
  val avP = new PairRDDFunctions(userP)
  avP.saveAsNewAPIHadoopFile("D:/003", classOf[AvroKey[User]], classOf[NullWritable],
      classOf[AvroKeyOutputFormat[User]], job2.getConfiguration)
  println("job2 done")

  sc.stop()
}

The code demonstrates two scenarios: one writes in-memory data of a built-in type, and the other reads an external text file and writes Avro records. Two points deserve attention: the output schema must be set through AvroJob, and only a PairRDD has the saveAsNewAPIHadoopFile method, so any other RDD must first be converted into a PairRDD.
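In fact, since import org.apache.spark.SparkContext._ is already at the top of the file, the conversion to PairRDDFunctions also happens implicitly, so the explicit wrapping is optional; for example, job 1 could equally be written as:

// the implicit rddToPairRDDFunctions conversion supplies saveAsNewAPIHadoopFile
liprdd.saveAsNewAPIHadoopFile("D:/002", classOf[AvroKey[String]], classOf[NullWritable],
    classOf[AvroKeyOutputFormat[String]], job1.getConfiguration)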

In addition, the User class in the code above is auto-generated by the Avro compiler from the schema in section 3 and has to be brought into scope with an import. To verify that the write succeeded, the generated file can be read back, as shown in the sketch below.
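AvroKeyInputFormat is the input-side counterpart of AvroKeyOutputFormat. A minimal sketch of such a sanity check, reusing sc, User, and the D:/003 path from the code above:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val readJob = Job.getInstance
// tell the input format which schema to decode the records with
AvroJob.setInputKeySchema(readJob, User.getClassSchema)
val users = sc.newAPIHadoopFile(
  "D:/003",
  classOf[AvroKeyInputFormat[User]],
  classOf[AvroKey[User]],
  classOf[NullWritable],
  readJob.getConfiguration)
// extract plain strings before collecting; Avro 1.7 records are not Java-serializable
users.map(_._1.datum().getName.toString).collect().foreach(println)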

2. pom File

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <dependencies>

        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.7.0_67</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>16.0.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.2.2</version>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.4</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.8.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.2.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.2.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.7.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-mapred</artifactId>
            <version>1.7.4</version>
            <classifier>hadoop2</classifier>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.2.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.0.4</version>
        </dependency>

    </dependencies>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <build>
        <sourceDirectory>src</sourceDirectory>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.0</version>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

One thing to watch in this file: the order of the dependencies must not be changed, because the transitive dependencies pulled in by Spark and Avro can conflict, and when versions clash at the same depth Maven keeps whichever dependency is declared first. If you run into such a conflict, mvn dependency:tree shows exactly which transitive dependencies collide.

3. Avro Schema and Test Text

The Avro schema is:

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
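The User class used in section 1 is what the Avro compiler generates from this schema. If you prefer not to run the compiler, the same schema can also be consumed at runtime through Avro's generic API instead; a minimal sketch of building one record that way (the schema string is simply the JSON above inlined):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// parse the schema at runtime instead of compiling it into a class
val userSchema = new Schema.Parser().parse(
  """{"namespace": "example.avro", "type": "record", "name": "User",
    | "fields": [
    |   {"name": "name", "type": "string"},
    |   {"name": "favorite_number", "type": ["int", "null"]},
    |   {"name": "favorite_color", "type": ["string", "null"]}
    | ]}""".stripMargin)

// build one record by field name; no generated class required
val record: GenericRecord = new GenericData.Record(userSchema)
record.put("name", "Josh")
record.put("favorite_color", "Green")
record.put("favorite_number", 13)

Records built this way can be wrapped in AvroKey[GenericRecord] and saved with AvroKeyOutputFormat[GenericRecord] exactly as in section 1.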

The test input file Users.txt:


Josh Green 13
Ken Yellow 6
Xiao Orange 8
Gerry Black 12

 

posted @ 2015-08-11 15:13  boiledwater