Spark: Two Approaches to Converting a JavaRDD into a Dataset<Row>
Approach 1: JavaRDD to Dataset<Row>
Use an entity class as the schema definition: Spark infers the schema from the bean via reflection and converts the JavaRDD into a Dataset<Row>.
The Student.java entity class:
import java.io.Serializable;

// The bean must be Serializable and expose public getters/setters so that
// Spark can infer the schema via reflection.
@SuppressWarnings("serial")
public class Student implements Serializable {
    private String sid;
    private String sname;
    private int sage;

    public String getSid() { return sid; }
    public void setSid(String sid) { this.sid = sid; }
    public String getSname() { return sname; }
    public void setSname(String sname) { this.sname = sname; }
    public int getSage() { return sage; }
    public void setSage(int sage) { this.sage = sage; }

    @Override
    public String toString() {
        return "Student [sid=" + sid + ", sname=" + sname + ", sage=" + sage + "]";
    }
}
Implementation:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("Spark")
        .getOrCreate();
// The underlying JavaSparkContext (not used further in this snippet)
final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

// Each line of stuInfo.txt holds one student: sid,sname,sage
JavaRDD<String> source = spark.read().textFile("stuInfo.txt").javaRDD();

// Map each line to a Student bean
JavaRDD<Student> rowRDD = source.map(new Function<String, Student>() {
    public Student call(String line) throws Exception {
        String[] parts = line.split(",");
        Student stu = new Student();
        stu.setSid(parts[0]);
        stu.setSname(parts[1]);
        stu.setSage(Integer.valueOf(parts[2]));
        return stu;
    }
});

// Spark infers the schema from the Student bean via reflection
Dataset<Row> df = spark.createDataFrame(rowRDD, Student.class);
df.select("sid", "sname", "sage")
        .coalesce(1)
        .write()
        .mode(SaveMode.Append)
        .parquet("parquet.res");
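To sanity-check the result, the Parquet output can be read back with the regular DataFrame reader. A minimal sketch (the sample input lines are hypothetical, not from the original post):

// Hypothetical sample of stuInfo.txt, one student per line (sid,sname,sage):
//   1001,Alice,20
//   1002,Bob,21

// Read the Parquet directory back and inspect schema and contents
Dataset<Row> check = spark.read().parquet("parquet.res");
check.printSchema();
check.show();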
Approach 2: JavaRDD to Dataset<Row>
Build the schema programmatically with StructType/StructField:
// Imports in addition to those used in Approach 1:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("Spark")
        .getOrCreate();
final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

JavaRDD<String> source = spark.read().textFile("stuInfo.txt").javaRDD();

// Map each line to a generic Row instead of a bean
JavaRDD<Row> rowRDD = source.map(new Function<String, Row>() {
    public Row call(String line) throws Exception {
        String[] parts = line.split(",");
        String sid = parts[0];
        String sname = parts[1];
        int sage = Integer.parseInt(parts[2]);
        return RowFactory.create(sid, sname, sage);
    }
});

// Build the schema by hand; field order must match RowFactory.create(...)
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("sid", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("sname", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("sage", DataTypes.IntegerType, true));
StructType schema = DataTypes.createStructType(fields);

// Attach the hand-built schema to the Row RDD
Dataset<Row> df = spark.createDataFrame(rowRDD, schema);
df.coalesce(1)
        .write()
        .mode(SaveMode.Append)
        .parquet("parquet.res1");
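The anonymous Function classes above are the pre-Java 8 style; on Java 8 and later the same mapping step can be written as a lambda. A minimal sketch of the Row-mapping step only (same logic, lighter syntax):

// Java 8+ lambda equivalent of the anonymous Function above
JavaRDD<Row> rowRDD = source.map(line -> {
    String[] parts = line.split(",");
    return RowFactory.create(parts[0], parts[1], Integer.parseInt(parts[2]));
});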
Fundamentals are what programmers should really dig into, for example:
1) The internal structure of List/Set/Map and the differences between them
2) MySQL index storage structure and how to tune it; B-tree characteristics, computational complexity, and the factors that influence it...
3) JVM runtime structure, internals, and tuning
4) How Java class loaders work
5) The GC process in Java and the collection algorithms it uses
6) How consistent hashing is implemented in Redis and how it differs from ordinary hashing
7) Java multithreading and thread-pool development and management; the difference between Lock and synchronized (see the sketch after this list)
8) Spring IoC/AOP internals and the loading process...
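To make item 7 above concrete, here is a minimal sketch contrasting synchronized with java.util.concurrent.locks.ReentrantLock; the Counter class is illustrative only:

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Counter {
    private int value;
    private final Lock lock = new ReentrantLock();

    // synchronized: the JVM acquires and releases the monitor automatically
    public synchronized void incrementSynchronized() {
        value++;
    }

    // Lock: explicit acquire/release; also offers tryLock(), fairness,
    // and interruptible waiting, which synchronized does not
    public void incrementWithLock() {
        lock.lock();
        try {
            value++;
        } finally {
            lock.unlock(); // must be released manually, even on exceptions
        }
    }
}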