Common Spark Operations (Part 2)

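// Read data with Spark
// DataFrameReader.textFile returns Dataset<String>; the partition-count argument exists only on SparkContext.textFile
Dataset<String> lines = spark.read().textFile(currentSrcPath);
Dataset<Row> df = spark.read().json(path);
Dataset<Row> orcDF = spark.read().orc(path);
Dataset<Row> parquet = spark.read().parquet(path);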

// Write data with Spark
// text() only works when the Dataset has a single string column
df.write().mode("overwrite").text(outputPath);
df.write().mode("overwrite").parquet(outputPath);
df.write().mode("overwrite").orc(outputPath);

// Convert an RDD to Dataset<Row>
Dataset<Row> df = spark.createDataFrame(rowRDD, AdjustSchema.row);

// Convert a List to a Dataset
Dataset<String> dataset = spark.createDataset(Collections.singletonList(Long.toString(startTime)), Encoders.STRING());

// Get the Hadoop FileSystem from Spark
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());

 

// Build a schema
public static StructType row = DataTypes.createStructType(
        Arrays.asList(
                DataTypes.createStructField("phone_name", DataTypes.StringType, true),
                DataTypes.createStructField("app_id", DataTypes.StringType, true)
                // ...
        ));

 

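// Convert an RDD/JavaRDD to a DataFrame (Dataset<Row>)
Dataset<Row> personDF = spark.createDataFrame(personRDD, Person.class);
// With an explicit schema, the input must be an RDD<Row> or JavaRDD<Row>, not an RDD of beans
spark.createDataFrame(personRowRdd, PersonSchema);
personDF = spark.createDataFrame(personJavaRDD, Person.class);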

// Convert an RDD to a typed Dataset
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDS = spark.createDataset(personJavaRDD.rdd(), personEncoder);

// Build a DataFrame (Dataset<Row>) directly from a List
Dataset<Row> personListDF = spark.createDataFrame(personList, Person.class);

// Convert JavaRDD<Row> to Dataset<Row>
JavaRDD<Row> personRowRdd = personJavaRDD.map(person -> RowFactory.create(person.age, person.name));
personDF = spark.createDataFrame(personRowRdd, rowAgeNameSchema);

//Dataset<Person> -> JavaRDD<Person>
personJavaRDD = personDS.toJavaRDD();

//Dataset<Row> -> JavaRDD<Person>
personJavaRDD = personDF.toJavaRDD().map(row -> {
    String name = row.getAs("name");
    int age = row.getAs("age");
    return new Person(name, age);
});

//Dataset<Person> -> Dataset<Row>
ExpressionEncoder<Row> rowEncoder = RowEncoder.apply(rowSchema);
Dataset<Row> personDF_fromDS = personDS.map(
        (MapFunction<Person, Row>) person -> {
            List<Object> objectList = new ArrayList<>();
            objectList.add(person.name);
            objectList.add(person.age);
            return RowFactory.create(objectList.toArray());
        },
        rowEncoder
);

//Dataset<Row> -> Dataset<Person>
personDS = personDF.map(new MapFunction<Row, Person>() {
    @Override
    public Person call(Row value) throws Exception {
        return new Person(value.getAs("name"), value.getAs("age"));
    }
}, personEncoder);
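
The conversion snippets above rely on a Person class that the original post does not show. The following is a minimal sketch of what it might look like, not the author's actual class. It assumes two fields, name and age, inferred from the calls new Person(name, age), person.name and person.age. Encoders.bean works on JavaBean-style classes, so the sketch provides a public no-argument constructor plus getters and setters. The fields are left public only because the snippets above read them directly.

```java
import java.io.Serializable;

// Hypothetical Person bean assumed by the snippets above (not shown in the original post).
public class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    // Public because the snippets above access person.name and person.age directly.
    public String name;
    public int age;

    // A no-argument constructor is needed for bean-style instantiation.
    public Person() {
    }

    // Convenience constructor matching the new Person(name, age) calls above.
    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}
```

With a class of this shape, the column names produced by createDataFrame(personJavaRDD, Person.class) follow the bean properties, name and age, which is what the row.getAs("name") and row.getAs("age") calls above expect.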

 

Posted by Mars.wang