Spark + HBase Data Processing and Storage Lab (Selected Parts)

Posted on 2023-04-20 11:57 by Capterlliar

0. Setting up the Scala + Spark + HBase environment in IDEA

Things to download: Scala and Java. Make sure the two versions are compatible with each other.

Environment: Windows 10, Scala 2.10.6, JDK 1.7, IDEA 2022.3.1

Create a Maven project.

Install the Scala plugin.

Right-click the project and add Scala framework support.

The resulting project layout (screenshot in the original post): the scala folder is added as a source root and holds the Scala code.

Add the dependencies. Replace the version numbers in the properties block with the versions you are using. What each dependency and plugin is for is noted in the comments. If downloads are very slow, remember to switch to a faster Maven mirror.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>untitled1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>7</maven.compiler.source>
        <maven.compiler.target>7</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.10</scala.version>
        <spark.version>1.6.3</spark.version>
        <hbase.version>1.2.6</hbase.version>
    </properties>

    <dependencies>
        <!-- Scala language dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.6</version>
        </dependency>
        <!-- Dependency for reading CSV files; optional -->
        <dependency>
            <groupId>au.com.bytecode</groupId>
            <artifactId>opencsv</artifactId>
            <version>2.4</version>
        </dependency>
        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.43</version>
        </dependency>
        <!-- HBase dependencies -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>

    </dependencies>


    <build>
        <plugins>
            <!-- Needed to compile Scala sources during the Maven build -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

Add some test code under the scala folder:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("sparkDemo")
    val sc = SparkContext.getOrCreate(conf)
    val input = sc.textFile("file:///home/hadoop/dream.txt");
    val wordCount = input.flatMap(
      line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCount.foreach(println)
  }
}

Run maven clean and maven package, then copy the jar to the cluster and try running it.

If it won't run, start from a Scala hello world (sketch below) and add the dependencies back one at a time.
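
A minimal starting point for that kind of debugging (plain Scala, no Spark involved):

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // If this compiles, packages and prints, the Scala/JDK/Maven setup itself is fine
    println("hello world")
  }
}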

1. Get to know Spark's data reading and saving operations, and try reading and saving a CSV file

Opening the file directly with sc.textFile came out garbled, and this Spark version does not have SparkSession either, so I pulled in the opencsv dependency instead. Reading is shown below; a sketch for saving follows the code.

Code:

import org.apache.spark.{SparkConf, SparkContext}
import au.com.bytecode.opencsv.CSVReader

import java.io.StringReader

object WordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("sparkDemo")
    val sc = SparkContext.getOrCreate(conf)
    val input = sc.textFile("file:///home/hadoop/test.csv")

    input.collect().foreach(println)
    // Parse each text line into an Array[String] of fields with opencsv
    val result = input.map { line =>
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    }
    println(result.getClass)
    result.collect().foreach(x => {
      x.foreach(println); println("======")
    })
  }
}
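
The assignment also asks for saving. A minimal sketch of writing the parsed rows back out with opencsv's CSVWriter, assuming result is the RDD[Array[String]] built above and file:///home/hadoop/test_out is a made-up output path:

import au.com.bytecode.opencsv.CSVWriter
import java.io.StringWriter

// Turn each Array[String] row back into a properly quoted CSV line
val csvLines = result.map { row =>
  val sw = new StringWriter()
  val writer = new CSVWriter(sw)
  writer.writeNext(row)
  writer.close()
  sw.toString.trim // drop the trailing newline CSVWriter appends
}
// Spark writes one part file per partition under this directory
csvLines.saveAsTextFile("file:///home/hadoop/test_out")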

You will notice Spark's INFO logging is very chatty and the actual output gets drowned out; tweak the configuration (see the linked post on how to stop Spark from printing INFO logs).
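
Two quick in-code alternatives, if you would rather not edit log4j.properties (setLogLevel is available from Spark 1.4 on, so the 1.6.3 used here has it):

import org.apache.log4j.{Level, Logger}

// Option 1: silence Spark's loggers before the SparkContext is created
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

// Option 2: raise the threshold on an existing SparkContext
sc.setLogLevel("WARN")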

2. Use Spark to compute, for every student, the total and average score over all courses, and store the aggregated results in an HBase table.

HBase table structure:

  Row key: number (the student number)
  Column family information: columns name, sex, age (the student's name, sex and age)
  Column family score: columns 123001, 123002, 123003 (one score per course)
  Column family stat_score: columns sum, avg (total score and average score)
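
The job below assumes a student table with these column families already exists. In case it doesn't, a minimal sketch of creating it with the HBase 1.2 client API (table and family names come from the structure above; the ZooKeeper settings are the same ones used in the job below):

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "192.168.56.121")
conf.set("hbase.zookeeper.property.clientPort", "2181")

val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin

// Three column families, matching the table structure above
val desc = new HTableDescriptor(TableName.valueOf("student"))
desc.addFamily(new HColumnDescriptor("information"))
desc.addFamily(new HColumnDescriptor("score"))
desc.addFamily(new HColumnDescriptor("stat_score"))

if (!admin.tableExists(TableName.valueOf("student"))) {
  admin.createTable(desc)
}
admin.close()
connection.close()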

 

First, adjust the configuration on the cluster so that the HBase jars are on Spark's classpath; otherwise you will run into:

java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

There are two ways to read and write HBase from Spark: the RDD API and the DataFrame API (with DataFrame generalized into Dataset). Here the table is read as an RDD via newAPIHadoopRDD and the results are written back with saveAsNewAPIHadoopDataset.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

object WordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("sparkDemo")
    val sc = SparkContext.getOrCreate(conf)

    val tablename = "student"

    // Read-side configuration: where ZooKeeper lives and which table to scan
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "192.168.56.121")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set(TableInputFormat.INPUT_TABLE, tablename)

    // Load the table as an RDD of (row key, Result) pairs
    val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    val count = hBaseRDD.count()
    println(count)

    // Pull each student's three course scores out of the "score" family,
    // then emit "rowkey,total,average" as a plain string
    val res = hBaseRDD.map { case (_, result) =>
      val key = Bytes.toString(result.getRow)
      val s1 = Bytes.toString(result.getValue("score".getBytes, "123001".getBytes))
      val s2 = Bytes.toString(result.getValue("score".getBytes, "123002".getBytes))
      val s3 = Bytes.toString(result.getValue("score".getBytes, "123003".getBytes))
      println(key, s1, s2, s3)
      val total = Integer.parseInt(s1) + Integer.parseInt(s2) + Integer.parseInt(s3)
      val aver = total / 3.0 // divide by 3.0 so the average keeps its fractional part
      key + "," + total + "," + aver
    }
    println(res)

    // Write-side configuration: same cluster, output goes back into the same table
    val hbaseConf2 = HBaseConfiguration.create()
    hbaseConf2.set("hbase.zookeeper.quorum", "192.168.56.121")
    hbaseConf2.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf2.set(TableOutputFormat.OUTPUT_TABLE, tablename)
    val job = new Job(hbaseConf2)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Build one Put per student, writing the total and average into the stat_score family
    val rdd = res.map(_.split(",")).map { arr =>
      println(arr(0), arr(1), arr(2))
      val put = new Put(Bytes.toBytes(arr(0)))
      put.addColumn(Bytes.toBytes("stat_score"), Bytes.toBytes("sum"), Bytes.toBytes(arr(1)))
      put.addColumn(Bytes.toBytes("stat_score"), Bytes.toBytes("avg"), Bytes.toBytes(arr(2)))
      (new ImmutableBytesWritable, put)
    }
    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}
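
To sanity-check the write, one option is to read a single row back with the plain HBase client. A small sketch; the row key "2015001" is only a placeholder for whatever student numbers the table actually uses:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "192.168.56.121")
conf.set("hbase.zookeeper.property.clientPort", "2181")

val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("student"))

// Fetch only the two aggregated columns for one (placeholder) row key
val get = new Get(Bytes.toBytes("2015001"))
get.addColumn(Bytes.toBytes("stat_score"), Bytes.toBytes("sum"))
get.addColumn(Bytes.toBytes("stat_score"), Bytes.toBytes("avg"))
val r = table.get(get)
println(Bytes.toString(r.getValue(Bytes.toBytes("stat_score"), Bytes.toBytes("sum"))))
println(Bytes.toString(r.getValue(Bytes.toBytes("stat_score"), Bytes.toBytes("avg"))))

table.close()
connection.close()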