学习进度（9）

　　今天学了一点spark的内容，做了实验四的第一个。

1 ．spark-shell 交互式编程

请到本教程官网的“下载专区”的“数据集”中下载 chapter5-data1.txt，该数据集包含了某大学计算机系的成绩，数据格式如下所示：

Tom,DataBase,80
Tom,Algorithm,50
Tom,DataStructure,60
Jim,DataBase,90
Jim,Algorithm,60
Jim,DataStructure,80
……

请根据给定的实验数据，在 spark-shell 中通过编程来计算以下内容：

（1）该系总共有多少学生；

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt")
val persons = file.map(row=>row.split(",")(0))
val dist_per = persons.distinct()
dist_per.count

（2）该系共开设来多少门课程；

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt")
val persons = file.map(row=>row.split(",")(0))
var courses = file.map(row=>row.split(",")(1))
dist_cou.count

（3）Tom 同学的总成绩平均分是多少；

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt")
file.filter(row=>row.split(",")(0)=="Tom")
.map(row=>(row.split(",")(0),row.split(",")(2).toInt))
.mapValues(x=>(x,1)).
reduceByKey((x,y) => (x._1+y._1,x._2 + y._2))
.mapValues(x => (x._1 / x._2))
.collect()

（4）求每名同学的选修的课程门数；

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt")
file.map(row=>(row.split(",")(0),row.split(",")(1))).
mapValues(x=>(1)).
reduceByKey((x,y)=>(x+y)).
collect()

（5）该系 DataBase 课程共有多少人选修；

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt")
file.filter(row=>row.split(",")(1)=="DataBase").count()

（6）各门课程的平均分是多少；

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt") 
file.map(row=>(row.split(",")(1),row.split(",")(2).toInt)).
mapValues(x=>(x,1)).
reduceByKey((x,y)=>(x._1+y._1,x._2+y._2)).
mapValues(x=>(x._1/x._2)).
collect()

（7）使用累加器计算共有多少人选了 DataBase 这门课。

val file = sc.textFile("file:///home/hadoop/Code/Data01.txt")
val pare = file.filter(row=>row.split(",")(1)=="DataBase").
map(row=>(row.split(",")(1),1))
val accum =sc.accumulator(0)
pare.values.foreach(x => accum.add(x))
accum.value

2. 编写独立应用程序实现数据去重

对于两个输入文件 A 和 B，编写 Spark 独立应用程序，对两个文件进行合并，并剔除其中重复的内容，得到一个新文件 C。下面是输入文件和输出文件的一个样例，供参考。

输入文件 A 的样例如下：

20170101 x
20170102 y
20170103 x
20170104 y
20170105 z
20170106 z

输入文件 B 的样例如下：

20170101 y
20170102 y
20170103 x
20170104 z
20170105 y

根据输入的文件 A 和 B 合并得到的输出文件 C 的样例如下：

20170101 x
20170101 y
20170102 y
20170103 x
20170104 y
20170104 z
20170105 y
20170105 z
20170106 z

import scala.io.Source
import java.io.PrintWriter
import java.io.File
import Array._
import scala.util.control._

object exercise4 {
   def main(args: Array[String]){
    File() 
    
  }
   def File(){
     val AFile=InFile("A.txt")
     val BFile=InFile("B.txt")
     var CFile=concat( AFile, BFile)
     var CFile2=new Array[String](CFile.size)
     val loop = new Breaks;
     
     
     for(i<-CFile){
      loop.breakable{
       for(j<- 0 to CFile2.size-1 ){
         if(CFile2(j)!=null){if(i==CFile2(j))loop.break;}
         else {CFile2(j)=i;loop.break;}
       }
     }
     }
      outFile(CFile2)
   }
   def InFile(path:String) : Array[String] ={
    val source = Source.fromFile(path, "UTF-8")
    val lines = source.getLines().toArray
    return lines
     
   }
   def outFile(data:Array[String]){
    val writer = new PrintWriter(new File("C.txt"))
    for(i <-data)
    if(i!=null)writer.println(i)
    writer.close()
   }
}

3. 编写独立应用程序实现求平均值问题

每个输入文件表示班级学生某个学科的成绩，每行内容由两个字段组成，第一个是学生名字，第二个是学生的成绩；编写 Spark 独立应用程序求出所有学生的平均成绩，并输出到一个新文件中。下面是输入文件和输出文件的一个样例，供参考。

Algorithm 成绩：
小明 92
小红 87
小新 82
小丽 90
Database 成绩：
小明 95
小红 81
小新 89
小丽 85
Python 成绩：
小明 82
小红 83
小新 94
小丽 91
平均成绩如下：
(小红,83.67)
(小新,88.33)
(小明,89.67)
(小丽,88.67)

import scala.io.Source
import java.io.PrintWriter
import java.io.File
import Array._
import scala.util.control._

object exercise4 {
  def main(args: Array[String]){
    File() 
    
  }
  def File(){
    val data=InFile("student.txt")
    var student= ofDim[String](4,2)
    val loop = new Breaks;
    var time:Int=0
    for(i<-data){
      var text=String.valueOf(i);
      var text2=text.split(" ")
      loop.breakable{
      for(j<- 0 to student.size-1){
        if(student(j)(0)==null){student(j)(0)=text2(0);student(j)(1)=text2(1);loop.break;}
        else{
          if(text2(0)==student(j)(0)){student(j)(1)=String.valueOf(student(j)(1).toInt+text2(1).toInt);time+=1}
        }
      }}
    }
    for(j<-0 to 3){
     student(j)(1)=String.valueOf(student(j)(1).toDouble/3)
    }
    outFile(student)
  }
  def InFile(path:String) : Array[String] ={
    val source = Source.fromFile(path, "UTF-8")
    val lines = source.getLines().toArray
    return lines
     
   }
   def outFile(data:Array[Array[String]]){
    val writer = new PrintWriter(new File("avg.txt"))
    for(i <-0 to data.size-1)
    writer.println(data(i)(0)+" "+data(i)(1))
    writer.close()
   }
}

posted @ 2020-02-07 15:23 星辰° 阅读(591) 评论(0) 编辑收藏举报

刷新页面返回顶部

星辰°

学习进度（9）

1 ．spark-shell 交互式编程

2. 编写独立应用程序实现数据去重

3. 编写独立应用程序实现求平均值问题

公告