8. SparkSQL Comprehensive Assignment

Comprehensive exercise: student course scores

Download the sc.txt file from the network drive, create an RDD, and convert it into a DataFrame.

For the RDD operations, see https://www.cnblogs.com/Kiranar/p/16138625.html

For each analysis, the DataFrame and SQL screenshots are inserted together, DataFrame output on top, SQL output below. (Screenshots omitted here.)

Implement the following analyses twice: once with DataFrame operations and once by running SQL statements through spark.sql:

  • Persist the RDD: scm.cache()
  • How many students are there in total? map(), distinct(), count()
  • How many courses are offered?
  • Build a (name, (course, grade)) key-value pair RDD and inspect keys() and values()
  • How many courses did each student take? map(), countByKey()
  • How many students took each course? map(), countByValue()
  • How many scores of 100 are there?
  • How many courses did Tom take, and what was his grade in each? filter(), map() on the RDD
  • The same question via map() and lookup(), which returns a list
  • Sort Tom's grades by score. filter(), map(), sortBy()
  • Tom's average grade. map(), lookup(), mean()

  • Build a ((name, course), grade) RDD and inspect keys() and values()
  • Add 20 coursework points to each grade.
  • For each course, find the enrollment count and the average grade:
    • implemented with lookup() and np.mean()
    • implemented with reduceByKey() and collectAsMap()
    • implemented with combineByKey(), map() and round(), accurate to 2 decimal places (see the sketch after question 16 below)
    • Compare the similarities and differences of these approaches.

Comprehensive exercise: student course scores

Download the sc.txt file from the network drive and create an RDD, a DataFrame, and a temporary table/view;

Implement the following analyses three ways: with RDD operations, with DataFrame operations, and by running SQL statements through spark.sql:

How many students are there in total?

How many courses are offered in total?

How many courses did each student take?

How many students took each course?

For each course, the number of students scoring above 95

How many scores of 100 are there in the course 'Python'?

How many courses did Tom take, and what was his grade in each?

# Create the RDD
url = "file:///home/hadoop/sc.txt"
rdd = sc.textFile(url).map(lambda x:x.split(","))
scm = rdd.map(lambda a:[a[0],a[1],int(a[2])])
scm.cache()   # persist: scm is reused by almost every query below
scm.take(3)

# Convert it into a DataFrame
from pyspark.sql import Row
df = spark.createDataFrame(rdd.map(lambda x:Row(name=x[0],course=x[1],grade=int(x[2]))))
# Inspect df
df.show()
df.printSchema()
# Register it as a temporary view
df.createOrReplaceTempView("sc")
# Inspect the view
spark.sql("select * from sc").show()

# Imports
from pyspark.sql.functions import *

# Questions

# 1. How many students are there in total?
scm.map(lambda a:a[0]).distinct().count()
df.select(df.name).distinct().count()
spark.sql("select count(distinct name) from sc").show()

# 2. Which courses are offered in total?
scm.map(lambda a:a[1]).distinct().collect()
df.select(df.course).distinct().show()
spark.sql("select distinct course from sc").show()

# 3. How many courses did each student take?
name = scm.map(lambda a:(a[0],(a[1],a[2])))
name.countByKey()
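# The task sheet also asks to inspect this pair RDD's keys() and values();
# a quick peek (sketch; the exact output depends on sc.txt):
name.keys().take(3)     # student names
name.values().take(3)   # (course, grade) tuples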

df.groupBy(df.name).count().show()
spark.sql("select name, count(course) from sc group by name").show()

# 4. How many students took each course?
name.values().countByKey()
df.groupBy(df.course).count().show()
spark.sql("select course, count(name) from sc group by course").show()

# 5. Number of students scoring above 95 in each course
name.values().filter(lambda a:a[1]>95).countByKey()
df.filter(df.grade>95).groupBy(df.course).count().show()
spark.sql("select course,count(*) from sc where grade>95 group by course").show()

# 6. How many scores of 100 in the course 'Python'?
scm.filter(lambda a:a[1]=='Python').filter(lambda a:a[2]==100).count()
df.filter((df.course=='Python') & (df.grade==100)).count()
spark.sql("select count(*) from sc where course='Python' and grade=100").show()

# (The question was later revised; this answers the old wording.)
# 7. How many courses did Tom take, and what was his grade in each?
name.lookup('Tom')
len(name.lookup('Tom'))

df.select(df.course, df.grade).filter(df.name=='Tom').count()
df.select(df.course, df.grade).filter(df.name=='Tom').show()

spark.sql("select count(*) from sc where name='Tom'").show()
spark.sql("select course, grade from sc where name='Tom'").show()

# 7 (revised). Which courses did Tom fail (<60)?
scm.filter(lambda a:a[0]=='Tom').filter(lambda a:a[2]<60).map(lambda a:a[1]).collect()
df.select(df.course).filter((df.name=='Tom')&(df.grade<60)).show()
spark.sql("select course from sc where name='Tom' and grade<60").show()

# 8. Sort Tom's grades by score (descending, to match all three versions).
name.filter(lambda a:a[0]=='Tom').values().sortBy(lambda a:a[1],False).map(lambda a:a[1]).collect()
df.filter(df.name=='Tom').select(df.grade).orderBy(df.grade.desc()).show()
spark.sql("select grade from sc where name='Tom' order by grade desc").show()

# 9. Which courses did Tom take?
scm.filter(lambda a:a[0] == 'Tom').map(lambda a:a[1]).collect()
df.filter(df.name == 'Tom').select(df.course).show()
spark.sql('select course from sc where name = "Tom"').show()

# 10. Tom's average grade.
import numpy as np
np.mean(scm.filter(lambda a:a[0]=='Tom').map(lambda a:a[2]).collect())
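# The task sheet also suggests map()+lookup()+mean(); an equivalent sketch:
np.mean(scm.map(lambda a:(a[0],a[2])).lookup('Tom'))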

df.filter(df.name=='Tom').agg({'grade':'mean'}).show()
spark.sql("select avg(grade) from sc where name='Tom'").show()

# 11. Number of failing grades (<60) in 'OperatingSystem'
scm.filter(lambda a:a[1] == 'OperatingSystem' and a[2] < 60).count()
df.filter(df['course'] == 'OperatingSystem').filter(df['grade'] < 60).count()
spark.sql('select count(grade) from sc where course = "OperatingSystem" and grade < 60').show()

# 12. Average grade in 'OperatingSystem'
np.mean(scm.filter(lambda a:a[1] == 'OperatingSystem').map(lambda a:a[2]).collect())
df.filter(df['course'] == 'OperatingSystem').agg({'grade':'avg'}).show()
spark.sql('select AVG(grade) from sc where course = "OperatingSystem"').show()

# 13. Number of students scoring above 90 in 'OperatingSystem'
scm.filter(lambda a:a[1] == 'OperatingSystem' and a[2] > 90).count()
df.filter(df['course'] == 'OperatingSystem').filter(df['grade'] > 90).count()
spark.sql('select count("OperatingSystem > 90") from sc where course = "OperatingSystem" and grade >90').show()

# 14. Top 3 students in 'OperatingSystem'
scm.filter(lambda a:a[1] == 'OperatingSystem').sortBy(lambda a:a[2],False).map(lambda a:a[0]).take(3)
df.filter(df['course'] == 'OperatingSystem').sort(-df['grade']).select(df.name).limit(3).show()
spark.sql('select name from sc where course = "OperatingSystem" order by grade desc limit 3').show()

# 15. Scale each grade (×0.8) and add 20 coursework points.
scm.map(lambda a:((a[0],a[1]),a[2])).take(5)
scm.map(lambda a:((a[0],a[1]),a[2])).mapValues(lambda v:v*0.8+20).take(5)
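# The task sheet also asks to inspect this pair RDD's keys() and values() (sketch):
scm.map(lambda a:((a[0],a[1]),a[2])).keys().take(3)     # (name, course) tuples
scm.map(lambda a:((a[0],a[1]),a[2])).values().take(3)   # bare grades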
df.select(df.name, df.course, df.grade).show()
df.select(df.name, df.course, df.grade*0.8+20).show()
spark.sql("select name, course, grade from sc").show()
spark.sql("select name, course, grade*0.8+20 from sc").show()

# 16. Average grade for each course
rdd.map(lambda a:(a[1], (int(a[2]), 1))).reduceByKey(lambda a,b:(a[0]+b[0], a[1]+b[1])).map(lambda a:(a[0], a[1][0]/a[1][1])).collect()
df.groupBy(df.course).agg({'grade':'avg'}).show()
spark.sql('select course, AVG(grade) from sc group by course').show()
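# The task sheet additionally asks for lookup()/np.mean(), reduceByKey()+collectAsMap(),
# and combineByKey() versions of the per-course statistics. Sketches below, assuming scm
# and np from above (not part of the original answers); note that "import *" above shadows
# the builtin round() with pyspark.sql.functions.round(), so np.round() is used instead:
courses = scm.map(lambda a:a[1]).distinct().collect()
{c: np.round(np.mean(scm.map(lambda a:(a[1],a[2])).lookup(c)), 2) for c in courses}   # lookup()+np.mean()
scm.map(lambda a:(a[1],(a[2],1))).reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1]))\
   .mapValues(lambda v:np.round(v[0]/v[1],2)).collectAsMap()                          # reduceByKey()+collectAsMap()
scm.map(lambda a:(a[1],a[2])).combineByKey(
    lambda v:(v,1),                          # createCombiner: first grade seen in a partition
    lambda acc,v:(acc[0]+v,acc[1]+1),        # mergeValue: fold another grade into the accumulator
    lambda a,b:(a[0]+b[0],a[1]+b[1])         # mergeCombiners: combine partition accumulators
).map(lambda kv:(kv[0], kv[1][1], np.round(kv[1][0]/kv[1][1],2))).collect()           # (course, enrollment, avg)
# Comparison: lookup() ships one course's grades to the driver per call, so it is the
# slowest; reduceByKey()/combineByKey() aggregate in parallel on the executors.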

# 17. How many students took exactly 7 courses?
rdd.map(lambda a:(a[0], 1)).reduceByKey(lambda a,b:a+b).filter(lambda a:a[1]==7).count()
#
df.groupBy('name').count().filter(col('count') == 7).select(count('name')).show()
#
spark.sql(
  "select count(name)\
  from (select name, count(name) num from sc group by name)\
  where num = 7"
).show()

# 18. Number of students scoring above 95 in each course
# Same as question 5. (The commented-out block below instead answers a stricter
# reading: students who scored above 95 in every course they took.)
# scm.filter(lambda a:a[2]>95).map(lambda a:[a[0], 1]).reduceByKey(lambda a,b:a+b).join(scm.map(lambda a:[a[0], 1]).reduceByKey(lambda a,b:a+b)).filter(lambda a: a[1][0] == a[1][1]).count()
# #
# df.filter(df.grade > 95).groupBy(df.name).agg({'name':'count'}).select('name', col('count(name)').alias('c1'))\
# .join(df.groupBy(df.name).agg({'name':'count'}).select('name', col('count(name)').alias('c2')),'name','left')\
# .filter(col('c1') == col('c2')).select(count('name'))\
# .show()
# #
# spark.sql(
#   "select name, count(name) c2\
#   from sc group by name"
# ).createOrReplaceTempView("b")

# spark.sql(
#   "select name, count(name) c1\
#   from sc where grade>95 group by name"
# ).createOrReplaceTempView("a")

# spark.sql(
#   "select count(a.name)\
#   from a join b on a.name = b.name where c1 = c2"
# ).show()

# 19. For each course: enrollment, average grade, number of failing students, pass rate
# rdd
# enrollment
rdd.map(lambda a:(a[1], 1)).reduceByKey(lambda a,b:a+b).collect()
# average grade
rdd.map(lambda a:(a[1], (int(a[2]), 1))).reduceByKey(lambda a,b:(a[0]+b[0], a[1]+b[1])).map(lambda a:(a[0], a[1][0]/a[1][1])).collect()
# failing count
rdd.map(lambda a:(a[1], a[2])).filter(lambda a:int(a[1]) < 60).map(lambda a:(a[0], 1)).reduceByKey(lambda a,b:a+b).collect()
# pass rate; leftOuterJoin keeps courses with no failures (an inner join would drop them)
rdd.map(lambda a:(a[1], 1)).reduceByKey(lambda a,b:a+b).leftOuterJoin(
  rdd.map(lambda a:(a[1], a[2])).filter(lambda a:int(a[1]) < 60).map(lambda a:(a[0], 1)).reduceByKey(lambda a,b:a+b)
).map(lambda a:(a[0], 1-(a[1][1] or 0)/a[1][0])).collect()
# df
df1 = df.groupBy(df.course).agg({'grade':'mean'}).select(
  'course', bround('avg(grade)',scale=2).alias('平均分')
).join(
  df.groupBy(df.course).count()\
  .select(
    'course', col('count').alias('选修人数')
  ),"course","left"
).join(
  df.filter(df.grade<60).groupBy(df.course).count()\
  .select(
    'course', col('count').alias('不及格人数')
  ),"course","left"
).na.fill({'不及格人数': 0})   # courses with no failures come back as null from the left join

df1.select(
  'course', '平均分', '选修人数', '不及格人数',
  (1-df1['不及格人数']/df1['选修人数']).alias('通过率')
).show()
# sql
spark.sql(
  "select course, count(name) as count, round(avg(grade), 2) as avg\
  from sc group by course"
).show()

spark.sql(
  "select course, count(name) as failed\
  from sc where grade<60 group by course"
).createOrReplaceTempView("b")
spark.sql(
  "select course, count(name) as count\
  from sc group by course"
).createOrReplaceTempView("a")
spark.sql(
  "select a.course, coalesce(b.failed, 0) as failed,\
  1-coalesce(b.failed, 0)/a.count as pass\
  from a left join b on a.course = b.course"
).show()

# 20. How many grades are excellent (>=90), good (80-89), pass (60-79), and fail (<60)?
scm.map(lambda a:a[2]).filter(lambda a:a>=90).count()
scm.map(lambda a:a[2]).filter(lambda a:a>=80 and a<90).count()
scm.map(lambda a:a[2]).filter(lambda a:a>=60 and a<80).count()
scm.map(lambda a:a[2]).filter(lambda a:a<60).count()
#
df.filter(df.grade>=90).select(count(df.grade)).show()
df.filter((df.grade>=80) & (df.grade<90)).select(count(df.grade)).show()
df.filter((df.grade>=60) & (df.grade<80)).select(count(df.grade)).show()
df.filter(df.grade<60).select(count(df.grade)).show()
#
spark.sql("select count(name) from sc where grade>=90").show()
spark.sql("select count(name) from sc where grade>=80 and grade<90").show()
spark.sql("select count(name) from sc where grade>=60 and grade<80").show()
spark.sql("select count(name) from sc where grade<60").show()

# 21. Students who took both DataStructure and DataBase
scm.map(lambda a:[a[0], a[1]]).groupByKey().map(lambda x:(x[0], list(x[1]))).filter(lambda a:'DataStructure' in a[1] and 'DataBase' in a[1]).count()
#
df.select(col('name'), col('course').alias('ac')).join(
  df.select(col('name'), col('course').alias('bc')),"name","inner"
).filter(col('ac')=='DataBase').filter(col('bc')=='DataStructure')\
.select(col('name')).distinct().select(count(col('name'))).show()
#
spark.sql(
  "select count(distinct a.name)\
  from sc a join (select name, course from sc) b on a.name = b.name\
  where a.course = 'DataStructure' and b.course = 'DataBase'"
).show()
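# (Alternative sketch: intersect() of the two name sets avoids the self-join.)
df.filter(df.course=='DataStructure').select('name')\
  .intersect(df.filter(df.course=='DataBase').select('name')).count()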

# 22. Students who took DataStructure but not DataBase
scm.map(lambda a:[a[0], a[1]]).groupByKey().map(lambda x:(x[0], list(x[1]))).filter(lambda a:'DataStructure' in a[1] and 'DataBase' not in a[1]).count()
#
df2=df.filter(col('course') == 'DataBase').select(col('name').alias('b'))
df1=df.filter(col('course') == 'DataStructure').select(col('name').alias('a'))
df1.join(df2,[df1.a==df2.b],'left').filter(col('b').isNull()).select(count(col('a'))).show()
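# (Alternative sketch: a left_anti join keeps exactly the rows of df1 with no match in df2.)
df1.join(df2, df1.a==df2.b, 'left_anti').select(count(col('a'))).show()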
#
spark.sql(
  "select distinct name\
  from sc where course = 'DataStructure'"
).createOrReplaceTempView("a")
spark.sql(
  "select distinct name\
  from sc where course = 'DataBase'"
).createOrReplaceTempView("b")
spark.sql(
  "select count(a.name)\
  from a left join b on a.name = b.name\
  where b.name is null"
).show()

# 23. Students who took fewer than 3 courses
rdd.map(lambda a:(a[0], 1)).reduceByKey(lambda a,b:a+b).filter(lambda a:a[1]<3).map(lambda a:a[0]).collect()
df.groupBy('name').count().filter(col('count') < 3).select(col('name')).show()
#
spark.sql(
  "select name\
  from (select name, count(name) num from sc group by name)\
  where num < 3"
).show()

# 24. Students who took 6 or more courses
rdd.map(lambda a:(a[0], 1)).reduceByKey(lambda a,b:a+b).filter(lambda a:a[1]>=6).map(lambda a:a[0]).collect()
df.groupBy('name').count().filter(col('count') >= 6).select(col('name')).show()
#
spark.sql(
  "select name\
  from (select name, count(name) num from sc group by name)\
  where num >= 6"
).show()

# 25. Names and average grades of students averaging at least 60
rdd.map(lambda a:(a[0], (int(a[2]), 1))).reduceByKey(lambda a,b:(a[0]+b[0], a[1]+b[1])).map(lambda a:(a[0], a[1][0]/a[1][1])).filter(lambda a:a[1]>=60).collect()
#
df.groupBy(df.name).agg({'grade':'mean'}).select(
  'name', bround('avg(grade)',scale=2).alias('平均分')
).filter(col('平均分')>=60).show()
#
spark.sql(
  "select name, round(avg(grade), 2) as avg\
  from sc group by name having avg >= 60"
).show()

# 26. Top 10 students by average grade
rdd.map(lambda a:(a[0], (int(a[2]), 1))).reduceByKey(lambda a,b:(a[0]+b[0], a[1]+b[1])).map(lambda a:(a[0], a[1][0]/a[1][1])).sortBy(lambda a:a[1],False).take(10)
#
df.groupBy(df.name).agg({'grade':'mean'}).select(
  'name', bround('avg(grade)',scale=2).alias('平均分')
).sort(-col('平均分')).limit(10).show()
#
spark.sql(
  "select name, round(avg(grade), 2) as avg\
  from sc group by name\
  order by avg desc limit 10"
).show()

 
