sparksql的自定义函数
自定义函数
package com.ruozedata.SparkProjectJob import org.apache.spark.sql.SparkSession object FunctionApp { def main(args: Array[String]): Unit = { val spark =SparkSession.builder()// .master("local[2]")// .appName("AnalyzerTrain")// .getOrCreate() import spark.implicits._ val likeDF= spark.sparkContext.parallelize(List("17er\truoze,j哥,星星,小海", "老二\tzwr,17er", "小海\t苍老师,波老师")) .map(x => { val fileds = x.split("\t") Stu(fileds(0).trim, fileds(1).trim) } ).toDF() spark.udf.register("like_count" ,(like:String)=>like.split(",").size) //注册函数like_count
//方式一:
likeDF.createOrReplaceTempView("info")//DF通过createOrReplaceTempView注册成临时表info
spark.sql("select name,like,like_count(like) num from info").show() spark.stop() }
case class Stu(name: String, like: String) }
运行结果
+------+------------------------+-----+
|name| like |num|
+------+------------------------+-----+
|17er |ruoze,j哥,星星,小海| 4|
| 老二| zwr,17er | 2 |
| 小海| 苍老师,波老师 | 2 |
+----+--------------+----------------+
定义了每个人喜欢的人的个数的函数;以上的是定义函数以后通过sql来使用的,那如何通过API来使用呢?看下面的代码
package com.ruozedata.SparkProjectJob import org.apache.spark.sql.SparkSession object FunctionApp { def main(args: Array[String]): Unit = { val spark =SparkSession.builder()// .master("local[2]")// .appName("AnalyzerTrain")// .getOrCreate() import spark.implicits._ val likeDF= spark.sparkContext.parallelize(List("17er\truoze,j哥,星星,小海", "老二\tzwr,17er", "小海\t苍老师,波老师")) .map(x => { val fileds = x.split("\t") Stu(fileds(0).trim, fileds(1).trim) } ).toDF() spark.udf.register("like_count" ,(like:String)=>like.split(",").size) //注册函数like_count //likeDF.createOrReplaceTempView("info")//DF通过createOrReplaceTempView注册成临时表info //方式二:注册函数以后直接就可以当成内置函数使用的模式来使用
likeDF.selectExpr("name","like","like_count(like) as cnt").show()
case class Stu(name: String, like: String) }
运行结果
+----+--------------------------+--------+
|name| like |cnt |
+------+--------------------------+------+
|17er | ruoze,j哥,星星,小海 | 4 |
| 老二 | zwr,17er | 2 |
| 小海 | 苍老师,波老师 | 2 |
+------+--------------------------+------+
其实方式二仅仅是半API,纯正的API见下
package com.ruozedata.SparkProjectJob import org.apache.spark.sql.{SparkSession, functions} object FunctionApp { def main(args: Array[String]): Unit = { val spark =SparkSession.builder()// .master("local[2]")// .appName("AnalyzerTrain")// .getOrCreate() import spark.implicits._ val likeDF= spark.sparkContext.parallelize(List("17er\truoze,j哥,星星,小海", "老二\tzwr,17er", "小海\t苍老师,波老师")) .map(x => { val fileds = x.split("\t") Stu(fileds(0).trim, fileds(1).trim) } ).toDF()
//方式三: val like_count =functions.udf((like:String)=>like.split(",").size) //spark.udf.register("like_count" ,(like:String)=>like.split(",").size) likeDF.select($"name",$"like",like_count($"like").alias("cnt") ).show()
//like_count 需要通过val来定义,这样在select里边使用的时候才不会爆红
spark.stop() } case class Stu(name: String, like: String) }
运行结果
+----+--------------------------+--------+
|name| like |cnt |
+------+--------------------------+------+
|17er | ruoze,j哥,星星,小海 | 4 |
| 老二 | zwr,17er | 2 |
| 小海 | 苍老师,波老师 | 2 |
+------+--------------------------+------+