spark scala 删除所有列全为空值的行

删除表中全部为NaN的行

df.na.drop("all")

删除表任一列中有NaN的行

df.na.drop("any")

示例:

复制代码
scala> df.show
+----+-------+--------+-------------------+-----+----------+
|  id|zipcode|    type|               city|state|population|
+----+-------+--------+-------------------+-----+----------+
|   1|    704|STANDARD|               null|   PR|     30100|
|   2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|   3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|   4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|   5|  76177|STANDARD|               null|   TX|      null|
|null|   null|    null|               null| null|      null|
|   7|  76179|STANDARD|               null|   TX|      null|
+----+-------+--------+-------------------+-----+----------+


scala> df.na.drop("all").show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
|  7|  76179|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+


scala> df.na.drop().show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+


scala> df.na.drop("any").show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode|  type|             city|state|population|
+---+-------+------+-----------------+-----+----------+
|  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
+---+-------+------+-----------------+-----+----------+
复制代码

删除给定列为Null的行:

val nameArray = sparkEnv.sc.textFile("/master/abc.txt").collect()
val df = df.na.drop("all", nameArray.toList.toArray)

df.na.drop(Seq("population","type"))

删除指定列为Na的行(删除列create_time为Na的行)

.na.drop("all", Seq("create_time"))

函数原型:

复制代码
def drop(): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values.

def drop(how: String): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values.
If how is "any", then drop rows containing any null or NaN values. If how is "all", then drop rows only if every column is null or NaN for that row.

def drop(how: String, cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.

def drop(how: String, cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.

def drop(cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.

def drop(cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
复制代码

更多函数原型:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions


参考:
N多spark使用示例:https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
示例代码及数据集:https://github.com/spark-examples/spark-scala-examples csv路径:src/main/resources/small_zipcode.csv
https://www.jianshu.com/p/39852729736a

posted @   船长博客  阅读(2202)  评论(0编辑  收藏  举报
编辑推荐:
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
阅读排行:
· DeepSeek 开源周回顾「GitHub 热点速览」
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布:重大改进与新特性概览!
· AI与.NET技术实操系列(二):开始使用ML.NET
· .NET10 - 预览版1新功能体验(一)
永远相信美好的事情即将发生!
点击右上角即可分享
微信分享提示