Implementing NOT IN on a Spark DataFrame

Source: https://sqlandhadoop.com/spark-dataframe-in-isin-not-in/

 

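Summary: To express a “NOT IN” condition, put the negation operator (!) in front of the isin expression. Because the method call binds more tightly than the prefix operator, the ! negates the result of isin as a whole, not the column itself.

e.g. df_pres.filter(!$"pres_bs".isin("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()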


Spark Dataframe IN-ISIN-NOT IN

IN and NOT IN conditions are used in FILTER/WHERE clauses, and even in JOINs, when a column has to be compared against several possible values. Under “IN”, a row qualifies if its value equals any one of the listed values. “NOT IN” is the opposite: the row qualifies only if its value matches none of the values listed.
So let’s look at an example of the IN condition.

 
scala> df_pres.filter($"pres_bs" in ("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------+
|           pres_name|  pres_dob| pres_bs|
+--------------------+----------+--------+
|    Martin Van Buren|1782-12-05|New York|
|    Millard Fillmore|1800-01-07|New York|
|    Ulysses S. Grant|1822-04-27|    Ohio|
| Rutherford B. Hayes|1822-10-04|    Ohio|
|   James A. Garfield|1831-11-19|    Ohio|
|   Benjamin Harrison|1833-08-20|    Ohio|
|    William McKinley|1843-01-29|    Ohio|
|  Theodore Roosevelt|1858-10-27|New York|
| William Howard Taft|1857-09-15|    Ohio|
|   Warren G. Harding|1865-11-02|    Ohio|
|Franklin D. Roose...|1882-01-30|New York|
|Dwight D. Eisenhower|1890-10-14|   Texas|
|   Lyndon B. Johnson|1908-08-27|   Texas|
|        Donald Trump|1946-06-14|New York|
+--------------------+----------+--------+

Note: the “in” method is not available in Spark 2.0, so the preferred method is “isin”.
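A form that does not depend on either Column method is to pass the predicate to filter as a SQL expression string. The sketch below assumes a tiny hand-made DataFrame named sample that only borrows the column names of df_pres; the session setup lines are needed only outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation. spark-shell already provides `spark`,
// so these two lines can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("in-sketch").getOrCreate()
import spark.implicits._

// Hypothetical three-row sample reusing the column names of df_pres.
val sample = Seq(
  ("George Washington", "1732-02-22", "Virginia"),
  ("Martin Van Buren", "1782-12-05", "New York"),
  ("Ulysses S. Grant", "1822-04-27", "Ohio")
).toDF("pres_name", "pres_dob", "pres_bs")

// filter (and its alias where) also accept a SQL predicate written as a string.
val inStates = sample.filter("pres_bs IN ('New York', 'Ohio', 'Texas')")
inStates.select("pres_name", "pres_bs").show()
```

The string form is convenient when the condition already exists as SQL text, at the cost of losing compile-time checking of the expression.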

Another way of writing it, and the one I prefer, is to use the isin function.

scala> df_pres.filter($"pres_bs".isin("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------+
|           pres_name|  pres_dob| pres_bs|
+--------------------+----------+--------+
|    Martin Van Buren|1782-12-05|New York|
|    Millard Fillmore|1800-01-07|New York|
|    Ulysses S. Grant|1822-04-27|    Ohio|
| Rutherford B. Hayes|1822-10-04|    Ohio|
|   James A. Garfield|1831-11-19|    Ohio|
|   Benjamin Harrison|1833-08-20|    Ohio|
|    William McKinley|1843-01-29|    Ohio|
|  Theodore Roosevelt|1858-10-27|New York|
| William Howard Taft|1857-09-15|    Ohio|
|   Warren G. Harding|1865-11-02|    Ohio|
|Franklin D. Roose...|1882-01-30|New York|
|Dwight D. Eisenhower|1890-10-14|   Texas|
|   Lyndon B. Johnson|1908-08-27|   Texas|
|        Donald Trump|1946-06-14|New York|
+--------------------+----------+--------+
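In practice the list of values often lives in a Scala collection instead of being typed inline. Since isin takes a variable-length argument list, a collection can be expanded with `: _*`. The following is a minimal sketch; the sample DataFrame and the states value are made up for illustration and are not part of the original data set.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation. spark-shell already provides `spark`,
// so these two lines can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("isin-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample reusing two column names from df_pres.
val sample = Seq(
  ("George Washington", "Virginia"),
  ("Martin Van Buren", "New York"),
  ("Ulysses S. Grant", "Ohio"),
  ("Lyndon B. Johnson", "Texas")
).toDF("pres_name", "pres_bs")

// Values to match, kept in an ordinary Scala collection.
val states = Seq("New York", "Ohio", "Texas")

// isin is declared with varargs, so the Seq is expanded with `: _*`.
val matched = sample.filter($"pres_bs".isin(states: _*))
matched.show()
```

Spark 2.4 and later also provide Column.isInCollection, which accepts the collection directly; the varargs form shown here works on older 2.x releases as well.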

To express a “NOT IN” condition, put the negation operator (!) in front of the isin expression from the previous query. The ! applies to the result of isin as a whole, not to the column name alone.

scala> df_pres.filter(!$"pres_bs".isin("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Adams|1735-10-30|       Massachusetts|
|    Thomas Jefferson|1743-04-13|            Virginia|
|       James Madison|1751-03-16|            Virginia|
|        James Monroe|1758-04-28|            Virginia|
|   John Quincy Adams|1767-07-11|       Massachusetts|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|William Henry Har...|1773-02-09|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|       James K. Polk|1795-11-02|      North Carolina|
|      Zachary Taylor|1784-11-24|            Virginia|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|      James Buchanan|1791-04-23|        Pennsylvania|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
|    Grover Cleveland|1837-03-18|          New Jersey|
|      Woodrow Wilson|1856-12-28|            Virginia|
|     Calvin Coolidge|1872-07-04|             Vermont|
+--------------------+----------+--------------------+
only showing top 20 rows
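One thing to watch with the negated isin is NULL handling. Following SQL semantics, a NULL compared against the list yields NULL, the negation of NULL is still NULL, and filter drops such rows. The sketch below uses a made-up sample with one missing pres_bs value to show this, then shows two hedged alternatives that keep the NULL row: an explicit isNull check, and a left anti join for the case where the excluded values sit in another DataFrame. All names other than pres_name and pres_bs are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation. spark-shell already provides `spark`,
// so these two lines can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("not-in-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample; None becomes a NULL in the pres_bs column.
val sample = Seq(
  ("George Washington", Some("Virginia")),
  ("Martin Van Buren", Some("New York")),
  ("Unknown Person", None)
).toDF("pres_name", "pres_bs")

val excluded = Seq("New York", "Ohio", "Texas")

// Plain NOT IN: the NULL row evaluates to NULL and is filtered out.
val notIn = sample.filter(!$"pres_bs".isin(excluded: _*))

// Keep NULL rows explicitly when they should count as "not in the list".
val notInOrNull = sample.filter($"pres_bs".isNull || !$"pres_bs".isin(excluded: _*))

// When the excluded values are rows of another DataFrame, a left anti join
// keeps every left row that has no match on the right, NULL keys included.
val excludedDf = excluded.toDF("pres_bs")
val antiJoined = sample.join(excludedDf, Seq("pres_bs"), "left_anti")

notIn.show()
notInOrNull.show()
antiJoined.show()
```

Which variant is appropriate depends on whether a missing value should be treated as outside the list; the plain negation silently excludes it.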
Posted by 梦醒江南·Infinite