Spark notes


Filtering multiple values in multiple columns:

When you're pulling data from a database (Hive or a SQL-type database in this example) and need to filter on multiple columns, it can be easier to load the table with the first filter applied, then chain the remaining filters over the result (many small transformations is the encouraged style of Spark programming):

{
    import org.apache.spark.sql.hive.HiveContext
    val hc = new HiveContext(sc)

    // First filter: push the IN clause down to Hive
    val first_data_filter = hc.sql("SELECT col1, col2, col3 FROM tableName WHERE col3 IN ('value_1', 'value_2', 'value_3')")

    // Remaining filters: chain row-level predicates; fields are accessed by position
    val second_data_filter = first_data_filter.filter(row => row(1) == "50" || row(1) == "20")
    val final_filtered_data = second_data_filter.filter(row => row(0) == "1500")
}
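The staged-filter pattern itself doesn't need a cluster to understand. Below is a minimal plain-Scala sketch of the same logic over an in-memory sequence of hypothetical (col1, col2, col3) tuples standing in for the Hive query result; the data values are invented for illustration:

```scala
object StagedFilterSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical rows standing in for the table: (col1, col2, col3)
    val rows = Seq(
      ("1500", "50", "value_1"),
      ("1500", "20", "value_2"),
      ("1500", "99", "value_1"),
      ("2000", "50", "value_3")
    )

    // Stage 1: the IN-clause filter the SQL query would perform
    val firstFilter = rows.filter { case (_, _, c3) =>
      Set("value_1", "value_2", "value_3").contains(c3)
    }

    // Stages 2 and 3: the same chained filters as the Spark code above
    val secondFilter = firstFilter.filter { case (_, c2, _) => c2 == "50" || c2 == "20" }
    val finalData    = secondFilter.filter { case (c1, _, _) => c1 == "1500" }

    println(finalData) // List((1500,50,value_1), (1500,20,value_2))
  }
}
```

Each stage narrows the data before the next predicate runs, which mirrors how chained `filter` transformations compose in Spark.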

https://segmentfault.com/a/1190000002614456

Posted by 小毛驴