Spark notes


Filtering multiple values in multiple columns:

When you're pulling data from a database (Hive or a SQL-type database in this example) and need to filter on multiple columns, it can be easier to load the table with the first filter applied, then chain the remaining filters over the result (many small transformations is the encouraged style of Spark programming):

{
    import org.apache.spark.sql.hive.HiveContext
    val hc = new HiveContext(sc)

    // First filter: push the IN clause down to Hive
    val first_data_filter = hc.sql("SELECT col1, col2, col3 FROM tableName WHERE col3 IN ('value_1', 'value_2', 'value_3')")

    // Remaining filters: chain row-level predicates; fields are accessed by position
    val second_data_filter = first_data_filter.filter(row => row(1) == "50" || row(1) == "20")
    val final_filtered_data = second_data_filter.filter(row => row(0) == "1500")
}
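The staged-filter pattern itself doesn't need a cluster to understand. Below is a minimal plain-Scala sketch of the same logic over an in-memory sequence of hypothetical (col1, col2, col3) tuples standing in for the Hive query result; the data values are invented for illustration:

```scala
object StagedFilterSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical rows standing in for the table: (col1, col2, col3)
    val rows = Seq(
      ("1500", "50", "value_1"),
      ("1500", "20", "value_2"),
      ("1500", "99", "value_1"),
      ("2000", "50", "value_3")
    )

    // Stage 1: the IN-clause filter the SQL query would perform
    val firstFilter = rows.filter { case (_, _, c3) =>
      Set("value_1", "value_2", "value_3").contains(c3)
    }

    // Stages 2 and 3: the same chained filters as the Spark code above
    val secondFilter = firstFilter.filter { case (_, c2, _) => c2 == "50" || c2 == "20" }
    val finalData    = secondFilter.filter { case (c1, _, _) => c1 == "1500" }

    println(finalData) // List((1500,50,value_1), (1500,20,value_2))
  }
}
```

Each stage narrows the data before the next predicate runs, which mirrors how chained `filter` transformations compose in Spark.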

https://segmentfault.com/a/1190000002614456

Posted by 小毛驴