最简大数据开发环境(一) Scala+Spark版

  

最简大数据开发环境(一) Scala+Spark版

最简大数据开发环境(二) Python+Spark版

  是否有像我一样,大数据架构原理看了千百遍,可每次想动手操练一番,就被Hadoop、Hive、Storm、Spark、Flink……一大堆的环境配置搞得晕头转向。那么有没有不用配置或者尽可能少配置的方案?

  我的方案是sbt+vscode,当然,需要Java开发环境。

  JAVA环境配置

  sbt安装

  VSCode下载

  以上三项配置好后,我就演示一下怎么一步一步构建一个Spark+Mongodb项目。对该方案方案感兴趣的可以看下唐建法的报告《Mongodb+Spark完整大数据解决方案》。

  • 打开VSCode,可以看到欢迎界面

  

  • Ctrl+Shift+X打开扩展插件选项,输入scala,安装Scala扩展插件

  

 

 

  •  Ctrl+Shift+`,或工具栏->终端->新终端,创建新终端

  

 

  • 切换到自己常用文件夹,我的是E:\project
  • #创建项目文件夹
    mkdir mongodbspark
  • cd .\mongodbspark\
  • #进入项目开发界面,会新打开一个VSCode窗口
    code .

 

   

 

 

  • #新建终端
    Ctrl+Shift+`
    #测试sbt是否配置好
    sbt --version

  • #添加build.sbt文件
    ni build.sbt

    打开build.sbt文件,并添加以下代码

  • //项目名称
    name :="mongodbspark"
    //项目版本
    version :="1.0"
    //scala版本
    scalaVersion :="2.12.12"
    libraryDependencies ++=Seq(
        "org.apache.spark"%%"spark-core"%"3.1.2",
        "org.apache.spark"%%"spark-sql"%"3.1.2",
        "org.mongodb.spark"%%"mongo-spark-connector"%"3.0.1"
    )

    Ctrl+S保存build.sbt,从右下弹出框选择Import changes,稍等几分钟待依赖项被加载到项目中

  

  • #创建文件夹
    mkdir .\src\main\scala\
    #添加Main.scala文件 ni Main.scala

    #Main.scala添加以下代码 package sparkmongo object Main extends App{ println(45+25) } #终端执行以下命令
    sbt reload sbt run

     

  •  

  • 到这里我们的scala项目基本框架已经搭建好,目录结构如下图

  

  Spark与Mongodb连接需要用到Mongodb官方的spark-connector

  本地测试文档结构如下

  

   

 

 

 

  •  Main.scala文件里的代码替换为以下代码  
package sparkmongo

//MongodbSaprk-Connector
import com.mongodb.spark._
import org.apache.spark.sql.SparkSession

object main extends App{
    //在spark创建时配置mongodb连接信息
    //手打的同学要注意别把uri打成url,很容易混淆
    val sparkSession=SparkSession.builder()
        .master("local")
        .appName("MongodbSpark")
        .config("spark.mongodb.input.uri","mongodb://127.0.0.1/Vocen_DWD.invla?readPreference=secondaryPreferred")
        .config("spark.mongodb.output.uri","mongodb://127.0.0.1/Vocen_DWD.test")
        .getOrCreate()
    val df1=MongoSpark.load(sparkSession)//加载数据
    df1.createOrReplaceTempView("invla")//缓存到临时表invla
    //用sql查询mongodb表中数据,并显示前5条
    sparkSession.sql("SELECT COMPANY,LA001,LA004,LA009,count FROM invla WHERE COMPANY='PTYW' AND count>3").show(5)
    sparkSession.stop()
}
  • 在终端运行命令: sbt run,运行结果如下图

  

   Mongodb与Spark擦出的真是爱的火花,即克服了Mongodb不能直接用SQL,又避免了HDFS不能用复杂索引的不足。

 

  有人好奇那句SQL语句到了mongodb里怎么执行的吗?一起看下mongodb的执行记录:

//前两行是网络连接信息
{"t":{"$date":"2021-10-13T17:22:52.925+08:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:49698","connectionId":8,"connectionCount":6}} {"t":{"$date":"2021-10-13T17:22:52.925+08:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn8","msg":"client metadata","attr":{"remote":"127.0.0.1:49698","client":"conn8","doc":{"driver":{"name":"mongo-java-driver|sync|mongo-spark","version":"4.0.5|3.0.1"},"os":{"type":"Windows","name":"Windows 10","architecture":"amd64","version":"10.0"},"platform":"Java/Oracle Corporation/1.8.0_291-b10|Scala/2.12.11:Spark/3.1.2"}}} //随机取样100行数据,以确定数据结构
{
"t":{"$date":"2021-10-13T17:22:53.800+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn8","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"aggregate":"invla","pipeline":[{"$sample":{"size":1000}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"MULTI_ITERATOR","cursorid":7176739098596969558,"keysExamined":0,"docsExamined":0,"numYields":19,"nreturned":101,"reslen":14611,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":21}},"Global":{"acquireCount":{"r":21}},"Database":{"acquireCount":{"r":21}},"Collection":{"acquireCount":{"r":21}},"Mutex":{"acquireCount":{"r":2}}},"storage":{"data":{"bytesRead":1543572,"timeReadingMicros":110058}},"protocol":"op_msg","durationMillis":137}} //执行了sql中的where和select获取数据,把sql语句转换成bson,返回的是全部数据,显示前5条的命令是在spark中执行
{
"t":{"$date":"2021-10-13T17:22:57.356+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn8","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"aggregate":"invla","pipeline":[{"$match":{"$and":[{"COMPANY":"PTYW"},{"count":{"$gt":3}}]}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { COMPANY: 1 }","keysExamined":910813,"docsExamined":910813,"cursorExhausted":true,"numYields":910,"nreturned":1,"queryHash":"FB8A4A00","planCacheKey":"E11D0BC7","reslen":128,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":912}},"Global":{"acquireCount":{"r":912}},"Database":{"acquireCount":{"r":912}},"Collection":{"acquireCount":{"r":912}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"protocol":"op_msg","durationMillis":1125}}

 

  • 如果我们在sql中用了聚合语句,聚合操作会不会在Mongodb中执行?来一起试验下看看:
  • 将Main.scala文件代码替换成下面的,我将sql改造成group by语句,仍然显示前5,然后sbt run走起:
  • package sparkmongo
    
    //MongodbSaprk-Connector
    import com.mongodb.spark._
    import org.apache.spark.sql.SparkSession
    
    object main extends App{
        //在spark创建时配置mongodb连接信息
        //手打的同学要注意别把uri打成url,很容易混淆
        val sparkSession=SparkSession.builder()
            .master("local")
            .appName("MongodbSpark")
            .config("spark.mongodb.input.uri","mongodb://127.0.0.1/Vocen_DWD.invla?readPreference=secondaryPreferred")
            .config("spark.mongodb.output.uri","mongodb://127.0.0.1/Vocen_DWD.test")
            .getOrCreate()
        val df1=MongoSpark.load(sparkSession)//加载数据
        df1.createOrReplaceTempView("invla")//缓存到临时表invla
        //用sql查询mongodb表中数据,并显示前5条
        sparkSession.sql("SELECT COMPANY,LA001,LA009,COUNT(LA001),SUM(count) sum FROM invla WHERE COMPANY='PTYW' AND count>3 " +
          "GROUP BY COMPANY,LA001,LA009").show(5)
        sparkSession.stop()
    }

     

    //前两行与上一个的相同都是网络连接记录
    {"t":{"$date":"2021-10-13T18:21:20.706+08:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:49424","connectionId":14,"connectionCount":6}}
    {"t":{"$date":"2021-10-13T18:21:20.707+08:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn14","msg":"client metadata","attr":{"remote":"127.0.0.1:49424","client":"conn14","doc":{"driver":{"name":"mongo-java-driver|sync|mongo-spark","version":"4.0.5|3.0.1"},"os":{"type":"Windows","name":"Windows 10","architecture":"amd64","version":"10.0"},"platform":"Java/Oracle Corporation/1.8.0_291-b10|Scala/2.12.11:Spark/3.1.2"}}}
    //获取where过滤后的总条数
    {
    "t":{"$date":"2021-10-13T18:21:25.328+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn14","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"aggregate":"invla","pipeline":[{"$match":{"$and":[{"COMPANY":"PTYW"},{"count":{"$gt":3}}]}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { COMPANY: 1 }","keysExamined":910813,"docsExamined":910813,"cursorExhausted":true,"numYields":910,"nreturned":1,"queryHash":"FB8A4A00","planCacheKey":"E11D0BC7","reslen":128,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":912}},"Global":{"acquireCount":{"r":912}},"Database":{"acquireCount":{"r":912}},"Collection":{"acquireCount":{"r":912}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"protocol":"op_msg","durationMillis":1101}} //返回where过滤后,select选取的字段,可以明确的一点是,group by没在mongodb中执行,而在spark中执行
    {
    "t":{"$date":"2021-10-13T18:21:27.143+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn14","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"getMore":5143393766922072670,"collection":"invla","$db":"Vocen_DWD"},"originatingCommand":{"aggregate":"invla","pipeline":[{"$match":{"$and":[{"COMPANY":{"$exists":true,"$ne":null}},{"count":{"$exists":true,"$ne":null}},{"COMPANY":"PTYW"},{"count":{"$gt":3}}]}},{"$project":{"COMPANY":1,"LA001":1,"LA009":1,"count":1,"_id":0}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { COMPANY: 1 }","cursorid":5143393766922072670,"keysExamined":907750,"docsExamined":907750,"cursorExhausted":true,"numYields":907,"nreturned":18957,"reslen":1538871,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":908}},"Global":{"acquireCount":{"r":908}},"Database":{"acquireCount":{"r":908}},"Collection":{"acquireCount":{"r":908}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1547}}

  再看个表关联的例子,Main.scala代码如下:

package sparkmongo

//MongodbSaprk-Connector
import com.mongodb.spark._
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.config.ReadConfig

object main extends App{
    //在spark创建时配置mongodb连接信息
    //手打的同学要注意别把uri打成url,很容易混淆
    val sparkSession=SparkSession.builder()
        .master("local")
        .appName("MongodbSpark")
        .config("spark.mongodb.input.uri","mongodb://127.0.0.1/Vocen_DWD.invla?readPreference=secondaryPreferred")
        .config("spark.mongodb.output.uri","mongodb://127.0.0.1/Vocen_DWD.test")
        .getOrCreate()
    //加载数据
    val df1=MongoSpark.load(sparkSession)
    val df2=MongoSpark.load(sparkSession,ReadConfig(Map("collection"->"cmsmc"),Some(ReadConfig(sparkSession))))
  //缓存到临时表 df1.createOrReplaceTempView(
"invla") df2.createOrReplaceTempView("cmsmc")   //表invla与cmsmc关联并聚合 val mySql="SELECT la.COMPANY,la.LA001,la.LA009,mc.storageName,COUNT(la.LA001),SUM(la.count) sum " + "FROM invla la " + "LEFT JOIN cmsmc mc ON mc.COMPANY=la.COMPANY AND mc.LA009=la.LA009 " + "WHERE la.COMPANY='PTYW' AND la.count>0 " + "GROUP BY la.COMPANY,la.LA001,la.LA009,mc.storageName" //用sql查询mongodb表中数据,并显示前5条 sparkSession.sql(mySql).show(5) sparkSession.stop() }

  

#mongodb操作日志
//获取表invla符合where条件的数据总数 {"t":{"$date":"2021-10-14T09:16:11.492+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"aggregate":"invla","pipeline":[{"$match":{"$and":[{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { COMPANY: 1 }","keysExamined":910813,"docsExamined":910813,"cursorExhausted":true,"numYields":910,"nreturned":1,"queryHash":"FB8A4A00","planCacheKey":"E11D0BC7","reslen":128,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":912}},"Global":{"acquireCount":{"r":912}},"Database":{"acquireCount":{"r":912}},"Collection":{"acquireCount":{"r":912}},"Mutex":{"acquireCount":{"r":2}}},"storage":{},"protocol":"op_msg","durationMillis":1260}}
//随机采样符合where条件的数据18条 {
"t":{"$date":"2021-10-14T09:16:13.784+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"aggregate":"invla","pipeline":[{"$match":{"$and":[{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$sample":{"size":18}},{"$project":{"_id":1}},{"$sort":{"_id":1}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { COMPANY: 1 }","keysExamined":910813,"docsExamined":910813,"hasSortStage":true,"cursorExhausted":true,"numYields":922,"nreturned":18,"queryHash":"3C43B2DC","planCacheKey":"8AE2943E","reslen":562,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":949}},"Global":{"acquireCount":{"r":949}},"Database":{"acquireCount":{"r":949}},"Collection":{"acquireCount":{"r":949}},"Mutex":{"acquireCount":{"r":27}}},"storage":{},"protocol":"op_msg","durationMillis":2288}} //下面这几个条日志做的是同一件事,即按ID分段拉取数据,这是优化慢查询的一种方法
{
"t":{"$date":"2021-10-14T09:16:16.356+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"getMore":9005002959326612078,"collection":"invla","$db":"Vocen_DWD"},"originatingCommand":{"aggregate":"invla","pipeline":[{"$match":{"_id":{"$lt":{"$oid":"60ed58ba02f942f851230224"}}}},{"$match":{"$and":[{"COMPANY":{"$exists":true,"$ne":null}},{"count":{"$exists":true,"$ne":null}},{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$project":{"COMPANY":1,"LA001":1,"LA009":1,"count":1,"_id":0}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { COMPANY: 1 }","cursorid":9005002959326612078,"keysExamined":910712,"docsExamined":910712,"fromMultiPlanner":true,"cursorExhausted":true,"numYields":910,"nreturned":86663,"reslen":7095359,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":911}},"Global":{"acquireCount":{"r":911}},"Database":{"acquireCount":{"r":911}},"Collection":{"acquireCount":{"r":911}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1658}} {"t":{"$date":"2021-10-14T09:16:18.384+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"getMore":2609909263370750001,"collection":"invla","$db":"Vocen_DWD"},"originatingCommand":{"aggregate":"invla","pipeline":[{"$match":{"_id":{"$gte":{"$oid":"60ed58ba02f942f851230224"},"$lt":{"$oid":"610bc37035dab7f3973d9a2d"}}}},{"$match":{"$and":[{"COMPANY":{"$exists":true,"$ne":null}},{"count":{"$exists":true,"$ne":null}},{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$project":{"COMPANY":1,"LA001":1,"LA009":1,"count":1,"_id":0}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { _id: 1 }","cursorid":2609909263370750001,"keysExamined":588534,"docsExamined":588534,"fromMultiPlanner":true,"numYields":588,"nreturned":205254,"reslen":16777207,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":589}},"Global":{"acquireCount":{"r":589}},"Database":{"acquireCount":{"r":589}},"Collection":{"acquireCount":{"r":589}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":4315895,"timeReadingMicros":49043}},"protocol":"op_msg","durationMillis":1315}} {"t":{"$date":"2021-10-14T09:16:20.250+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"getMore":2609909263370750001,"collection":"invla","$db":"Vocen_DWD"},"originatingCommand":{"aggregate":"invla","pipeline":[{"$match":{"_id":{"$gte":{"$oid":"60ed58ba02f942f851230224"},"$lt":{"$oid":"610bc37035dab7f3973d9a2d"}}}},{"$match":{"$and":[{"COMPANY":{"$exists":true,"$ne":null}},{"count":{"$exists":true,"$ne":null}},{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$project":{"COMPANY":1,"LA001":1,"LA009":1,"count":1,"_id":0}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { _id: 1 }","cursorid":2609909263370750001,"keysExamined":409333,"docsExamined":409333,"fromMultiPlanner":true,"numYields":409,"nreturned":203865,"reslen":16777256,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":410}},"Global":{"acquireCount":{"r":410}},"Database":{"acquireCount":{"r":410}},"Collection":{"acquireCount":{"r":410}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":958}} {"t":{"$date":"2021-10-14T09:16:22.745+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"getMore":1612716104933424645,"collection":"invla","$db":"Vocen_DWD"},"originatingCommand":{"aggregate":"invla","pipeline":[{"$match":{"_id":{"$gte":{"$oid":"610bc37035dab7f3973d9a2d"}}}},{"$match":{"$and":[{"COMPANY":{"$exists":true,"$ne":null}},{"count":{"$exists":true,"$ne":null}},{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$project":{"COMPANY":1,"LA001":1,"LA009":1,"count":1,"_id":0}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { _id: 1 }","cursorid":1612716104933424645,"keysExamined":560354,"docsExamined":560354,"fromMultiPlanner":true,"numYields":560,"nreturned":203473,"reslen":16777252,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":561}},"Global":{"acquireCount":{"r":561}},"Database":{"acquireCount":{"r":561}},"Collection":{"acquireCount":{"r":561}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1313}} {"t":{"$date":"2021-10-14T09:16:24.658+08:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"Vocen_DWD.invla","command":{"getMore":1612716104933424645,"collection":"invla","$db":"Vocen_DWD"},"originatingCommand":{"aggregate":"invla","pipeline":[{"$match":{"_id":{"$gte":{"$oid":"610bc37035dab7f3973d9a2d"}}}},{"$match":{"$and":[{"COMPANY":{"$exists":true,"$ne":null}},{"count":{"$exists":true,"$ne":null}},{"COMPANY":"PTYW"},{"count":{"$gt":0}}]}},{"$project":{"COMPANY":1,"LA001":1,"LA009":1,"count":1,"_id":0}}],"cursor":{},"allowDiskUse":true,"$db":"Vocen_DWD","$readPreference":{"mode":"secondaryPreferred"}},"planSummary":"IXSCAN { _id: 1 }","cursorid":1612716104933424645,"keysExamined":463309,"docsExamined":463309,"fromMultiPlanner":true,"numYields":463,"nreturned":203473,"reslen":16777252,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":464}},"Global":{"acquireCount":{"r":464}},"Database":{"acquireCount":{"r":464}},"Collection":{"acquireCount":{"r":464}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1083}}

  综合以上三个例子,我们可以得出关于mongodb-spark-connector的以下结论

  • 只用mongodb索引做条件过滤和字段缩减
  • 复杂计算任务都在spark中执行
  • 多表关联在spark中执行

    最后做个总结,之所以敢说这是最简大数据开发环境,是因为本篇只用了sbt和VSCode,避免了繁杂的大数据组件搭建和配置工作。新猿可以用这个方案快速上手体验大数据开发;老猿也可以用这个方案做开发测试,省却版本维护之类的繁琐。

  如果看的不尽兴,或者想吐糟,欢迎Email拍砖983353083@qq.com

posted @ 2021-10-14 15:55  MissBug  阅读(363)  评论(0编辑  收藏  举报