Importing Hive table data into ClickHouse with Waterdrop. Check your Spark version first: if your cluster's Spark is older than 2.3, download the Waterdrop 1.5 release, which bundles its own Spark.
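To confirm which Spark version the submitting machine will use, you can check with spark-submit (assuming it is on the PATH):

# Prints the Spark version along with its Scala/Java build info
spark-submit --version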
Copy config/batch.conf.template to config/batch.conf and edit it (the configuration below is my example; adjust it to your own needs).
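For reference, the copy step might look like this (the /opt/waterdrop install path is an assumption; use your own directory):

# Make a working copy of the shipped template (install path is assumed)
cd /opt/waterdrop
cp config/batch.conf.template config/batch.conf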
######
###### This config file is a demonstration of batch processing in waterdrop config
######

spark {
  # You can set spark configuration here
  # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

input {
  # This is an example input plugin **only for test and demonstrate the feature input plugin**
  hive {
    pre_sql = "select * from terminal.XX"
    result_table_name = "XX"
  }

  # You can also use other input plugins, such as hdfs
  # hdfs {
  #   result_table_name = "accesslog"
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
  #   format = "json"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of input plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

filter {
  # # split data by specific delimiter
  # split {
  #   fields = ["msg", "name"]
  #   delimiter = " "
  #   result_table_name = "accesslog"
  #   remove {
  #     source_field = ["imei1", "imei2"]
  #   }
  # }

  # you can also use other filter plugins, such as sql
  # sql {
  #   sql = "select * from accesslog where request_time > 1000"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of filter plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

output {
  # choose stdout output plugin to output data to console
  # stdout {
  # }

  clickhouse {
    host = "127.0.0.1:8123"
    database = "waterdrop"
    table = "access_log"
    fields = ["XX","day"]
    username = "user_richdm"
    password = "richdm"
  }

  # you can also use other output plugins, such as hdfs
  # hdfs {
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
  #   save_mode = "append"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of output plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
Run the job:
./start-waterdrop.sh --master yarn --deploy-mode client --config ../config/batch.conf
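For a quick smoke test without YARN, the same script can also be launched against a local Spark master (a sketch; local[2] means two local cores and is just an example value):

# Run the batch job locally instead of on YARN, useful for validating batch.conf
./start-waterdrop.sh --master local[2] --deploy-mode client --config ../config/batch.conf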
The ClickHouse database and table must be created in advance; Waterdrop will not create them for you.
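A minimal sketch of preparing that database and table with clickhouse-client; the column types, table engine, and the XX placeholder column are assumptions and must match your Hive schema and the fields list in batch.conf:

# Create the target database and table before running the job (schema is illustrative)
clickhouse-client --host 127.0.0.1 --query "CREATE DATABASE IF NOT EXISTS waterdrop"
clickhouse-client --host 127.0.0.1 --query "
  CREATE TABLE IF NOT EXISTS waterdrop.access_log (
    XX  String,
    day Date
  ) ENGINE = MergeTree()
  ORDER BY day"
# Add --user/--password if your ClickHouse server requires authentication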