博客园 首页 私信博主 显示目录 隐藏目录 管理 动画

Sqoop安装

2.3.1 下载并解压

1)下载地址:http://mirrors.hust.edu.cn/apache/sqoop/1.4.6/

2)上传安装包sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gzhadoop102的/opt/software路径中

3)解压sqoop安装包到指定目录,如:

[atguigu@hadoop102 software]$ tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/

4)解压sqoop安装包到指定目录,如:

[atguigu@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop

2.3.2 修改配置文件

1) 进入到/opt/module/sqoop/conf目录,重命名配置文件

[atguigu@hadoop102 conf]$ mv sqoop-env-template.sh sqoop-env.sh

2) 修改配置文件

[atguigu@hadoop102 conf]$ vim sqoop-env.sh

增加如下内容

export HADOOP_COMMON_HOME=/opt/module/hadoop-3.1.3

export HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3

export HIVE_HOME=/opt/module/hive

export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7

export ZOOCFGDIR=/opt/module/zookeeper-3.5.7/conf

2.3.3 拷贝JDBC驱动

1)将mysql-connector-java-5.1.48.jar 上传到/opt/software路径

2)进入到/opt/software/路径,拷贝jdbc驱动到sqooplib目录下。

[atguigu@hadoop102 software]$ cp mysql-connector-java-5.1.48.jar /opt/module/sqoop/lib/

2.3.4 验证Sqoop

我们可以通过某一个command来验证sqoop配置是否正确:

[atguigu@hadoop102 sqoop]$ bin/sqoop help

出现一些Warning警告(警告信息已省略),并伴随着帮助命令的输出:

Available commands:

  codegen            Generate code to interact with database records

  create-hive-table     Import a table definition into Hive

  eval               Evaluate a SQL statement and display the results

  export             Export an HDFS directory to a database table

  help               List available commands

  import             Import a table from a database to HDFS

  import-all-tables     Import tables from a database to HDFS

  import-mainframe    Import datasets from a mainframe server to HDFS

  job                Work with saved jobs

  list-databases        List available databases on a server

  list-tables           List available tables in a database

  merge              Merge results of incremental imports

  metastore           Run a standalone Sqoop metastore

  version            Display version information

2.3.5 测试Sqoop是否能够成功连接数据库

[atguigu@hadoop102 sqoop]$ bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 000000

出现如下输出:

information_schema

metastore

mysql

oozie

performance_schema

2.4 同步策略

数据同步策略的类型包括:全量同步、增量同步、新增及变化同步、特殊情况

  • 全量表:存储完整的数据。
  • 增量表:存储新增加的数据
  • 新增及变化表:存储新增加的数据和变化的数据。
  • 特殊表:只需要存储一次

2.4.1 全量同步策略

 

2.4.2 增量同步策略

 

2.4.3 新增及变化策略

 

2.4.4 特殊策略

某些特殊的维度表,可不必遵循上述同步策略。

1)客观世界维度

没变化的客观世界的维度(比如性别,地区,民族,政治成分,鞋子尺码)可以只存一份固定值。

2)日期维度

日期维度可以一次性导入一年或若干年的数据。

2.5 业务数据导入HDFS

2.5.1 分析表同步策略

在生产环境,个别小公司,为了简单处理,所有表全量导入。

中大型公司,由于数据量比较大,还是严格按照同步策略导入数据。

 

 

 

2.5.2 脚本编写

1)在/home/atguigu/bin目录下创建

[atguigu@hadoop102 bin]$ vim mysql_to_hdfs.sh

添加如下内容:

#! /bin/bash

 

APP=gmall

sqoop=/opt/module/sqoop/bin/sqoop

 

if [ -n "$2" ] ;then

    do_date=$2

else

    do_date=`date -d '-1 day' +%F`

fi

 

import_data(){

$sqoop import \

--connect jdbc:mysql://hadoop102:3306/$APP \

--username root \

--password 000000 \

--target-dir /origin_data/$APP/db/$1/$do_date \

--delete-target-dir \

--query "$2 and  \$CONDITIONS" \

--num-mappers 1 \

--fields-terminated-by '\t' \

--compress \

--compression-codec lzop \

--null-string '\\N' \

--null-non-string '\\N'

 

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/$APP/db/$1/$do_date

}

 

import_order_info(){

  import_data order_info "select

                            id,

                            final_total_amount,

                            order_status,

                            user_id,

                            out_trade_no,

                            create_time,

                            operate_time,

                            province_id,

                            benefit_reduce_amount,

                            original_total_amount,

                            feight_fee      

                        from order_info

                        where (date_format(create_time,'%Y-%m-%d')='$do_date'

                        or date_format(operate_time,'%Y-%m-%d')='$do_date')"

}

 

import_coupon_use(){

  import_data coupon_use "select

                          id,

                          coupon_id,

                          user_id,

                          order_id,

                          coupon_status,

                          get_time,

                          using_time,

                          used_time

                        from coupon_use

                        where (date_format(get_time,'%Y-%m-%d')='$do_date'

                        or date_format(using_time,'%Y-%m-%d')='$do_date'

                        or date_format(used_time,'%Y-%m-%d')='$do_date')"

}

 

import_order_status_log(){

  import_data order_status_log "select

                                  id,

                                  order_id,

                                  order_status,

                                  operate_time

                                from order_status_log

                                where date_format(operate_time,'%Y-%m-%d')='$do_date'"

}

 

import_activity_order(){

  import_data activity_order "select

                                id,

                                activity_id,

                                order_id,

                                create_time

                              from activity_order

                              where date_format(create_time,'%Y-%m-%d')='$do_date'"

}

 

import_user_info(){

  import_data "user_info" "select

                            id,

                            name,

                            birthday,

                            gender,

                            email,

                            user_level,

                            create_time,

                            operate_time

                          from user_info

                          where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'

                          or DATE_FORMAT(operate_time,'%Y-%m-%d')='$do_date')"

}

 

import_order_detail(){

  import_data order_detail "select

                              od.id,

                              order_id,

                              user_id,

                              sku_id,

                              sku_name,

                              order_price,

                              sku_num,

                              od.create_time,

                              source_type,

                              source_id  

                            from order_detail od

                            join order_info oi

                            on od.order_id=oi.id

                            where DATE_FORMAT(od.create_time,'%Y-%m-%d')='$do_date'"

}

 

import_payment_info(){

  import_data "payment_info"  "select

                                id,  

                                out_trade_no,

                                order_id,

                                user_id,

                                alipay_trade_no,

                                total_amount,  

                                subject,

                                payment_type,

                                payment_time

                              from payment_info

                              where DATE_FORMAT(payment_time,'%Y-%m-%d')='$do_date'"

}

 

import_comment_info(){

  import_data comment_info "select

                              id,

                              user_id,

                              sku_id,

                              spu_id,

                              order_id,

                              appraise,

                              comment_txt,

                              create_time

                            from comment_info

                            where date_format(create_time,'%Y-%m-%d')='$do_date'"

}

 

import_order_refund_info(){

  import_data order_refund_info "select

                                id,

                                user_id,

                                order_id,

                                sku_id,

                                refund_type,

                                refund_num,

                                refund_amount,

                                refund_reason_type,

                                create_time

                              from order_refund_info

                              where date_format(create_time,'%Y-%m-%d')='$do_date'"

}

 

import_sku_info(){

  import_data sku_info "select

                          id,

                          spu_id,

                          price,

                          sku_name,

                          sku_desc,

                          weight,

                          tm_id,

                          category3_id,

                          create_time

                        from sku_info where 1=1"

}

 

import_base_category1(){

  import_data "base_category1" "select

                                  id,

                                  name

                                from base_category1 where 1=1"

}

 

import_base_category2(){

  import_data "base_category2" "select

                                  id,

                                  name,

                                  category1_id

                                from base_category2 where 1=1"

}

 

import_base_category3(){

  import_data "base_category3" "select

                                  id,

                                  name,

                                  category2_id

                                from base_category3 where 1=1"

}

 

import_base_province(){

  import_data base_province "select

                              id,

                              name,

                              region_id,

                              area_code,

                              iso_code

                            from base_province

                            where 1=1"

}

 

import_base_region(){

  import_data base_region "select

                              id,

                              region_name

                            from base_region

                            where 1=1"

}

 

import_base_trademark(){

  import_data base_trademark "select

                                tm_id,

                                tm_name

                              from base_trademark

                              where 1=1"

}

 

import_spu_info(){

  import_data spu_info "select

                            id,

                            spu_name,

                            category3_id,

                            tm_id

                          from spu_info

                          where 1=1"

}

 

import_favor_info(){

  import_data favor_info "select

                          id,

                          user_id,

                          sku_id,

                          spu_id,

                          is_cancel,

                          create_time,

                          cancel_time

                        from favor_info

                        where 1=1"

}

 

import_cart_info(){

  import_data cart_info "select

                        id,

                        user_id,

                        sku_id,

                        cart_price,

                        sku_num,

                        sku_name,

                        create_time,

                        operate_time,

                        is_ordered,

                        order_time,

                        source_type,

                        source_id

                      from cart_info

                      where 1=1"

}

 

import_coupon_info(){

  import_data coupon_info "select

                          id,

                          coupon_name,

                          coupon_type,

                          condition_amount,

                          condition_num,

                          activity_id,

                          benefit_amount,

                          benefit_discount,

                          create_time,

                          range_type,

                          spu_id,

                          tm_id,

                          category3_id,

                          limit_num,

                          operate_time,

                          expire_time

                        from coupon_info

                        where 1=1"

}

 

import_activity_info(){

  import_data activity_info "select

                              id,

                              activity_name,

                              activity_type,

                              start_time,

                              end_time,

                              create_time

                            from activity_info

                            where 1=1"

}

 

import_activity_rule(){

    import_data activity_rule "select

                                    id,

                                    activity_id,

                                    condition_amount,

                                    condition_num,

                                    benefit_amount,

                                    benefit_discount,

                                    benefit_level

                                from activity_rule

                                where 1=1"

}

 

import_base_dic(){

    import_data base_dic "select

                            dic_code,

                            dic_name,

                            parent_code,

                            create_time,

                            operate_time

                          from base_dic

                          where 1=1"

}

 

case $1 in

  "order_info")

     import_order_info

;;

  "base_category1")

     import_base_category1

;;

  "base_category2")

     import_base_category2

;;

  "base_category3")

     import_base_category3

;;

  "order_detail")

     import_order_detail

;;

  "sku_info")

     import_sku_info

;;

  "user_info")

     import_user_info

;;

  "payment_info")

     import_payment_info

;;

  "base_province")

     import_base_province

;;

  "base_region")

     import_base_region

;;

  "base_trademark")

     import_base_trademark

;;

  "activity_info")

      import_activity_info

;;

  "activity_order")

      import_activity_order

;;

  "cart_info")

      import_cart_info

;;

  "comment_info")

      import_comment_info

;;

  "coupon_info")

      import_coupon_info

;;

  "coupon_use")

      import_coupon_use

;;

  "favor_info")

      import_favor_info

;;

  "order_refund_info")

      import_order_refund_info

;;

  "order_status_log")

      import_order_status_log

;;

  "spu_info")

      import_spu_info

;;

  "activity_rule")

      import_activity_rule

;;

  "base_dic")

      import_base_dic

;;

 

"first")

   import_base_category1

   import_base_category2

   import_base_category3

   import_order_info

   import_order_detail

   import_sku_info

   import_user_info

   import_payment_info

   import_base_province

   import_base_region

   import_base_trademark

   import_activity_info

   import_activity_order

   import_cart_info

   import_comment_info

   import_coupon_use

   import_coupon_info

   import_favor_info

   import_order_refund_info

   import_order_status_log

   import_spu_info

   import_activity_rule

   import_base_dic

;;

"all")

   import_base_category1

   import_base_category2

   import_base_category3

   import_order_info

   import_order_detail

   import_sku_info

   import_user_info

   import_payment_info

   import_base_trademark

   import_activity_info

   import_activity_order

   import_cart_info

   import_comment_info

   import_coupon_use

   import_coupon_info

   import_favor_info

   import_order_refund_info

   import_order_status_log

   import_spu_info

   import_activity_rule

   import_base_dic

;;

esac

说明1

[ -n 变量 ] 判断变量的值,是否为空

-变量的值,非空,返回true

-变量的值,为空,返回false

说明2

查看date命令的使用,[atguigu@hadoop102 ~]$ date --help

2)修改脚本权限

[atguigu@hadoop102 bin]$ chmod 777 mysql_to_hdfs.sh

3)初次导入

[atguigu@hadoop102 bin]$ mysql_to_hdfs.sh first 2020-06-14

4)每日导入

[atguigu@hadoop102 bin]$ mysql_to_hdfs.sh all 2020-06-15

2.5.3 项目经验

Hive中的Null在底层是以“\N”来存储,而MySQL中的Null在底层就是Null,为了保证数据两端的一致性。在导出数据时采用--input-null-string--input-null-non-string两个参数。导入数据时采用--null-string--null-non-string

posted @ 2022-05-31 19:29  CHANG_09  阅读(275)  评论(0编辑  收藏  举报