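Using Kettle

Kettle: A Hands-On Introduction

I. Kettle Overview

1. What is Kettle

Kettle is an open-source ETL tool written entirely in Java. It runs on Windows, Linux, and Unix, requires no installation (just extract and run), and extracts data efficiently and reliably.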

2. How Kettle projects are stored

(1) As XML files

(2) In a repository (database repository or file repository)

3. Kettle's two kinds of design

4. Kettle's components

5. Kettle's features

II. Installing, Deploying, and Using Kettle

1. Where to get Kettle (official site)

https://community.hitachivantara.com/docs/DOC-1009855

Download: https://sourceforge.net/projects/pentaho/files/Data%20Integration/

Course materials:

Link: https://pan.baidu.com/s/149fBww3eiD7vLN2p2egCxg

Extraction code: gyhr

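2. Installing and using Kettle on Windows

(1) Overview

In real enterprise development, Kettle jobs and transformations are normally developed in a local environment. They can be run locally or on a remote machine.

(2) Installation steps

Install a JDK.

Download the Kettle archive. Kettle needs no installer, so extract it to any local path.

Double-click Spoon.bat to start the graphical tool; it is ready to use right away.

3. Case 1

Case 1 syncs the rows of stu1 into stu2 by id; if stu2 already has a row with the same id, that row is updated.

(1) Create two tables in MySQL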

mysql> create database kettle;
Query OK, 1 row affected (0.00 sec)

mysql> use kettle;
Database changed

mysql>  create table stu1(id int,name varchar(20),age int);
Query OK, 0 rows affected (0.01 sec)

mysql> create table stu2(id int,name varchar(20));
Query OK, 0 rows affected (0.00 sec)

(2) Insert some rows into both tables

mysql> insert into stu1 values(1001,'zhangsan',20),(1002,'lisi',18), (1003,'wangwu',23);
Query OK, 3 rows affected (0.05 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> insert into stu2 values(1001,'wukong');
Query OK, 1 row affected (0.00 sec)

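(3) Copy pdi-ce-8.2.0.0-342.zip to a directory of your choice on the Windows machine and extract it

In Kettle, create a new transformation, choose Input ---> Table input, and double-click the Table input step.

In the database connection field, click New.

The error shown at this point means that mysql-connector-java-5.1.27-bin.jar is missing.

Fix:

Add mysql-connector-java-5.1.27-bin.jar to the data-integration\lib folder,

then restart Spoon and repeat the steps.

This shows that reading stu1 as the input works. Next, the rows read from stu1 need to be synced into stu2 as the output.

Note: the line you drag between two steps must be dark gray for the link to be active; a light gray line means the link failed.

Save the transformation before running it.

After the run, check the stu2 data in MySQL. Note: in this run the update setting of every field in the transformation was set to N, so the existing row keeps its old value.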

mysql> select * from  stu2;
+------+--------+
| id   | name   |
+------+--------+
| 1001 | wukong |
| 1002 | lisi   |
| 1003 | wangwu |
+------+--------+
3 rows in set (0.00 sec)

If that setting is changed so that updates are applied:

// the query result is now different
mysql> select * from  stu2;
+------+----------+
| id   | name     |
+------+----------+
| 1001 | zhangsan |
| 1002 | lisi     |
| 1003 | wangwu   |
+------+----------+
3 rows in set (0.00 sec)
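The difference between the two query results above can be reproduced outside Kettle. The following is only a minimal Python sketch of insert-or-update behavior with a per-field update flag; it assumes the output side of the transformation is an insert/update style step, and the function name `insert_update` and its parameters are made up for illustration, not part of Kettle.

```python
def insert_update(target, source, key, update_fields):
    """Mimic an insert-or-update sync keyed on `key` (illustrative only).

    target, source: lists of dict rows.
    update_fields: maps a target column to True (overwrite on a key match)
    or False (leave the existing value alone). Only these columns, plus the
    key, are written to the target.
    """
    by_key = {row[key]: row for row in target}
    for src in source:
        existing = by_key.get(src[key])
        if existing is None:
            # No row with this key yet: insert the mapped columns.
            new_row = {key: src[key]}
            new_row.update({col: src[col] for col in update_fields})
            target.append(new_row)
            by_key[src[key]] = new_row
        else:
            # Key already present: overwrite only columns flagged for update.
            for col, do_update in update_fields.items():
                if do_update:
                    existing[col] = src[col]
    return target


# Same rows as the MySQL tables in this case.
stu1 = [
    {"id": 1001, "name": "zhangsan", "age": 20},
    {"id": 1002, "name": "lisi", "age": 18},
    {"id": 1003, "name": "wangwu", "age": 23},
]


def fresh_stu2():
    return [{"id": 1001, "name": "wukong"}]


no_update = insert_update(fresh_stu2(), stu1, "id", {"name": False})
with_update = insert_update(fresh_stu2(), stu1, "id", {"name": True})

print([row["name"] for row in no_update])    # update flag N
print([row["name"] for row in with_update])  # update flag Y
```

With the flag off, id 1001 keeps wukong and only the two missing ids are inserted; with the flag on, the name for id 1001 is overwritten with zhangsan.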

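4. Case 2: use a job to run the transformation above and also add one extra row to table stu2

(1) Create a new job

(2) Drag in the components as shown in the figure

(3) Double-click Start to edit it

(4) Double-click the transformation entry and select the file saved in Case 1

(5) Double-click the SQL entry and edit the SQL statement. First, insert one row into the kettle database in MySQL: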

mysql> insert into stu1 values(1004,'stu1',22);
Query OK, 1 row affected (0.01 sec)

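Then add a Dummy entry, as shown in the figure:

Then save the job; otherwise the changes do not take effect.

Now the job can be run.

Check MySQL again; the new rows are there: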

mysql> select * from stu2;
+------+----------+
| id   | name     |
+------+----------+
| 1001 | zhangsan |
| 1002 | lisi     |
| 1003 | wangwu   |
| 1004 | stu1     |
| 1005 | kettle   |
+------+----------+
5 rows in set (0.00 sec)

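5. Case 3: write data from Hive tables to HDFS

(1) Because this case reads from and writes to Hive and HBase, the relevant configuration files must be changed first.

In the extracted directory, edit plugin.properties under data-integration\plugins\pentaho-big-data-plugin and set active.hadoop.configuration=hdp26, then copy the cluster's configuration files into data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp26

(2) Start all HDFS, YARN, and HBase cluster processes, then start the hiveserver2 service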

[root@hadoop1 ~]# /opt/module/hadoop-2.7.2/sbin/start-all.sh
# Start ZooKeeper before starting HBase
[root@hadoop1 ~]# /opt/module/hbase-1.3.1/bin/start-hbase.sh
[root@hadoop1 ~]# /opt/module/hive/bin/hiveserver2

(3) Enter beeline and check that port 10000 is open

[root@hadoop1 ~]# /opt/module/hive/bin/beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://hadoop1.x:10000 (press Enter)
Connecting to jdbc:hive2://hadoop1.x:10000
Enter username for jdbc:hive2://hadoop1.x:10000: root (type root)
Enter password for jdbc:hive2://hadoop1.x:10000: (just press Enter)
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop1.x:10000> (reaching this prompt means port 10000 is open)
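If you only want to know whether anything is listening on port 10000 before opening Spoon, a plain TCP connect is enough. The sketch below is a hypothetical helper (`port_open` is not part of Hive or Kettle). It only proves that the port accepts connections, not that HiveServer2 is healthy, so the beeline login above remains the real check.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example call for the host and port used in the beeline session:
# port_open("hadoop1.x", 10000)
```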

(4) Create two tables, dept and emp

CREATE TABLE dept(deptno int, dname string,loc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm int,
deptno int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

(5) Insert data

insert into dept values(10,'accounting','NEW YORK'),(20,'RESEARCH','DALLAS'),(30,'SALES','CHICAGO'),(40,'OPERATIONS','BOSTON');
insert into emp values(7369,'SMITH','CLERK',7902,'1980-12-17',800,NULL,20),(7499,'ALLEN','SALESMAN',7698,'1980-12-17',1600,300,30),(7521,'WARD','SALESMAN',7698,'1980-12-17',1250,500,30),(7566,'JONES','MANAGER',7839,'1980-12-17',2975,NULL,20);

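(6) Build the flow as shown in the figure below

(7) Configure the table input and connect to Hive

(8) Configure the sort settings

(9) Configure the join settings

(10) Configure the field selection

(11) Configure the file output

(12) Save, run, and check the result on HDFS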
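To see what steps (7) to (11) do to the data, here is a rough Python sketch of the same pipeline on the rows inserted above: sort both inputs, join them, pick a few fields, and produce tab-separated lines. Several things here are assumptions, since the original screenshots are not available: the join key (deptno), the join type (inner), and the selected fields (empno, ename, sal, dname). The helper `merge_join` is illustrative and not Kettle code.

```python
from operator import itemgetter


def merge_join(left, right, key):
    """Inner sort-merge join of two lists of dicts.

    Both sides are sorted on `key` first; keys on the right side are
    assumed to be unique (one dept row per deptno).
    """
    left = sorted(left, key=itemgetter(key))
    right = sorted(right, key=itemgetter(key))
    out, j = [], 0
    for row in left:
        # Advance the right-hand cursor until its key catches up.
        while j < len(right) and right[j][key] < row[key]:
            j += 1
        if j < len(right) and right[j][key] == row[key]:
            merged = dict(right[j])
            merged.update(row)
            out.append(merged)
    return out


emp = [
    {"empno": 7369, "ename": "SMITH", "sal": 800, "deptno": 20},
    {"empno": 7499, "ename": "ALLEN", "sal": 1600, "deptno": 30},
    {"empno": 7521, "ename": "WARD", "sal": 1250, "deptno": 30},
    {"empno": 7566, "ename": "JONES", "sal": 2975, "deptno": 20},
]
dept = [
    {"deptno": 10, "dname": "accounting"},
    {"deptno": 20, "dname": "RESEARCH"},
    {"deptno": 30, "dname": "SALES"},
    {"deptno": 40, "dname": "OPERATIONS"},
]

FIELDS = ["empno", "ename", "sal", "dname"]  # assumed field selection

joined = merge_join(emp, dept, "deptno")
lines = ["\t".join(str(row[f]) for f in FIELDS) for row in joined]
print("\n".join(lines))
```

If the join step in the flow is a merge join, it expects both inputs to be sorted on the join key already, which would explain why the sort step comes before the join.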

6. Case 4: read an HDFS file and save the rows with sal greater than 1000 to HBase

(1) Create a table in HBase to hold the data

[root@hadoop1 ~]$ /opt/module/hbase-1.3.1/bin/hbase shell
hbase(main):004:0> create 'people','info'

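(2) Build the flow as shown in the figure below

(3) Configure the file input and connect to HDFS

(4) Configure the filter rows step

(5) Configure the HBase output

Note: if an error reports that there is no permission to write files to HDFS, add the following parameters at line 119 of Spoon.bat (atguigu is the HDFS user name; substitute your own)

"-DHADOOP_USER_NAME=atguigu" "-Dfile.encoding=UTF-8"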
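The filter and the HBase mapping in this case can be pictured with a few lines of Python. This is only a sketch: it reuses the emp rows from Case 3 instead of reading HDFS, and it assumes that empno is used as the row key and that every other column goes into the info column family of the people table. Both function names are hypothetical.

```python
def filter_by_salary(rows, threshold=1000):
    """Keep rows whose sal is strictly greater than threshold."""
    return [r for r in rows if r["sal"] is not None and r["sal"] > threshold]


def to_hbase_cells(row, row_key_field="empno", family="info"):
    """Turn one row into (row_key, {"family:qualifier": value}).

    Values are stored as strings; the row key field itself is not
    repeated as a cell.
    """
    cells = {
        f"{family}:{col}": str(val)
        for col, val in row.items()
        if col != row_key_field and val is not None
    }
    return str(row[row_key_field]), cells


emp = [
    {"empno": 7369, "ename": "SMITH", "sal": 800},
    {"empno": 7499, "ename": "ALLEN", "sal": 1600},
    {"empno": 7521, "ename": "WARD", "sal": 1250},
    {"empno": 7566, "ename": "JONES", "sal": 2975},
]

puts = [to_hbase_cells(r) for r in filter_by_salary(emp)]
for row_key, cells in puts:
    print(row_key, cells)
```

SMITH (sal 800) is dropped by the filter; the other three rows would be written to HBase.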

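III. Creating a Repository

1. Database repository

A database repository stores job and transformation metadata in a database. At run time Kettle reads it directly from the database, which makes cross-platform use easy.

1) Click Connect in the upper-right corner and choose Other Repositories

2) Choose Database Repository

3) Create a new connection

4) After filling in the fields, click Finish. Kettle creates a number of tables in the specified database, and the database repository is ready.

5) Connect to the repository

The default username and password are both admin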

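6) Import an earlier transformation into the repository

(1) Choose to import from an XML file

(2) Pick any transformation

(3) Click Save and choose the storage location and file name

(4) Open the repository to check the saved result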

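2. File repository

A file repository stores job and transformation metadata in a specified directory, which is essentially the same as the XML approach.

It is created in much the same way as a database repository, except that it can be accessed without a username or password. Cross-platform use is more cumbersome.

1) Choose Connect

2) Click Add, then Other Repositories

3) Choose File Repository

4) Fill in the details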

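IV. Installing and Using Kettle on Linux

1. Single machine

1) Install a JDK

2) Upload the installation package to the server and extract it

Note:
1. Copy the MySQL driver into the lib directory.
2. Upload the entire hidden directory C:\Users\<your user name>\.kettle from the local user's home directory to the Linux home directory /home/MrZhou/.

3) Run a transformation from the database repository:

[root@hadoop1 data-integration]$ ./pan.sh -rep=my_repo -user=admin -pass=admin -trans=stu1tostu2 -dir=/

Parameters:

-rep     repository name
-user    repository username
-pass    repository password
-trans   name of the transformation to run
-dir     directory (do not forget the leading /)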

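4) Run a job from the repository:

Remember to change the transformation referenced by the job so that it points to the copy stored in the repository.

[root@hadoop1 data-integration]$ ./kitchen.sh -rep=repo1 -user=admin -pass=admin -job=jobDemo1 -logfile=./logs/log.txt -dir=/

Parameters:

-rep      repository name
-user     repository username
-pass     repository password
-job      job name
-dir      job directory
-logfile  log file path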
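When these commands are launched from a scheduler or another script, it helps to assemble the arguments in one place. The following is a small Python sketch that builds the same pan.sh and kitchen.sh command lines as above; `kettle_command` is a hypothetical helper, and the option values are simply the ones used in this section.

```python
import shlex


def kettle_command(script, options):
    """Build an argument list such as ['./pan.sh', '-rep=my_repo', ...].

    options is a dict of option name -> value; insertion order is kept.
    A dict is used instead of keyword arguments because 'pass' is a
    reserved word in Python.
    """
    return [script] + [f"-{name}={value}" for name, value in options.items()]


pan_cmd = kettle_command("./pan.sh", {
    "rep": "my_repo",
    "user": "admin",
    "pass": "admin",
    "trans": "stu1tostu2",
    "dir": "/",
})

kitchen_cmd = kettle_command("./kitchen.sh", {
    "rep": "repo1",
    "user": "admin",
    "pass": "admin",
    "job": "jobDemo1",
    "logfile": "./logs/log.txt",
    "dir": "/",
})

# shlex.join (Python 3.8+) renders the list as a copy-pasteable shell line.
print(shlex.join(pan_cmd))
print(shlex.join(kitchen_cmd))
```

To actually run one of them, the list can be passed to `subprocess.run(pan_cmd, cwd=<the data-integration directory>)`; passing a list rather than a single string avoids shell quoting problems if a name ever contains spaces.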

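2. Cluster mode (for reference)

1) Prepare three servers. hadoop1.x is the Kettle master server on port 8080; hadoop2.x and hadoop3.x are the two slave servers on ports 8081 and 8082.

2) Install and deploy a JDK

3) Set up a fully distributed Hadoop environment and start its processes (HDFS is needed)

4) Upload and extract the Kettle installation package

5) Go to the /opt/module/data-integration/pwd directory and edit the configuration files

Edit the master server configuration file carte-config-master-8080.xml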

<slaveserver>
    <name>master</name>
    <hostname>hadoop1.x</hostname>
    <port>8080</port>
    <master>Y</master>
    <username>cluster</username>
    <password>cluster</password>
  </slaveserver>

Edit the slave server configuration file carte-config-8081.xml

 <masters>
    <slaveserver>
      <name>master</name>
      <hostname>hadoop1.x</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>
  <report_to_masters>Y</report_to_masters>
  <slaveserver>
    <name>slave1</name>
    <hostname>hadoop2.x</hostname>
    <port>8081</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>

Edit the slave server configuration file carte-config-8082.xml

<masters>
    <slaveserver>
      <name>master</name>
      <hostname>hadoop1.x</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>
  <report_to_masters>Y</report_to_masters>
  <slaveserver>
    <name>slave2</name>
    <hostname>hadoop3.x</hostname>
    <port>8082</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
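The two slave files differ only in the slave's name, host, and port, so they can be generated rather than edited by hand. The Python sketch below builds both with the standard library. It is an illustration, not an official tool: the `slave_config` root element name is an assumption (the fragments above do not show the root), and the master host is taken from step 1) as hadoop1.x.

```python
import xml.etree.ElementTree as ET


def server_element(name, hostname, port, is_master,
                   username="cluster", password="cluster"):
    """Build one <slaveserver> element with the child tags used above."""
    node = ET.Element("slaveserver")
    for tag, text in [
        ("name", name),
        ("hostname", hostname),
        ("port", str(port)),
        ("username", username),
        ("password", password),
        ("master", "Y" if is_master else "N"),
    ]:
        ET.SubElement(node, tag).text = text
    return node


def slave_config(master, slave):
    """Build the config tree for one slave that reports to one master."""
    root = ET.Element("slave_config")  # assumed root element name
    masters = ET.SubElement(root, "masters")
    masters.append(server_element(*master, is_master=True))
    ET.SubElement(root, "report_to_masters").text = "Y"
    root.append(server_element(*slave, is_master=False))
    return root


MASTER = ("master", "hadoop1.x", 8080)
SLAVES = [("slave1", "hadoop2.x", 8081), ("slave2", "hadoop3.x", 8082)]

configs = {}
for slave in SLAVES:
    root = slave_config(MASTER, slave)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    configs[f"carte-config-{slave[2]}.xml"] = ET.tostring(root, encoding="unicode")

print(configs["carte-config-8081.xml"])
```

Each entry of `configs` can then be written into the pwd directory under its file name.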

6) Distribute the whole Kettle installation directory: xsync data-integration

7) Start the processes by running the following on hadoop1.x, hadoop2.x, and hadoop3.x respectively

[root@hadoop1.x data-integration]# ./carte.sh hadoop1.x 8080
[root@hadoop2.x data-integration]# ./carte.sh hadoop2.x 8081
[root@hadoop3.x data-integration]# ./carte.sh hadoop3.x 8082

8) Open the web page

http://hadoop1.x:8080

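3. Case: read the emp table from Hive, sort it by id, and write the result to HDFS

Note: because this case reads from and writes to Hive and HBase, the relevant configuration files must be changed first.

In the extracted directory, edit plugin.properties under data-integration\plugins\pentaho-big-data-plugin and set active.hadoop.configuration=hdp26, then copy the cluster's configuration files into data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp26

(1) Create the transformation, edit the steps, and fill in the relevant settings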