A Simple Hive-to-Hive Exchange Model with Kettle

Contents

Preface

1. Running with the 5.1.0.0-752 jars

Adding the dependency

Writing the Hive-to-Hive Java code

Initializing the environment

Writing the exchange code

Writing the Hive connection metadata

2. Creating the tables in Hive

3. Running the code

Problems

Reflections


Preface

This still uses the Kettle environment I set up earlier, and the referenced jars are still the old 5.1.0.0-752 build from the company repository, which is very old by now.

To do a Hive-to-Hive exchange with these jars, the Kettle source has to be modified.

I downloaded the source from the same address as before, this time the 5.1 source: https://github.com/pentaho/pentaho-kettle

As before, the pom files have to be modified before the source compiles cleanly; the changes are the same as for the 8.2 source, see https://blog.csdn.net/lw18751836671/article/details/119600385

Here I changed 9.3.0.0-SNAPSHOT to 9.3.0.0-157. I couldn't figure out why the 8.2 source was versioned 8.2.0.0-342 while this supposedly older source came out as 9.3.0.0-157 (see the postscript at the end).

Only the kettle-core and kettle-engine modules of the 5.1 source need to be built.

Packaged into the repository, they become the two jars (kettle-core and kettle-engine) that can then replace the old 5.1.0.0-752 Kettle jars in the project.

Before doing that, let's first look at what happens when the old 5.1.0.0-752 jars are left in place.

1. Running with the 5.1.0.0-752 jars

Adding the dependency

This is the Hive JDBC driver dependency:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>

Writing the Hive-to-Hive Java code

Initializing the environment

@Before
public void before() {
    try {
        // Initialize the Kettle environment
        KettleEnvironment.init();
        EnvUtil.environmentInit();
    } catch (KettleException e) {
        e.printStackTrace();
    }
}

Writing the exchange code

The meaning of each line was explained in an earlier post, so I won't go into detail again.

The model here is very simple: just a Table Input step feeding a Table Output step. The earlier Select Values, string-replacement and similar steps are gone, and the connections are defined as XML, which keeps things clear and simple.

/**
 * Exchange data between two Hive databases
 * @throws KettleException
 */
@Test
public void exchangeHive2Hive() throws KettleException {
    // Source database connection
    String hive_src = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<connection>" +
            "<name>hive_src</name>" +
            "<server>192.168.10.212</server>" +
            "<type>HIVE2</type>" +
            "<access>Native</access>" +
            "<database>ntzw_dev_64</database>" +
            "<port>10000</port>" +
            "<username>hadoop</username>" +
            "<password>hadoop</password>" +
            "</connection>";

    // Target database connection
    String hive_dest = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<connection>" +
            "<name>hive_dest</name>" +
            "<server>192.168.10.212</server>" +
            "<type>HIVE2</type>" +
            "<access>Native</access>" +
            "<database>ntzw_dev_64</database>" +
            "<port>10000</port>" +
            "<username>hadoop</username>" +
            "<password>hadoop</password>" +
            "</connection>";

    DatabaseMeta srcDatabaseMeta = new DatabaseMeta(hive_src);
    DatabaseMeta destDatabaseMeta = new DatabaseMeta(hive_dest);

    // Create the transformation metadata
    TransMeta transMeta = new TransMeta();
    transMeta.setName("Hive-to-Hive exchange");

    // Register the source and target connections
    transMeta.addDatabase(srcDatabaseMeta);
    transMeta.addDatabase(destDatabaseMeta);

    /*
     * Create the Table Input -> Table Output steps
     * and hook them together.
     */
    PluginRegistry registry = PluginRegistry.getInstance();
    TableInputMeta tableInputMeta = new TableInputMeta();
    String tableInputPluginId = registry.getPluginId(StepPluginType.class,
            tableInputMeta);

    tableInputMeta.setDatabaseMeta(srcDatabaseMeta);
    // Set the source query
    String selectSql = "select id, name from user_info_src";
    tableInputMeta.setSQL(selectSql);

    StepMeta tableInputStepMeta = new StepMeta(tableInputPluginId,
            "tableInput", (StepMetaInterface) tableInputMeta);
    transMeta.addStep(tableInputStepMeta);

    TableOutputMeta tableOutputMeta = new TableOutputMeta();
    tableOutputMeta.setDatabaseMeta(destDatabaseMeta);

    // Set the target schema and table name
    tableOutputMeta.setSchemaName(null);
    tableOutputMeta.setTablename("user_info_dest");

    String tableOutputPluginId = registry.getPluginId(StepPluginType.class, tableOutputMeta);
    StepMeta tableOutputStep = new StepMeta(tableOutputPluginId, "tableOutput", (StepMetaInterface) tableOutputMeta);

    // Add the output step to the transformation
    transMeta.addStep(tableOutputStep);

    // Link the output step to the previous one
    transMeta.addTransHop(new TransHopMeta(tableInputStepMeta, tableOutputStep));

    Trans trans = new Trans(transMeta);

    // Execute the transformation
    trans.execute(null);

    // Wait for it to finish
    trans.waitUntilFinished();
    if (trans.getErrors() > 0) {
        System.out.println("Exchange failed.");
        return;
    }
}
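For reference, the two snippets above are excerpts from one JUnit test class; the imports they rely on are roughly the following (standard Kettle package paths, listed here as a convenience rather than taken from the original post):

import org.junit.Before;
import org.junit.Test;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.database.DatabaseMeta;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.core.plugins.PluginRegistry;
import org.pentaho.di.core.plugins.StepPluginType;
import org.pentaho.di.core.util.EnvUtil;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransHopMeta;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.step.StepMetaInterface;
import org.pentaho.di.trans.steps.tableinput.TableInputMeta;
import org.pentaho.di.trans.steps.tableoutput.TableOutputMeta;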

Writing the Hive connection metadata

Note that this Kettle jar does not support Hive exchanges out of the box (neither does the 8.1 build I downloaded). In the connection XML above there is a type node that declares the HIVE2 type:

<type>HIVE2</type>

kettle-core looks this type up in kettle-database-types.xml to find the corresponding database metadata, and of course no such entry exists there.

So we need to imitate the existing entries and write the metadata for a Hive data source ourselves. First, add the following entry to that XML file:

  <database-type id="HIVE2">
    <description>HIVE2</description>
    <classname>org.pentaho.di.core.database.Hive2SQLDatabaseMeta</classname>
  </database-type>

Then write the Hive2SQLDatabaseMeta class, shown below:

package org.pentaho.di.core.database;

import org.pentaho.di.core.Const;
import org.pentaho.di.core.row.ValueMetaInterface;

/**
 * Note: Kettle's databaseType list does not include HIVE2, so we add a HIVE2 entry to the
 * kettle-database-types.xml file in kettle-core and implement this class for it.
 */


/**
 * This class was written by referring to PostgreSQLDatabaseMeta.java and to DatabaseMetaWithVersion.java
 * in the pentaho-big-data-kettle-plugins-hive module of pentaho-big-data-bundles.
 */
public class Hive2SQLDatabaseMeta extends BaseDatabaseMeta  {
    public Hive2SQLDatabaseMeta() {
    }
    @Override
    public String getExtraOptionSeparator() {
        return "&";
    }
    @Override
    public String getExtraOptionIndicator() {
        return "?";
    }
    @Override
    public int[] getAccessTypeList() {
        return new int[] { DatabaseMeta.TYPE_ACCESS_NATIVE };
    }
    @Override
    public int getDefaultDatabasePort() {
        return this.getAccessType() == 0 ? 10000 : -1;
    }

    @Override
    public String getDriverClass() {
        return this.getAccessType() == 1 ? "sun.jdbc.odbc.JdbcOdbcDriver" : "org.apache.hive.jdbc.HiveDriver";
    }

    @Override
    public String getURL(String hostname, String port, String databaseName) {
        return this.getAccessType() == 1 ? "jdbc:odbc:" + databaseName : "jdbc:hive2://" + hostname + ":" + port + "/" + databaseName;
    }
    @Override
    public boolean isFetchSizeSupported() {
        return true;
    }
    @Override
    public boolean supportsBitmapIndex() {
        return false;
    }
    @Override
    public boolean supportsSynonyms() {
        return false;
    }
    @Override
    public boolean supportsSequences() {
        return true;
    }
    @Override
    public boolean supportsSequenceNoMaxValueOption() {
        return true;
    }
    @Override
    public boolean supportsAutoInc() {
        return true;
    }
    @Override
    public String getLimitClause(int nrRows) {
        return " limit " + nrRows;
    }
    @Override
    public String getSQLQueryFields(String tableName) {
        return "SELECT * FROM " + tableName + this.getLimitClause(1);
    }
    @Override
    public String getSQLTableExists(String tablename) {
        return this.getSQLQueryFields(tablename);
    }
    @Override
    public String getSQLColumnExists(String columnname, String tablename) {
        return this.getSQLQueryColumnFields(columnname, tablename);
    }
    public String getSQLQueryColumnFields(String columnname, String tableName) {
        return "SELECT " + columnname + " FROM " + tableName + this.getLimitClause(1);
    }
    @Override
    public boolean needsToLockAllTables() {
        return false;
    }

/*    @Override
    public String getSQLListOfSequences() {
        return "SELECT relname AS sequence_name FROM pg_catalog.pg_statio_all_sequences";
    }*/
    @Override
    public String getSQLNextSequenceValue(String sequenceName) {
        return "SELECT nextval('" + sequenceName + "')";
    }
    @Override
    public String getSQLCurrentSequenceValue(String sequenceName) {
        return "SELECT currval('" + sequenceName + "')";
    }
/*    @Override
    public String getSQLSequenceExists(String sequenceName) {
        return "SELECT relname AS sequence_name FROM pg_catalog.pg_statio_all_sequences WHERE relname = '" + sequenceName.toLowerCase() + "'";
    }*/
    @Override
    public String getAddColumnStatement(String tablename, ValueMetaInterface v, String tk, boolean use_autoinc, String pk, boolean semicolon) {
        return "ALTER TABLE " + tablename + " ADD " + getFieldDefinition( v, tk, pk, use_autoinc, true, false );
    }
    @Override
    public String getDropColumnStatement(String tablename, ValueMetaInterface v, String tk, boolean use_autoinc, String pk, boolean semicolon) {
        return "ALTER TABLE " + tablename + " DROP COLUMN " + v.getName();
    }
    @Override
    public String getModifyColumnStatement(String tablename, ValueMetaInterface v, String tk, boolean use_autoinc, String pk, boolean semicolon) {
        return "ALTER TABLE " + tablename + " MODIFY " + getFieldDefinition( v, tk, pk, use_autoinc, true, false );
    }
    @Override
    public String getFieldDefinition(ValueMetaInterface v, String tk, String pk, boolean use_autoinc, boolean add_fieldname, boolean add_cr) {
        String retval = "";

        String fieldname = v.getName();
        int length = v.getLength();
        int precision = v.getPrecision();

        if ( add_fieldname ) {
            retval += fieldname + " ";
        }

        int type = v.getType();
        switch ( type ) {

            case ValueMetaInterface.TYPE_BOOLEAN:
                retval += "BOOLEAN";
                break;

            case ValueMetaInterface.TYPE_DATE:
                retval += "DATE";
                break;

            case ValueMetaInterface.TYPE_TIMESTAMP:
                retval += "TIMESTAMP";
                break;

            case ValueMetaInterface.TYPE_STRING:
                retval += "STRING";
                break;

            case ValueMetaInterface.TYPE_NUMBER:
            case ValueMetaInterface.TYPE_INTEGER:
            case ValueMetaInterface.TYPE_BIGNUMBER:
                // Integer values...
                if ( precision == 0 ) {
                    if ( length > 9 ) {
                        if ( length < 19 ) {
                            // can hold signed values between -9223372036854775808 and 9223372036854775807
                            // 18 significant digits
                            retval += "BIGINT";
                        } else {
                            retval += "FLOAT";
                        }
                    } else {
                        retval += "INT";
                    }
                } else {
                    // Floating point values...
                    if ( length > 15 ) {
                        retval += "FLOAT";
                    } else {
                        // A double-precision floating-point number is accurate to approximately 15 decimal places.
                        // http://mysql.mirrors-r-us.net/doc/refman/5.1/en/numeric-type-overview.html
                        retval += "DOUBLE";
                    }
                }

                break;
        }

        return retval;
    }
    @Override
    public String getSQLListOfProcedures() {
        return null;
    }
    @Override
    public String[] getReservedWords() {
        return new String[]{"A", "ABORT", "ABS", "ABSOLUTE", "ACCESS", "ACTION", "ADA", "ADD", "ADMIN", "AFTER", "AGGREGATE", "ALIAS", "ALL", "ALLOCATE", "ALSO", "ALTER", "ALWAYS", "ANALYSE", "ANALYZE", "AND", "ANY", "ARE", "ARRAY", "AS", "ASC", "ASENSITIVE", "ASSERTION", "ASSIGNMENT", "ASYMMETRIC", "AT", "ATOMIC", "ATTRIBUTE", "ATTRIBUTES", "AUTHORIZATION", "AVG", "BACKWARD", "BEFORE", "BEGIN", "BERNOULLI", "BETWEEN", "BIGINT", "BINARY", "BIT", "BITVAR", "BIT_LENGTH", "BLOB", "BOOLEAN", "BOTH", "BREADTH", "BY", "C", "CACHE", "CALL", "CALLED", "CARDINALITY", "CASCADE", "CASCADED", "CASE", "CAST", "CATALOG", "CATALOG_NAME", "CEIL", "CEILING", "CHAIN", "CHAR", "CHARACTER", "CHARACTERISTICS", "CHARACTERS", "CHARACTER_LENGTH", "CHARACTER_SET_CATALOG", "CHARACTER_SET_NAME", "CHARACTER_SET_SCHEMA", "CHAR_LENGTH", "CHECK", "CHECKED", "CHECKPOINT", "CLASS", "CLASS_ORIGIN", "CLOB", "CLOSE", "CLUSTER", "COALESCE", "COBOL", "COLLATE", "COLLATION", "COLLATION_CATALOG", "COLLATION_NAME", "COLLATION_SCHEMA", "COLLECT", "COLUMN", "COLUMN_NAME", "COMMAND_FUNCTION", "COMMAND_FUNCTION_CODE", "COMMENT", "COMMIT", "COMMITTED", "COMPLETION", "CONDITION", "CONDITION_NUMBER", "CONNECT", "CONNECTION", "CONNECTION_NAME", "CONSTRAINT", "CONSTRAINTS", "CONSTRAINT_CATALOG", "CONSTRAINT_NAME", "CONSTRAINT_SCHEMA", "CONSTRUCTOR", "CONTAINS", "CONTINUE", "CONVERSION", "CONVERT", "COPY", "CORR", "CORRESPONDING", "COUNT", "COVAR_POP", "COVAR_SAMP", "CREATE", "CREATEDB", "CREATEROLE", "CREATEUSER", "CROSS", "CSV", "CUBE", "CUME_DIST", "CURRENT", "CURRENT_DATE", "CURRENT_DEFAULT_TRANSFORM_GROUP", "CURRENT_PATH", "CURRENT_ROLE", "CURRENT_TIME", "CURRENT_TIMESTAMP", "CURRENT_TRANSFORM_GROUP_FOR_TYPE", "CURRENT_USER", "CURSOR", "CURSOR_NAME", "CYCLE", "DATA", "DATABASE", "DATE", "DATETIME_INTERVAL_CODE", "DATETIME_INTERVAL_PRECISION", "DAY", "DEALLOCATE", "DEC", "DECIMAL", "DECLARE", "DEFAULT", "DEFAULTS", "DEFERRABLE", "DEFERRED", "DEFINED", "DEFINER", "DEGREE", "DELETE", "DELIMITER", "DELIMITERS", "DENSE_RANK", "DEPTH", "DEREF", "DERIVED", "DESC", "DESCRIBE", "DESCRIPTOR", "DESTROY", "DESTRUCTOR", "DETERMINISTIC", "DIAGNOSTICS", "DICTIONARY", "DISABLE", "DISCONNECT", "DISPATCH", "DISTINCT", "DO", "DOMAIN", "DOUBLE", "DROP", "DYNAMIC", "DYNAMIC_FUNCTION", "DYNAMIC_FUNCTION_CODE", "EACH", "ELEMENT", "ELSE", "ENABLE", "ENCODING", "ENCRYPTED", "END", "END-EXEC", "EQUALS", "ESCAPE", "EVERY", "EXCEPT", "EXCEPTION", "EXCLUDE", "EXCLUDING", "EXCLUSIVE", "EXEC", "EXECUTE", "EXISTING", "EXISTS", "EXP", "EXPLAIN", "EXTERNAL", "EXTRACT", "FALSE", "FETCH", "FILTER", "FINAL", "FIRST", "FLOAT", "FLOOR", "FOLLOWING", "FOR", "FORCE", "FOREIGN", "FORTRAN", "FORWARD", "FOUND", "FREE", "FREEZE", "FROM", "FULL", "FUNCTION", "FUSION", "G", "GENERAL", "GENERATED", "GET", "GLOBAL", "GO", "GOTO", "GRANT", "GRANTED", "GREATEST", "GROUP", "GROUPING", "HANDLER", "HAVING", "HEADER", "HIERARCHY", "HOLD", "HOST", "HOUR", "IDENTITY", "IGNORE", "ILIKE", "IMMEDIATE", "IMMUTABLE", "IMPLEMENTATION", "IMPLICIT", "IN", "INCLUDING", "INCREMENT", "INDEX", "INDICATOR", "INFIX", "INHERIT", "INHERITS", "INITIALIZE", "INITIALLY", "INNER", "INOUT", "INPUT", "INSENSITIVE", "INSERT", "INSTANCE", "INSTANTIABLE", "INSTEAD", "INT", "INTEGER", "INTERSECT", "INTERSECTION", "INTERVAL", "INTO", "INVOKER", "IS", "ISNULL", "ISOLATION", "ITERATE", "JOIN", "K", "KEY", "KEY_MEMBER", "KEY_TYPE", "LANCOMPILER", "LANGUAGE", "LARGE", "LAST", "LATERAL", "LEADING", "LEAST", "LEFT", "LENGTH", "LESS", "LEVEL", "LIKE", "LIMIT", "LISTEN", "LN", "LOAD", "LOCAL", "LOCALTIME", "LOCALTIMESTAMP", 
"LOCATION", "LOCATOR", "LOCK", "LOGIN", "LOWER", "M", "MAP", "MATCH", "MATCHED", "MAX", "MAXVALUE", "MEMBER", "MERGE", "MESSAGE_LENGTH", "MESSAGE_OCTET_LENGTH", "MESSAGE_TEXT", "METHOD", "MIN", "MINUTE", "MINVALUE", "MOD", "MODE", "MODIFIES", "MODIFY", "MODULE", "MONTH", "MORE", "MOVE", "MULTISET", "MUMPS", "NAME", "NAMES", "NATIONAL", "NATURAL", "NCHAR", "NCLOB", "NESTING", "NEW", "NEXT", "NO", "NOCREATEDB", "NOCREATEROLE", "NOCREATEUSER", "NOINHERIT", "NOLOGIN", "NONE", "NORMALIZE", "NORMALIZED", "NOSUPERUSER", "NOT", "NOTHING", "NOTIFY", "NOTNULL", "NOWAIT", "NULL", "NULLABLE", "NULLIF", "NULLS", "NUMBER", "NUMERIC", "OBJECT", "OCTETS", "OCTET_LENGTH", "OF", "OFF", "OFFSET", "OIDS", "OLD", "ON", "ONLY", "OPEN", "OPERATION", "OPERATOR", "OPTION", "OPTIONS", "OR", "ORDER", "ORDERING", "ORDINALITY", "OTHERS", "OUT", "OUTER", "OUTPUT", "OVER", "OVERLAPS", "OVERLAY", "OVERRIDING", "OWNER", "PAD", "PARAMETER", "PARAMETERS", "PARAMETER_MODE", "PARAMETER_NAME", "PARAMETER_ORDINAL_POSITION", "PARAMETER_SPECIFIC_CATALOG", "PARAMETER_SPECIFIC_NAME", "PARAMETER_SPECIFIC_SCHEMA", "PARTIAL", "PARTITION", "PASCAL", "PASSWORD", "PATH", "PERCENTILE_CONT", "PERCENTILE_DISC", "PERCENT_RANK", "PLACING", "PLI", "POSITION", "POSTFIX", "POWER", "PRECEDING", "PRECISION", "PREFIX", "PREORDER", "PREPARE", "PREPARED", "PRESERVE", "PRIMARY", "PRIOR", "PRIVILEGES", "PROCEDURAL", "PROCEDURE", "PUBLIC", "QUOTE", "RANGE", "RANK", "READ", "READS", "REAL", "RECHECK", "RECURSIVE", "REF", "REFERENCES", "REFERENCING", "REGR_AVGX", "REGR_AVGY", "REGR_COUNT", "REGR_INTERCEPT", "REGR_R2", "REGR_SLOPE", "REGR_SXX", "REGR_SXY", "REGR_SYY", "REINDEX", "RELATIVE", "RELEASE", "RENAME", "REPEATABLE", "REPLACE", "RESET", "RESTART", "RESTRICT", "RESULT", "RETURN", "RETURNED_CARDINALITY", "RETURNED_LENGTH", "RETURNED_OCTET_LENGTH", "RETURNED_SQLSTATE", "RETURNS", "REVOKE", "RIGHT", "ROLE", "ROLLBACK", "ROLLUP", "ROUTINE", "ROUTINE_CATALOG", "ROUTINE_NAME", "ROUTINE_SCHEMA", "ROW", "ROWS", "ROW_COUNT", "ROW_NUMBER", "RULE", "SAVEPOINT", "SCALE", "SCHEMA", "SCHEMA_NAME", "SCOPE", "SCOPE_CATALOG", "SCOPE_NAME", "SCOPE_SCHEMA", "SCROLL", "SEARCH", "SECOND", "SECTION", "SECURITY", "SELECT", "SELF", "SENSITIVE", "SEQUENCE", "SERIALIZABLE", "SERVER_NAME", "SESSION", "SESSION_USER", "SET", "SETOF", "SETS", "SHARE", "SHOW", "SIMILAR", "SIMPLE", "SIZE", "SMALLINT", "SOME", "SOURCE", "SPACE", "SPECIFIC", "SPECIFICTYPE", "SPECIFIC_NAME", "SQL", "SQLCODE", "SQLERROR", "SQLEXCEPTION", "SQLSTATE", "SQLWARNING", "SQRT", "STABLE", "START", "STATE", "STATEMENT", "STATIC", "STATISTICS", "STDDEV_POP", "STDDEV_SAMP", "STDIN", "STDOUT", "STORAGE", "STRICT", "STRUCTURE", "STYLE", "SUBCLASS_ORIGIN", "SUBLIST", "SUBMULTISET", "SUBSTRING", "SUM", "SUPERUSER", "SYMMETRIC", "SYSID", "SYSTEM", "SYSTEM_USER", "TABLE", "TABLESAMPLE", "TABLESPACE", "TABLE_NAME", "TEMP", "TEMPLATE", "TEMPORARY", "TERMINATE", "THAN", "THEN", "TIES", "TIME", "TIMESTAMP", "TIMEZONE_HOUR", "TIMEZONE_MINUTE", "TO", "TOAST", "TOP_LEVEL_COUNT", "TRAILING", "TRANSACTION", "TRANSACTIONS_COMMITTED", "TRANSACTIONS_ROLLED_BACK", "TRANSACTION_ACTIVE", "TRANSFORM", "TRANSFORMS", "TRANSLATE", "TRANSLATION", "TREAT", "TRIGGER", "TRIGGER_CATALOG", "TRIGGER_NAME", "TRIGGER_SCHEMA", "TRIM", "TRUE", "TRUNCATE", "TRUSTED", "TYPE", "UESCAPE", "UNBOUNDED", "UNCOMMITTED", "UNDER", "UNENCRYPTED", "UNION", "UNIQUE", "UNKNOWN", "UNLISTEN", "UNNAMED", "UNNEST", "UNTIL", "UPDATE", "UPPER", "USAGE", "USER", "USER_DEFINED_TYPE_CATALOG", "USER_DEFINED_TYPE_CODE", "USER_DEFINED_TYPE_NAME", 
"USER_DEFINED_TYPE_SCHEMA", "USING", "VACUUM", "VALID", "VALIDATOR", "VALUE", "VALUES", "VARCHAR", "VARIABLE", "VARYING", "VAR_POP", "VAR_SAMP", "VERBOSE", "VIEW", "VOLATILE", "WHEN", "WHENEVER", "WHERE", "WIDTH_BUCKET", "WINDOW", "WITH", "WITHIN", "WITHOUT", "WORK", "WRITE", "YEAR", "ZONE"};
    }
    @Override
    public boolean supportsRepository() {
        return true;
    }
    @Override
    public String getSQLLockTables(String[] tableNames) {
        String sql = "LOCK TABLE ";

        for(int i = 0; i < tableNames.length; ++i) {
            if (i > 0) {
                sql = sql + ", ";
            }

            sql = sql + tableNames[i] + " ";
        }

        sql = sql + "IN ACCESS EXCLUSIVE MODE;" + Const.CR;
        return sql;
    }
    @Override
    public String getSQLUnlockTables(String[] tableName) {
        return null;
    }
    @Override
    public boolean isDefaultingToUppercase() {
        return false;
    }

    @Override
    public String[] getUsedLibraries() {
        return new String[]{"hive-jdbc-1.2.1.jar"};
    }
    @Override
    public boolean supportsErrorHandlingOnBatchUpdates() {
        return false;
    }
    @Override
    public String quoteSQLString(String string) {
        string = string.replaceAll("'", "''");
        string = string.replaceAll("\\n", "\\\\n");
        string = string.replaceAll("\\r", "\\\\r");
        return "E'" + string + "'";
    }
    @Override
    public boolean requiresCastToVariousForIsNull() {
        return true;
    }
    @Override
    public boolean supportsGetBlob() {
        return false;
    }
    @Override
    public boolean useSafePoints() {
        return true;
    }

    @Override public String getSQLInsertAutoIncUnknownDimensionRow( String schemaTable, String keyField,
                                                                    String versionField ) {
        return "insert into " + schemaTable + "(" + keyField + ", " + versionField + ") values (1, 1)";
    }

    /**
     * TODO From reading the source: when a field name needs quoting, Kettle wraps it in quote
     * characters in the generated INSERT, producing something like
     * insert into table12 (uid,name,"username","password") values(?,?,?,?); which fails on Hive.
     *
     * Override this method to drop the quote character, otherwise the statement errors out.
     * @return
     */
    @Override public String getEndQuote() {
        return "";
    }

   
}

Note: since we are using the 5.1.0.0-752 jars and have no source for them, I compiled this class into a .class file, opened the jar with the 360 Zip tool, dropped the class into the right package path, and modified the kettle-database-types.xml inside to add the HIVE2 node.


2. Creating the tables in Hive

Create a source table (user_info_src) and a target table (user_info_dest):

CREATE TABLE `user_info_src`(
  `id` string,
  `name` string);

CREATE TABLE `user_info_dest`(
  `id` string,
  `name` string);

3. Running the code

Running the Hive-to-Hive exchange fails: it cannot connect to the Hive database, and judging from the error it is missing the Hadoop configuration classes.

The fix is to pull in the Hadoop jars:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>

It fails again; this time the error is in the getRowInfo method.

Tracing into that method shows it is the isSigned call that throws.

The isSigned method calls into the Hive JDBC driver.

Hive simply does not support isSigned, which is why the exception is thrown. So here's a thought: if we catch the exception around the isSigned call, can the run continue? Looking at the downloaded 5.1 source, it has in fact already been handled there.

So would it work to take the 5.1.0.0-752 code, catch the exception at that spot, and recompile it into a class file?
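For reference, the handling in the newer source looks roughly like the sketch below (not the exact code; here `rm` stands for the java.sql.ResultSetMetaData and `i` for the column index used in getRowInfo):

boolean signed = false;
try {
    // Some JDBC drivers (Hive's included) throw "Method not supported" here.
    signed = rm.isSigned(i);
} catch (Exception e) {
    // Ignore and treat the column as unsigned so the row metadata can still be built.
}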

I haven't actually tried that, though. Instead I switched the referenced Kettle jars to 9.3.0.0-157, the jars I built from the downloaded "5.1" source.

Some classes differ in these jars, of course; JsonOutputField, for example, sits under the plugins module in the 8.1 source and is simply gone in 5.1.

After swapping the jars and running again, it still fails, this time complaining that the kettle-password-encoder-plugins.xml file is missing from the jar, so I copied the one from the 8.1 source over.

Run it again and, hey, no more errors.

But then I noticed my source table had no data, so that was an empty exchange. What happens once there is actually data to move? (Say, after inserting a couple of test rows, as sketched below.)
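A hypothetical pair of test rows (the actual data used in the post isn't shown; any values will do):

-- hypothetical test data for the source table
INSERT INTO user_info_src VALUES ('1', 'zhangsan'), ('2', 'lisi');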

It errors out again, apparently during the insert this time, and again with "Method not supported". It looks like yet another method the Hive JDBC driver doesn't support, so I kept tracing through the code.

Sure enough, it is something Hive does not support: the highlighted condition evaluates to true, and stepping into the Hive driver source confirms it.


So let's make the databaseMeta.supportsSetCharacterStream() call stop returning true. Stepping into it shows the call lands in Hive2SQLDatabaseMeta, the class we wrote ourselves, so we just override that method (see the sketch below).
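A minimal sketch of that override, added to our Hive2SQLDatabaseMeta (the method name comes straight from the call above; the comment is my reading of why it helps):

@Override
public boolean supportsSetCharacterStream() {
    // The Hive JDBC driver throws "Method not supported" for setCharacterStream,
    // so tell Kettle not to use it and let it bind string values with setString instead.
    return false;
}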

Run it once more and the exchange goes through; the data shows up in the target table.

Problems

1. Inserting into Hive this way is very slow. With these JDBC INSERT statements, from what I observed a single statement took about 1 second. Seriously, that's not going to work.

2. Selecting even a single row from Hive is sometimes slow too, anywhere from 5s to 30s.

Which makes me wonder whether this is exactly why Kettle doesn't support this kind of Hive exchange: Hive is really just a JDBC front over files in Hadoop, and frankly not a storage engine like a traditional database at all.

Later on I implemented the exchange by working with the Hadoop file system directly; I'll write that up when I find the time. In short, writing the data out to files and then loading them into Hive is still very fast (the idea is sketched below).
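Just to illustrate that idea (a sketch under assumptions: the HDFS path is made up, and the file is assumed to already be in the table's delimited format):

-- Load a pre-written data file straight into the target table
-- instead of issuing row-by-row JDBC inserts.
LOAD DATA INPATH '/tmp/user_info_dest.txt' INTO TABLE user_info_dest;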

Reflections

Someone asked me: there's already a pile of Kettle material out there, why keep writing blog posts about it? Isn't that just repeating what exists?

Other people's posts are always theirs. What I write is what I actually went through and how I thought about it, and it's also a way of reviewing it, like writing down your impressions after a film: a thousand people will have a thousand takes.

Postscript:

Damn it, I had cloned the repo directly and forgot to switch branches, so the source above isn't the 5.1 source at all, it's master. Good grief, no wonder the jar version came out as 9.3. Ugh.

posted @ 2021-08-19 10:23 伟衙内