Kettle大量数据快速导出的解决方案(利用SQL导出百万级数据,挺快的)

1.开发背景
在web项目中,经常会需要查询数据导出excel,以前比较常见的就是用poi。使用poi的时候也有两种方式,一种就是直接将集合一次性导出为excel,还有一种是分批次追加的方式适合数据量较大的情况。poi支持xls和xlsx,使用2003版本的只支持6万多行以下的数据量,使用2007版本的支持百万行。但是呢,当数据量大了之后这种方式却非常耗内存和时间。
接触了etl之后就想着用kettle来做导数据,经过测试是完全可行的。几十万行,一百万行都能快速导出来,代码也非常简单。
 
2.kettle相关maven依赖如下
 <dependency>
     <groupId>org.apache.commons</groupId>
     <artifactId>commons-vfs2</artifactId>
     <version>2.0</version>
 </dependency>
 <dependency>
     <groupId>org.scannotation</groupId>
     <artifactId>scannotation</artifactId>
     <version>1.0.3</version>
 </dependency>
 <dependency>
     <groupId>dom4j</groupId>
     <artifactId>dom4j</artifactId>
     <version>1.6.1</version>
 </dependency>
 <dependency>
     <groupId>pentaho-kettle</groupId>
     <artifactId>kettle-vfs</artifactId>
     <version>5.2.0.0</version>
     <classifier>pentaho</classifier>
 </dependency>
 <dependency>
     <groupId>pentaho-kettle</groupId>
     <artifactId>kettle-engine</artifactId>
     <version>5.2.0.0</version>
 </dependency>
 <dependency>
     <groupId>pentaho-kettle</groupId>
     <artifactId>kettle-core</artifactId>
     <version>5.2.0.0</version>
 </dependency>

OK,准备工作完成后开始编写Kettle工具类:

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLDecoder;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts2.ServletActionContext;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class Kettle {
    public Kettle() {
    }

    public static InputStream export(String sql, String savePath, String hostIp, String dbName, String port, String userName, String password) {
        FileInputStream in = null;

        try {
            HttpServletResponse response = ServletActionContext.getResponse();
            URL url = Kettle.class.getResource("/");
            String paths = url.getFile().substring(0, url.getFile().indexOf("classes"));
            paths = URLDecoder.decode(paths);
            Long times = System.currentTimeMillis();
            File path = new File(savePath);
            if (!path.exists() && !path.isDirectory()) {
                path.mkdir();
                path.canWrite();
            }

            KettleEnvironment.init();
            TransMeta tm = new TransMeta(paths + "classes/" + "export.ktr");
            Trans trans = new Trans(tm);
            trans.setVariable("hostIp", hostIp);
            trans.setVariable("dbName", dbName);
            trans.setVariable("port", port);
            trans.setVariable("userName", userName);
            trans.setVariable("password", password);
            trans.setVariable("sql", sql);
            trans.setVariable("savePath", savePath + "/" + times);
            trans.prepareExecution((String[])null);
            trans.startThreads();
            trans.waitUntilFinished();
            File file = new File(savePath + "/" + times + ".xlsx");
            File file2 = new File(savePath + "/" + times + ".xls");
            ServletOutputStream out;
            if (file.exists()) {
                in = new FileInputStream(file);
                response.reset();
                response.setContentType("application/x-download;charset=UTF-8");
                response.addHeader("Content-Disposition", "attachment;filename='" + times + ".xlsx'");
                response.addHeader("Content-Length", String.valueOf(file.length()));
                out = response.getOutputStream();
                writeFile(out, in);
                file.delete();
            } else if (file2.exists()) {
                in = new FileInputStream(file2);
                response.reset();
                response.setContentType("application/x-download;charset=UTF-8");
                response.addHeader("Content-Disposition", "attachment;filename='" + times + ".xls'");
                response.addHeader("Content-Length", String.valueOf(file2.length()));
                out = response.getOutputStream();
                writeFile(out, in);
                file2.delete();
            } else {
                response.setContentType("text/html;charset=UTF-8");
                PrintWriter writer = response.getWriter();
                writer.write("<script>alert('文件不存在!');window.close();</script>");
            }
        } catch (Exception var18) {
            var18.printStackTrace();
        }

        return in;
    }

    public static void writeFile(OutputStream fos, InputStream fis) throws IOException {
        byte[] buffer = new byte[1024];
        boolean var3 = false;

        int len;
        while((len = fis.read(buffer)) > 0) {
            fos.write(buffer, 0, len);
            fos.flush();
        }

        fis.close();
        fos.close();
    }
}

以上工具类要读取一个export.ktr的文件,我们把它放在\src\main\java下(当然你也可以放在你想要的路径下,修改以上程序能获取到就是了),记得打包不要排除了,不然找不到

export.ktr文件内容(拷贝我的就行,如果不能使用自己去网上找个,百度下里面参数意思,数据库我用的mysql):

<?xml version="1.0" encoding="UTF-8"?>
<transformation>
    <info>
        <name>export</name>
        <description/>
        <extended_description/>
        <trans_version/>
        <trans_type>Normal</trans_type>
        <trans_status>0</trans_status>
        <directory>/</directory>
        <parameters>
        </parameters>
        <log>
            <trans-log-table>
                <connection/>
                <schema/>
                <table/>
                <size_limit_lines/>
                <interval/>
                <timeout_days/>
                <field>
                    <id>ID_BATCH</id>
                    <enabled>Y</enabled>
                    <name>ID_BATCH</name>
                </field>
                <field>
                    <id>CHANNEL_ID</id>
                    <enabled>Y</enabled>
                    <name>CHANNEL_ID</name>
                </field>
                <field>
                    <id>TRANSNAME</id>
                    <enabled>Y</enabled>
                    <name>TRANSNAME</name>
                </field>
                <field>
                    <id>STATUS</id>
                    <enabled>Y</enabled>
                    <name>STATUS</name>
                </field>
                <field>
                    <id>LINES_READ</id>
                    <enabled>Y</enabled>
                    <name>LINES_READ</name>
                    <subject/>
                </field>
                <field>
                    <id>LINES_WRITTEN</id>
                    <enabled>Y</enabled>
                    <name>LINES_WRITTEN</name>
                    <subject/>
                </field>
                <field>
                    <id>LINES_UPDATED</id>
                    <enabled>Y</enabled>
                    <name>LINES_UPDATED</name>
                    <subject/>
                </field>
                <field>
                    <id>LINES_INPUT</id>
                    <enabled>Y</enabled>
                    <name>LINES_INPUT</name>
                    <subject/>
                </field>
                <field>
                    <id>LINES_OUTPUT</id>
                    <enabled>Y</enabled>
                    <name>LINES_OUTPUT</name>
                    <subject/>
                </field>
                <field>
                    <id>LINES_REJECTED</id>
                    <enabled>Y</enabled>
                    <name>LINES_REJECTED</name>
                    <subject/>
                </field>
                <field>
                    <id>ERRORS</id>
                    <enabled>Y</enabled>
                    <name>ERRORS</name>
                </field>
                <field>
                    <id>STARTDATE</id>
                    <enabled>Y</enabled>
                    <name>STARTDATE</name>
                </field>
                <field>
                    <id>ENDDATE</id>
                    <enabled>Y</enabled>
                    <name>ENDDATE</name>
                </field>
                <field>
                    <id>LOGDATE</id>
                    <enabled>Y</enabled>
                    <name>LOGDATE</name>
                </field>
                <field>
                    <id>DEPDATE</id>
                    <enabled>Y</enabled>
                    <name>DEPDATE</name>
                </field>
                <field>
                    <id>REPLAYDATE</id>
                    <enabled>Y</enabled>
                    <name>REPLAYDATE</name>
                </field>
                <field>
                    <id>LOG_FIELD</id>
                    <enabled>Y</enabled>
                    <name>LOG_FIELD</name>
                </field>
                <field>
                    <id>EXECUTING_SERVER</id>
                    <enabled>N</enabled>
                    <name>EXECUTING_SERVER</name>
                </field>
                <field>
                    <id>EXECUTING_USER</id>
                    <enabled>N</enabled>
                    <name>EXECUTING_USER</name>
                </field>
                <field>
                    <id>CLIENT</id>
                    <enabled>N</enabled>
                    <name>CLIENT</name>
                </field>
            </trans-log-table>
            <perf-log-table>
                <connection/>
                <schema/>
                <table/>
                <interval/>
                <timeout_days/>
                <field>
                    <id>ID_BATCH</id>
                    <enabled>Y</enabled>
                    <name>ID_BATCH</name>
                </field>
                <field>
                    <id>SEQ_NR</id>
                    <enabled>Y</enabled>
                    <name>SEQ_NR</name>
                </field>
                <field>
                    <id>LOGDATE</id>
                    <enabled>Y</enabled>
                    <name>LOGDATE</name>
                </field>
                <field>
                    <id>TRANSNAME</id>
                    <enabled>Y</enabled>
                    <name>TRANSNAME</name>
                </field>
                <field>
                    <id>STEPNAME</id>
                    <enabled>Y</enabled>
                    <name>STEPNAME</name>
                </field>
                <field>
                    <id>STEP_COPY</id>
                    <enabled>Y</enabled>
                    <name>STEP_COPY</name>
                </field>
                <field>
                    <id>LINES_READ</id>
                    <enabled>Y</enabled>
                    <name>LINES_READ</name>
                </field>
                <field>
                    <id>LINES_WRITTEN</id>
                    <enabled>Y</enabled>
                    <name>LINES_WRITTEN</name>
                </field>
                <field>
                    <id>LINES_UPDATED</id>
                    <enabled>Y</enabled>
                    <name>LINES_UPDATED</name>
                </field>
                <field>
                    <id>LINES_INPUT</id>
                    <enabled>Y</enabled>
                    <name>LINES_INPUT</name>
                </field>
                <field>
                    <id>LINES_OUTPUT</id>
                    <enabled>Y</enabled>
                    <name>LINES_OUTPUT</name>
                </field>
                <field>
                    <id>LINES_REJECTED</id>
                    <enabled>Y</enabled>
                    <name>LINES_REJECTED</name>
                </field>
                <field>
                    <id>ERRORS</id>
                    <enabled>Y</enabled>
                    <name>ERRORS</name>
                </field>
                <field>
                    <id>INPUT_BUFFER_ROWS</id>
                    <enabled>Y</enabled>
                    <name>INPUT_BUFFER_ROWS</name>
                </field>
                <field>
                    <id>OUTPUT_BUFFER_ROWS</id>
                    <enabled>Y</enabled>
                    <name>OUTPUT_BUFFER_ROWS</name>
                </field>
            </perf-log-table>
            <channel-log-table>
                <connection/>
                <schema/>
                <table/>
                <timeout_days/>
                <field>
                    <id>ID_BATCH</id>
                    <enabled>Y</enabled>
                    <name>ID_BATCH</name>
                </field>
                <field>
                    <id>CHANNEL_ID</id>
                    <enabled>Y</enabled>
                    <name>CHANNEL_ID</name>
                </field>
                <field>
                    <id>LOG_DATE</id>
                    <enabled>Y</enabled>
                    <name>LOG_DATE</name>
                </field>
                <field>
                    <id>LOGGING_OBJECT_TYPE</id>
                    <enabled>Y</enabled>
                    <name>LOGGING_OBJECT_TYPE</name>
                </field>
                <field>
                    <id>OBJECT_NAME</id>
                    <enabled>Y</enabled>
                    <name>OBJECT_NAME</name>
                </field>
                <field>
                    <id>OBJECT_COPY</id>
                    <enabled>Y</enabled>
                    <name>OBJECT_COPY</name>
                </field>
                <field>
                    <id>REPOSITORY_DIRECTORY</id>
                    <enabled>Y</enabled>
                    <name>REPOSITORY_DIRECTORY</name>
                </field>
                <field>
                    <id>FILENAME</id>
                    <enabled>Y</enabled>
                    <name>FILENAME</name>
                </field>
                <field>
                    <id>OBJECT_ID</id>
                    <enabled>Y</enabled>
                    <name>OBJECT_ID</name>
                </field>
                <field>
                    <id>OBJECT_REVISION</id>
                    <enabled>Y</enabled>
                    <name>OBJECT_REVISION</name>
                </field>
                <field>
                    <id>PARENT_CHANNEL_ID</id>
                    <enabled>Y</enabled>
                    <name>PARENT_CHANNEL_ID</name>
                </field>
                <field>
                    <id>ROOT_CHANNEL_ID</id>
                    <enabled>Y</enabled>
                    <name>ROOT_CHANNEL_ID</name>
                </field>
            </channel-log-table>
            <step-log-table>
                <connection/>
                <schema/>
                <table/>
                <timeout_days/>
                <field>
                    <id>ID_BATCH</id>
                    <enabled>Y</enabled>
                    <name>ID_BATCH</name>
                </field>
                <field>
                    <id>CHANNEL_ID</id>
                    <enabled>Y</enabled>
                    <name>CHANNEL_ID</name>
                </field>
                <field>
                    <id>LOG_DATE</id>
                    <enabled>Y</enabled>
                    <name>LOG_DATE</name>
                </field>
                <field>
                    <id>TRANSNAME</id>
                    <enabled>Y</enabled>
                    <name>TRANSNAME</name>
                </field>
                <field>
                    <id>STEPNAME</id>
                    <enabled>Y</enabled>
                    <name>STEPNAME</name>
                </field>
                <field>
                    <id>STEP_COPY</id>
                    <enabled>Y</enabled>
                    <name>STEP_COPY</name>
                </field>
                <field>
                    <id>LINES_READ</id>
                    <enabled>Y</enabled>
                    <name>LINES_READ</name>
                </field>
                <field>
                    <id>LINES_WRITTEN</id>
                    <enabled>Y</enabled>
                    <name>LINES_WRITTEN</name>
                </field>
                <field>
                    <id>LINES_UPDATED</id>
                    <enabled>Y</enabled>
                    <name>LINES_UPDATED</name>
                </field>
                <field>
                    <id>LINES_INPUT</id>
                    <enabled>Y</enabled>
                    <name>LINES_INPUT</name>
                </field>
                <field>
                    <id>LINES_OUTPUT</id>
                    <enabled>Y</enabled>
                    <name>LINES_OUTPUT</name>
                </field>
                <field>
                    <id>LINES_REJECTED</id>
                    <enabled>Y</enabled>
                    <name>LINES_REJECTED</name>
                </field>
                <field>
                    <id>ERRORS</id>
                    <enabled>Y</enabled>
                    <name>ERRORS</name>
                </field>
                <field>
                    <id>LOG_FIELD</id>
                    <enabled>N</enabled>
                    <name>LOG_FIELD</name>
                </field>
            </step-log-table>
            <metrics-log-table>
                <connection/>
                <schema/>
                <table/>
                <timeout_days/>
                <field>
                    <id>ID_BATCH</id>
                    <enabled>Y</enabled>
                    <name>ID_BATCH</name>
                </field>
                <field>
                    <id>CHANNEL_ID</id>
                    <enabled>Y</enabled>
                    <name>CHANNEL_ID</name>
                </field>
                <field>
                    <id>LOG_DATE</id>
                    <enabled>Y</enabled>
                    <name>LOG_DATE</name>
                </field>
                <field>
                    <id>METRICS_DATE</id>
                    <enabled>Y</enabled>
                    <name>METRICS_DATE</name>
                </field>
                <field>
                    <id>METRICS_CODE</id>
                    <enabled>Y</enabled>
                    <name>METRICS_CODE</name>
                </field>
                <field>
                    <id>METRICS_DESCRIPTION</id>
                    <enabled>Y</enabled>
                    <name>METRICS_DESCRIPTION</name>
                </field>
                <field>
                    <id>METRICS_SUBJECT</id>
                    <enabled>Y</enabled>
                    <name>METRICS_SUBJECT</name>
                </field>
                <field>
                    <id>METRICS_TYPE</id>
                    <enabled>Y</enabled>
                    <name>METRICS_TYPE</name>
                </field>
                <field>
                    <id>METRICS_VALUE</id>
                    <enabled>Y</enabled>
                    <name>METRICS_VALUE</name>
                </field>
            </metrics-log-table>
        </log>
        <maxdate>
            <connection/>
            <table/>
            <field/>
            <offset>0.0</offset>
            <maxdiff>0.0</maxdiff>
        </maxdate>
        <size_rowset>10000</size_rowset>
        <sleep_time_empty>50</sleep_time_empty>
        <sleep_time_full>50</sleep_time_full>
        <unique_connections>N</unique_connections>
        <feedback_shown>Y</feedback_shown>
        <feedback_size>50000</feedback_size>
        <using_thread_priorities>Y</using_thread_priorities>
        <shared_objects_file/>
        <capture_step_performance>N</capture_step_performance>
        <step_performance_capturing_delay>1000</step_performance_capturing_delay>
        <step_performance_capturing_size_limit>100</step_performance_capturing_size_limit>
        <dependencies>
        </dependencies>
        <partitionschemas>
        </partitionschemas>
        <slaveservers>
        </slaveservers>
        <clusterschemas>
        </clusterschemas>
        <created_user>-</created_user>
        <created_date>2016/03/01 17:21:47.083</created_date>
        <modified_user>-</modified_user>
        <modified_date>2016/09/01 09:29:12.684</modified_date>
        <key_for_session_key>H4sIAAAAAAAAAAMAAAAAAAAAAAA=</key_for_session_key>
        <is_key_private>N</is_key_private>
    </info>
    <notepads>
    </notepads>
    <connection>
        <name>111</name>
        <server>${hostIp}</server>
        <type>MYSQL</type>
        <access>Native</access>
        <database>${dbName}</database>
        <port>${port}</port>
        <username>${userName}</username>
        <password>${password}</password>
        <servername/>
        <data_tablespace/>
        <index_tablespace/>
        <attributes>
            <attribute>
                <code>EXTRA_OPTION_MYSQL.defaultFetchSize</code>
                <attribute>500</attribute>
            </attribute>
            <attribute>
                <code>EXTRA_OPTION_MYSQL.useCursorFetch</code>
                <attribute>true</attribute>
            </attribute>
            <attribute>
                <code>FORCE_IDENTIFIERS_TO_LOWERCASE</code>
                <attribute>N</attribute>
            </attribute>
            <attribute>
                <code>FORCE_IDENTIFIERS_TO_UPPERCASE</code>
                <attribute>N</attribute>
            </attribute>
            <attribute>
                <code>IS_CLUSTERED</code>
                <attribute>N</attribute>
            </attribute>
            <attribute>
                <code>PORT_NUMBER</code>
                <attribute>${port}</attribute>
            </attribute>
            <attribute>
                <code>PRESERVE_RESERVED_WORD_CASE</code>
                <attribute>N</attribute>
            </attribute>
            <attribute>
                <code>QUOTE_ALL_FIELDS</code>
                <attribute>N</attribute>
            </attribute>
            <attribute>
                <code>STREAM_RESULTS</code>
                <attribute>Y</attribute>
            </attribute>
            <attribute>
                <code>SUPPORTS_BOOLEAN_DATA_TYPE</code>
                <attribute>Y</attribute>
            </attribute>
            <attribute>
                <code>SUPPORTS_TIMESTAMP_DATA_TYPE</code>
                <attribute>Y</attribute>
            </attribute>
            <attribute>
                <code>USE_POOLING</code>
                <attribute>N</attribute>
            </attribute>
        </attributes>
    </connection>
    <order>
        <hop>
            <from>表输入</from>
            <to>Excel输出</to>
            <enabled>Y</enabled>
        </hop>
    </order>
    <step>
        <name>Excel输出</name>
        <type>ExcelOutput</type>
        <description/>
        <distribute>Y</distribute>
        <custom_distribution/>
        <copies>1</copies>
        <partitioning>
            <method>none</method>
            <schema_name/>
        </partitioning>
        <header>Y</header>
        <footer>N</footer>
        <encoding>UTF-8</encoding>
        <append>N</append>
        <add_to_result_filenames>Y</add_to_result_filenames>
        <file>
            <name>${savePath}</name>
            <extention>xls</extention>
            <do_not_open_newfile_init>N</do_not_open_newfile_init>
            <create_parent_folder>N</create_parent_folder>
            <split>N</split>
            <add_date>N</add_date>
            <add_time>N</add_time>
            <SpecifyFormat>N</SpecifyFormat>
            <date_time_format/>
            <sheetname>Sheet1</sheetname>
            <autosizecolums>Y</autosizecolums>
            <nullisblank>N</nullisblank>
            <protect_sheet>N</protect_sheet>
            <password>Encrypted</password>
            <splitevery>0</splitevery>
            <usetempfiles>N</usetempfiles>
            <tempdirectory/>
        </file>
        <template>
            <enabled>N</enabled>
            <append>N</append>
            <filename>C:\Users\47\Desktop\通用模板.xls</filename>
        </template>
        <fields>
        </fields>
        <custom>
            <header_font_name>arial</header_font_name>
            <header_font_size>10</header_font_size>
            <header_font_bold>Y</header_font_bold>
            <header_font_italic>N</header_font_italic>
            <header_font_underline>no</header_font_underline>
            <header_font_orientation>horizontal</header_font_orientation>
            <header_font_color>white</header_font_color>
            <header_background_color>dark_teal</header_background_color>
            <header_row_height>320</header_row_height>
            <header_alignment>center</header_alignment>
            <header_image/>
            <row_font_name>arial</row_font_name>
            <row_font_size>10</row_font_size>
            <row_font_color>black</row_font_color>
            <row_background_color>none</row_background_color>
        </custom>
        <cluster_schema/>
        <remotesteps>
            <input></input>
            <output></output>
        </remotesteps>
        <GUI>
            <xloc>544</xloc>
            <yloc>176</yloc>
            <draw>Y</draw>
        </GUI>
    </step>

    <step>
        <name>表输入</name>
        <type>TableInput</type>
        <description/>
        <distribute>N</distribute>
        <custom_distribution/>
        <copies>1</copies>
        <partitioning>
            <method>none</method>
            <schema_name/>
        </partitioning>
        <connection>111</connection>
        <sql>${sql}</sql>
        <limit>0</limit>
        <lookup/>
        <execute_each_row>N</execute_each_row>
        <variables_active>Y</variables_active>
        <lazy_conversion_active>N</lazy_conversion_active>
        <cluster_schema/>
        <remotesteps>
            <input></input>
            <output></output>
        </remotesteps>
        <GUI>
            <xloc>272</xloc>
            <yloc>176</yloc>
            <draw>Y</draw>
        </GUI>
    </step>

    <step_error_handling>
    </step_error_handling>
    <slave-step-copy-partition-distribution>
    </slave-step-copy-partition-distribution>
    <slave_transformation>N</slave_transformation>

</transformation>

主方法里调用:

Kettle.export(sql, SystemConstant.EXPORT_PATH, "数据库IP", "数据库名字", "数据库端口", "数据库账号", "数据库密码");

你也可以用Springboot整合,有对应maven及配置,自行查资料。

效率测试:

*导出10w行记录 

        执行时间: 1133ms

        执行时间: 1082ms 

        执行时间: 1096ms

* 导出100w行记录  

            执行时间: 39784ms

            执行时间: 8566ms 

            执行时间: 8622ms 
* Excel 2007行数极限 1048575 执行时间: 9686ms 

第一次导数据要加载kettle组件运行稍慢,后面几次再导数据速度就飞快了,更多结果有兴趣的可以去试试。

posted @ 2020-11-17 18:01  47号Gamer丶  阅读(3048)  评论(0编辑  收藏  举报