The process of using Sqoop

With Sqoop, you can import data from a relational database system or a mainframe into HDFS. The input to the import process is either a database table or mainframe datasets. For databases, Sqoop reads the table row by row into HDFS. For mainframe datasets, Sqoop reads the records from each mainframe dataset into HDFS. The output of the import process is a set of files containing a copy of the imported table or datasets. Because the import is performed in parallel, the output is spread across multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro files or SequenceFiles containing serialized record data.


A by-product of the import process is a generated Java class that can encapsulate one row of the imported table. Sqoop itself uses this class during the import. The Java source code for the class is also provided to you, for use in subsequent MapReduce processing of the data. The class can serialize and deserialize data to and from the SequenceFile format, and it can also parse the delimited-text form of a record. These abilities let you quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.
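To make the idea of "one class per row" concrete, here is a minimal Python analogue. It is not the Java code that Sqoop generates; the class name `EmployeeRecord`, its two fields, and the comma delimiter are all assumptions chosen for illustration, and escape characters are deliberately not handled.

```python
from dataclasses import dataclass

@dataclass
class EmployeeRecord:
    """Holds one row of a hypothetical two-column `employees` table."""
    id: int
    name: str

    @classmethod
    def parse(cls, line, delimiter=","):
        """Build a record from one line of a delimited text file."""
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != 2:
            raise ValueError(f"expected 2 fields, got {len(fields)}")
        return cls(id=int(fields[0]), name=fields[1])

    def to_line(self, delimiter=","):
        """Write the record back in delimited-text form."""
        return f"{self.id}{delimiter}{self.name}"

record = EmployeeRecord.parse("7,Alice\n")
print(record, record.to_line("\t"))
```

This also shows the "parse it yourself" option from the text: any tool that can split a line on the chosen delimiter can read the imported files.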


After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop's export process reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them as new rows in a target database table, for consumption by external applications or users.
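The export direction can be sketched the same way. The Python snippet below only assembles a `sqoop export` command line and does not execute it; the JDBC URL, table name, and HDFS directory are hypothetical, and `build_export_command` is an illustrative helper, not part of Sqoop.

```python
import shlex

def build_export_command(connect, table, export_dir, field_delim=",", num_mappers=4):
    """Assemble an argv list for `sqoop export` (built only, never executed)."""
    return [
        "sqoop", "export",
        "--connect", connect,                         # JDBC URL of the target database
        "--table", table,                             # table that receives the new rows
        "--export-dir", export_dir,                   # HDFS directory holding the text files
        "--input-fields-terminated-by", field_delim,  # delimiter used in those files
        "--num-mappers", str(num_mappers),            # number of parallel export tasks
    ]

# Hypothetical values, for illustration only.
export_cmd = build_export_command(
    "jdbc:mysql://db.example.com/corp", "employee_summary", "/results/employee_summary"
)
print(shlex.join(export_cmd))
```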


Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and the tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
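As a rough sketch of how these inspection tools are invoked, the Python snippet below assembles one command line per tool without running any of them. It uses the `sqoop <tool>` form of the tool names; the JDBC URL and the SQL statement are hypothetical, and `build_inspect_commands` is an illustrative helper.

```python
import shlex

def build_inspect_commands(connect, query):
    """Return argv lists for the three inspection tools (built only, never executed)."""
    return {
        # List the database schemas available on the server.
        "list-databases": ["sqoop", "list-databases", "--connect", connect],
        # List the tables in the schema named by the JDBC URL.
        "list-tables": ["sqoop", "list-tables", "--connect", connect],
        # Run a single SQL statement and print its result.
        "eval": ["sqoop", "eval", "--connect", connect, "--query", query],
    }

# Hypothetical values, for illustration only.
inspect_cmds = build_inspect_commands(
    "jdbc:mysql://db.example.com/corp", "SELECT COUNT(*) FROM employees"
)
for argv in inspect_cmds.values():
    print(shlex.join(argv))
```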


Most aspects of the import, code generation, and export processes can be customized. For databases, you can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other aspects.
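The customizations listed above map onto optional `sqoop import` arguments. The Python sketch below adds each argument only when a value is supplied and, as before, only builds the command line. The table, columns, filter, and class name are hypothetical, and `build_custom_import` is an illustrative helper.

```python
import shlex

def build_custom_import(connect, table, columns=None, where=None,
                        field_delim=None, escape_char=None, class_name=None):
    """Assemble a customized `sqoop import` argv list (built only, never executed)."""
    cmd = ["sqoop", "import", "--connect", connect, "--table", table]
    if columns:
        cmd += ["--columns", ",".join(columns)]        # restrict the imported columns
    if where:
        cmd += ["--where", where]                      # restrict the imported row range
    if field_delim:
        cmd += ["--fields-terminated-by", field_delim] # field delimiter in the output
    if escape_char:
        cmd += ["--escaped-by", escape_char]           # escape character in the output
    if class_name:
        cmd += ["--class-name", class_name]            # name of the generated class
    return cmd

# Hypothetical values, for illustration only.
custom_cmd = build_custom_import(
    "jdbc:mysql://db.example.com/corp", "employees",
    columns=["id", "name"], where="id > 1000",
    field_delim="\t", escape_char="\\", class_name="com.example.Employee",
)
print(shlex.join(custom_cmd))
```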

posted @ 2018-04-03 16:23  lfm601508022