ETL from hadoop to vertica
根据项目需要,我做了一个POC(proof of concept),XML TXT的数据从HADOOP
引入到VERTICA.
我采用的方案是pig,具体信息可以参加vertica官方的文档.
Access hp vertica from pig:
https://my.vertica.com/docs/6.1.x/HTML/index.htm#18525.htm
我用virtualbox 在本机搭建了一个hadoop集群(apache version) ,从vertica 官网下载了一个
虚拟机(开放文件格式的那版),然后导入到virtualbox,并且使这四台机器都在一个网段.
我测试了把txt 和xml文件类型导入到vertica.
还有就是从vertica load数据到hdfs.
从vertica获取数据:
A =LOAD 'sql://{select * from customer_dimension limit 100;}' USING com.vertica.pig.VerticaLoader('vertica','VMart','5433','dbadmin','password')
把XML文件从HDFS ETL 到 vertica的pig latin 代码
Register ' /home/hadoop/pig-0.12.0/contrib/piggybank/java/piggybank.jar;' REGISTER '/home/hadoop/pig-0.12.0/lib/vertica-jdk5-6.1.3-0.jar'; REGISTER '/home/hadoop/pig-0.12.0/lib/pig-vertica.jar'; xml = load '/tmp/xmldemo.xml' using org.apache.pig.piggybank.storage.XMLLoader('Step') as(step:chararray); rows = foreach xml GENERATE REGEX_EXTRACT(step,'StepName="(.*?)"',1) as stepname:chararray, REGEX_EXTRACT(step,'StepDescription="(.*?)"',1) as stepdesc:chararray, REGEX_EXTRACT(step,'TestDateTimeStarted="(.*?)"',1) as timestart:chararray, REGEX_EXTRACT(step,'GUIResponseTime="(.*?)"',1) as guitime:chararray, REGEX_EXTRACT(step,'StepNumber="(.*?)"',1) as stepnumber:chararray, REGEX_EXTRACT(step,'TestResult="(.*?)"',1) as result:chararray, REGEX_EXTRACT(step,'TestElapsedTimeSec="(.*?)"',1) as testtimetotal:chararray STORE rows INTO '{xmltable(stepname varchar,stepdesc varchar,timestart varchar,guiresponsetime varchar,stepnumber int,tsetresult varchar,totalseconds int)}' USING com.vertica.pig.VerticaStorer('vertica','VMart','5433', 'dbadmin','password');
Looking for a job working at Home about MSBI