本文来自与作者阅读 Programming Pig 所做的笔记,转载请注明出处 http://www.cnblogs.com/siwei1988/archive/2012/08/06/2624912.html
。Pig Latin是一种数据流语言,变量的命名规则同java中变量的命名规则,变量名可以复用(不建议这样做,这种情况下相当与新建一个变量,同时删除原来的变量)
A = load 'NYSE_dividends' (exchange, symbol, date, dividends); A = filter A by dividends > 0; A = foreach A generate UPPER(symbol);
。注释:--单行注释;/*……*/多行注释;
。Pig Latin关键词不区分大小写,比如load,foreach,但是变量名和udf区分大小写,COUNT是udf,所以不同于count。
。Load 加载数据
默认加载当前用户的home目录(/users/yourlogin
),可以在grunt下输入cd 命令更改当前所在目录。
divs = load '/data/examples/NYSE_dividends'
也可以输入完整的文件名
divs = load ‘hdfs://nn.acme.com/data/examples/NYSE_dividends’
默认使用TAB(\t)作为分割符,也可以使用using定义其它的分割符
divs = load 'NYSE_dividends' using PigStorage(',');
注意:只能用一个字符作为分割符
还可以使用using定义其它的加载函数
divs = load 'NYSE_dividends' using HBaseStorage();
as用于定义模式
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
也可以使用通配符加载一个目录下的所有文件,该目录下的所有子目录的文件也会被加载。通配符由hadoop文件系统决定,下面是hadoop 0.20所支持的通配符
glob | comment |
---|---|
? | Matches any single character. |
* | Matches zero or more characters. |
[abc] | Matches a single character from character set (a,b,c). |
[a-z] | Matches a single character from the character range (a..z), inclusive. The first character must be lexicographically less than or equal to the second character. |
[^abc] | Matches a single character that is not in the character set (a, b, c). The ^ character must occur immediately to the right of the opening bracket. |
[^a-z] | Matches a single character that is not from the character range (a..z) inclusive. The ^ character must occur immediately to the right of the opening bracket. |
\c | Removes (escapes) any special meaning of character c. |
{ab,cd} | Matches a string from the string set {ab, cd} |
。as 定义模式,可用于load ** [as (ColumnName[:type])],foreach…generate ColumnName [as newColumnName]
。store存储数据,默认用using PigStorage 使用tab作为分割符。
store processed into '/data/examples/processed';
也可以输入完整路径比如hdfs://nn.acme.com/data/examples/processed
.
可以使用using调用其它存储函数或其它分割符
store processed into 'processed' using HBaseStorage();
store processed into 'processed' using PigStorage(',');
注意:数据存储并不是存储为一个文件,而是由reduce进程数决定的多个part文件。
。foreach…generate[*][begin .. end]
*匹配所有,同样适用与udf;
..匹配begin和end之间的部分,包括begin和end
prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); beginning = foreach prices generate ..open; -- produces exchange, symbol, date, open middle = foreach prices generate open..close; -- produces open, high, low, close end = foreach prices generate volume..; -- produces volume, adj_close
一般情况下foreach…generate…重新生成的模式中的数据名和数据类型保持原来的名字和数据类型,但是如果有表达式则不会,可以在generate 变量后使用as关键词定义别名;
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); sym = foreach divs generate symbol; describe sym; sym: {symbol: chararray}
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); in_cents = foreach divs generate dividends * 100.0 as dividend, dividends * 100.0; describe in_cents; in_cents: {dividend: double,double}
#用于map查找;.用于tuple(元组)投影;
bball = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]); avg = foreach bball generate bat#'batting_average';
A = load 'input' as (t:tuple(x:int, y:int)); B = foreach A generate t.x, t.$1;
3.获取bag(包)中的数据
A = load 'input' as (b:bag{t:(x:int, y:int)}); B = foreach A generate b.x;
A = load 'input' as (b:bag{t:(x:int, y:int)}); B = foreach A generate b.(x, y);
下面的语句将执行不了
A = load 'foo' as (x:chararray, y:int, z:int); B = group A by x; -- produces bag A containing all the records for a given value of x C = foreach B generate SUM(A.y + A.z);
因为A.y 和 A.z都是bag,符号+对于bag不适用。
正确的做法如下
A = load 'foo' as (x:chararray, y:int, z:int); A1 = foreach A generate x, y + z as yz; B = group A1 by x; C = foreach B generate SUM(A1.yz);
。foreach中嵌套其它语句
--distinct_symbols.pig daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields grpd = group daily by exchange; uniqcnt = foreach grpd { sym = daily.symbol; uniq_sym = distinct sym; generate group, COUNT(uniq_sym); };
注意:foreach内部只支持distinct
, filter
, limit
, order关键词;最后一句必须是generate;
--double_distinct.pig divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray); grpd = group divs all; uniq = foreach grpd { exchanges = divs.exchange; uniq_exchanges = distinct exchanges; symbols = divs.symbol; uniq_symbols = distinct symbols; generate COUNT(uniq_exchanges), COUNT(uniq_symbols); };
。flatten消除包嵌套关系
--flatten.pig players = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]); pos = foreach players generate name, flatten(position) as position; bypos = group pos by position;
--flatten_noempty.pig players = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]); noempty = foreach players generate name, ((position is null or IsEmpty(position)) ? {('unknown')} : position) as position; pos = foreach noempty generate name, flatten(position) as position; bypos = group pos by position;
。filter (注:pig中的逻辑语句同样遵循短路原则)
注意:null == 任何数据
。filter结合matches使用正则表达式(matches前加not表示不匹配)
pig中的正则表达式格式和java中的正则表达所一样,参考 http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
各种转义字符,转义字符使用方式:\\后面跟上转义码
点的转义:. ==> u002E 美元符号的转义:$ ==> u0024 乘方符号的转义:^ ==> u005E 左大括号的转义:{ ==> u007B 左方括号的转义:[ ==> u005B 左圆括号的转义:( ==> u0028 竖线的转义:| ==> u007C 右圆括号的转义:) ==> u0029 星号的转义:* ==> u002A 加号的转义:+ ==> u002B 问号的转义:? ==> u003F 反斜杠的转义: ==> u005C
下面的例子查找包括CM.的记录
-- filter_not_matches.pig divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); notstartswithcm = filter divs by not symbol matches '.*CM\\2u002E1.*';
。group之后的数据是一个map,其中key是group所用的键值,value是group针对的变量;
可用()同时对多个变量作group,group…all用于所有变量(注意:使用all时没有by),group之后的变量分为两个部分,第一部分变量名是group(不能更改),第二部是和原始bag模式一样的bag。
--twokey.pig
daily = load 'NYSE_daily' as (exchange, stock, date, dividends);
grpd = group daily by (exchange, stock);
avg = foreach grpd generate group, AVG(daily.dividends);
describe grpd;
grpd: {group: (exchange: bytearray,stock: bytearray),daily: {exchange: bytearray,
stock: bytearray,date: bytearray,dividends: bytearray}}
--countall.pig daily = load 'NYSE_daily' as (exchange, stock); grpd = group daily all; cnt = foreach grpd generate COUNT(daily);
。cogroup对多个变量进行group
注意:所有key值为null的数据都被归为同一类,这一点和group相同,和join不同。
A = load 'input1' as (id:int, val:float); B = load 'input2' as (id:int, val2:int); C = cogroup A by id, B by id; describe C; C: {group: int,A: {id: int,val: float},B: {id: int,val2: int}}
。order by
对单列进行排序
--order.pig daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); bydate = order daily by date;
对多列进行排序
--order2key.pig daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); bydatensymbol = order daily by date, symbol;
desc关键词按降序进行排序,null小于所有词
--orderdesc.pig daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); byclose = order daily by close desc, open; dump byclose; -- open still sorted in ascending order
。distinct只能去掉整个元组的重复行,不能去掉某几个特定列的重复行
--distinct.pig -- find a distinct list of ticker symbols for each exchange -- This load will truncate the records, picking up just the first two fields. daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray); uniq = distinct daily;
。join/left join / right join
null不匹配任何数据
-- join2key.pig daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends); jnd = join daily by (symbol, date), divs by (symbol, date);
--leftjoin.pig daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends); jnd = join daily by (symbol, date) left outer, divs by (symbol, date);
也可以同时多个变量,但只用于inner join
A = load 'input1' as (x, y); B = load 'input2' as (u, v); C = load 'input3' as (e, f); alpha = join A by x, B by u, C by e;
也可以自身和自身join,但数据要加载两次
--selfjoin.pig -- For each stock, find all dividends that increased between two dates divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends); divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends); jnd = join divs1 by symbol, divs2 by symbol; increased = filter jnd by divs1::date < divs2::date and divs1::dividends < divs2::dividends;
下面这样不行
--selfjoin.pig -- For each stock, find all dividends that increased between two dates divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends); jnd = join divs1 by symbol, divs1 by symbol; increased = filter jnd by divs1::date < divs2::date and divs1::dividends < divs2::dividends;
。union 相当与sql中的union,但与sql不通的是pig中的union可以针对两个不同模式的变量:如果两个变量模式相同,那么union后的变量模式与变量的模式一样;如果一个变量的模式可以由另一各变量的模式强制类型转换,那么union后的变量模式与转换后的变量模式相同;否则,union后的变量没有模式。
A = load 'input1' as (x:int, y:float); B = load 'input2' as (x:int, y:float); C = union A, B; describe C; C: {x: int,y: float} A = load 'input1' as (x:double, y:float); B = load 'input2' as (x:int, y:double); C = union A, B; describe C; C: {x: double,y: double} A = load 'input1' as (x:int, y:float); B = load 'input2' as (x:int, y:chararray); C = union A, B; describe C; Schema for C unknown.
注意:在pig 1.0中 执行不了最后一种union。
注意:union不会剔除重复的行
如果需要对两个具有不通列名的变量union的话,可以使用onschema关键字
A = load 'input1' as (w: chararray, x:int, y:float); B = load 'input2' as (x:int, y:double, z:chararray); C = union onschema A, B; describe C; C: {w: chararray,x: int,y: double,z: chararray}
。cross 相当于离散数学中的叉乘,输入行数分别为m行,n行,输出行数则为m*n行。
--thetajoin.pig --I recommand running this one on a cluster too daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); crossed = cross daily, divs; tjnd = filter crossed by daily::date < divs::date;
。limit
--limit.pig divs = load 'NYSE_dividends'; first10 = limit divs 10;
在pig中除了order by 之外生成的数据都没有固定的顺序。上面的程序每次生成的数据也是不一样的。
。sample 用于生成测试数据,按指定参数选取部分数据。下面的程序选取10%的数据。
--sample.pig divs = load 'NYSE_dividends'; some = sample divs 0.1;
。Parallel 设置pig的reduce进程个数
--parallel.pig daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); bysymbl = group daily by symbol parallel 10;
parallel只针对一条语句,如果希望脚本中的所有语句都有10个reduce进程,可以使用 set default_parallel 10命令
--defaultparallel.pig set default_parallel 10; daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); bysymbl = group daily by symbol; average = foreach bysymbl generate group, AVG(daily.close) as avg; sorted = order average by avg desc;
如果同时使用parallel和set default_parallel,那么parallel中的参数将覆盖set default_parallel
。UDF
注册udf
--register.pig register 'your_path_to_piggybank/piggybank.jar'; divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse(symbol);
定义udf别名
--define.pig register 'your_path_to_piggybank/piggybank.jar'; define reverse org.apache.pig.piggybank.evaluation.string.Reverse(); divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); backwards = foreach divs generate reverse(symbol);
构造函数带参数的udf
--define_constructor_args.pig register 'acme.jar'; define convert com.acme.financial.CurrencyConverter('dollar', 'euro'); divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); backwards = foreach divs generate convert(dividends);
。托管java中的静态函数(效率较低)
--invoker.pig define hex InvokeForString('java.lang.Integer.toHexString', 'int'); divs = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close); nonnull = filter divs by volume is not null; inhex = foreach nonnull generate symbol, hex((int)volume);
如果函数的参数是一个数组,那么传递过去的是一个bag
define stdev InvokeForDouble('com.acme.Stats.stdev', 'double[]'); A = load 'input' as (id: int, dp:double); B = group A by id; C = foreach B generate group, stdev(A.dp);
。multiquery
--multiquery.pig players = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]); pwithba = foreach players generate name, team, position, bat#'batting_average' as batavg; byteam = group pwithba by team; avgbyteam = foreach byteam generate group, AVG(pwithba.batavg); store avgbyteam into 'by_team'; flattenpos = foreach pwithba generate name, team, flatten(position) as position, batavg; bypos = group flattenpos by position; avgbypos = foreach bypos generate group, AVG(flattenpos.batavg); store avgbypos into 'by_position';
。split
wlogs = load 'weblogs' as (pageid, url, timestamp); split wlogs into apr03 if timestamp < '20110404', apr02 if timestamp < '20110403' and timestamp > '20110401', apr01 if timestamp < '20110402' and timestamp > '20110331'; store apr03 into '20110403'; store apr02 into '20110402'; store apr01 into '20110401';
。设置pig环境
Parameter | Value Type | Description |
---|---|---|
debug | string | Sets the logging level to DEBUG . Equivalent to passing -debug DEBUG on the command line. |
default_parallel | integer | Sets a default parallel level for all reduce operations in the script. See the section called “Parallel” for details. |
job.name | string | Assigns a name to the Hadoop job. By default the name is the filename of the script being run, or a randomly generated name for interactive sessions. |
job.priority | string Type | If your Hadoop cluster is using the Capacity Scheduler with priorities enabled for queues, this allows you to set the priority of your Pig job. Allowed values are very_low , low , normal , high , very_high . |
。parameter 向pig脚本传递参数
--daily.pig daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); yesterday = filter daily by date == '$DATE'; grpd = group yesterday all; minmax = foreach grpd generate MAX(yesterday.high), MIN(yesterday.low);
用-p 传递参数,每个变量前都要加一个-p
pig -p DATE=2009-12-17 daily.pig
参数也可以放在一个文件里,每行一个参数,注释部分以#开头,使用-m
或者 -param_file
.调用参数文件
pig脚本
wlogs = load 'clicks/$YEAR$MONTH01' as (url, pageid, timestamp);
参数文件
#Param file YEAR=2009- MONTH=12- DAY=17 DATE=$YEAR$MONTH$DAY
执行
pig -param_file daily.params daily.pig
也可以在pig内定义参数%declare 或者 %default,%default定义默认的参数,在特殊情况下可以被覆盖
注意:%declare和%default不能用于以下位置:
- pig脚本,此脚本非Macro宏,并且脚本被另外一个脚本调用(如果不被调用可以使用)
%default parallel_factor 10; wlogs = load 'clicks' as (url, pageid, timestamp); grp = group wlogs by pageid parallel $parallel_factor; cntd = foreach grp generate group, COUNT(wlogs);
。定义Macro宏,相当于子函数
--macro.pig -- Given daily input and a particular year, analyze how -- stock prices changed on days dividends were paid out. define dividend_analysis (daily, year, daily_symbol, daily_open, daily_close) returns analyzed { divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); divsthisyear = filter divs by date matches '$year-.*'; dailythisyear = filter $daily by date matches '$year-.*'; jnd = join divsthisyear by symbol, dailythisyear by $daily_symbol; $analyzed = foreach jnd generate dailythisyear::$daily_symbol, $daily_close - $daily_open; }; daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
。引用pig文件,被引用的文件被执行一遍,相当于拼接在一起,被引用的文件中不能存在自定义变量
--main.pig import '../examples/ch6/dividend_analysis.pig'; daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float); results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
默认搜索文件夹为当前文件夹,可以使用set pig.import.search.path设置搜索的路径
set pig.import.search.path '/usr/local/pig,/grid/pig'; import 'acme/macros.pig';