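Introduction to Pig and a Pig version of hello world
A couple of days ago I used Pig for ETL. I only skimmed it and did not study it systematically, but Pig seems worth learning properly, so I went back to reading Programming Pig.
Below are my notes on the first chapter:
What is Pig?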
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a
language, Pig Latin, for expressing these data flows. Pig Latin includes operators for
many of the traditional data operations (join, sort, filter, etc.), as well as the ability for
users to develop their own functions for reading, processing, and writing data.
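To make the operators mentioned above concrete, here is a minimal sketch of filter, join, and sort in Pig Latin. The file names (users.txt, clicks.txt) and their fields are hypothetical, made up purely for illustration, and both files are assumed to be tab-separated.

```pig
-- Hypothetical inputs (not from the book): users.txt and clicks.txt, tab-separated.
users  = load 'users.txt'  as (name:chararray, age:int);
clicks = load 'clicks.txt' as (name:chararray, url:chararray);

-- filter: keep only users aged 18 or over
adults = filter users by age >= 18;

-- join: pair each remaining user with their clicks on the shared name field
jnd = join adults by name, clicks by name;

-- sort: order the joined records by age, oldest first
srtd = order jnd by adults::age desc;

dump srtd;
```

Each statement names an intermediate relation, so the script reads as a data flow rather than a single nested query.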
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce.
The project's naming follows the theme: Pig Latin for the language, Grunt for the shell, and Piggybank for a CPAN-like shared repository.
What is Pig used for?
ETL data pipelines
Research on raw (unstructured) data
Pig Philosophy
Pigs eat anything;
Pigs live anywhere;
Pigs fly;
Pigs are domestic animals (it is easy to write UDFs, user-defined functions).
Pig version of hello world:
data:
hello world, hello pig
hello hadooop, hello hdfs
I love programming
I love this world
I love programming with pig
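Pig script:
-- load data.txt; the lines contain no tabs, so each whole line lands in the single field
txt = load 'data.txt' as (line:chararray);
-- split each line into words and flatten the resulting bag into one word per record
words = foreach txt generate flatten(TOKENIZE(line)) as word;
-- collect identical words into one group
grpd = group words by word;
describe grpd;
-- count the records in each group
cntd = foreach grpd generate group, COUNT(words);
dump cntd;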
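As a possible next step, the same word count can be sorted by frequency and written to a directory instead of dumped to the console. This is a sketch under assumptions, not something taken from the book: the output path wordcount_out and the field names word and cnt are names I chose for illustration.

```pig
-- Same pipeline as the script above, with named output fields.
txt   = load 'data.txt' as (line:chararray);
words = foreach txt generate flatten(TOKENIZE(line)) as word;
grpd  = group words by word;
cntd  = foreach grpd generate group as word, COUNT(words) as cnt;

-- Most frequent words first.
srtd = order cntd by cnt desc;

-- 'wordcount_out' is a hypothetical output directory; store fails if it already exists.
store srtd into 'wordcount_out';
```

Saved under a name such as wordcount.pig (again hypothetical), the script can be tried without a cluster by running pig -x local wordcount.pig, which uses the local file system instead of HDFS. On the sample data above, hello should come out on top with a count of 4, assuming TOKENIZE's default delimiters, which include spaces and commas.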