Simple Example Use Cases (translated from the Hive GettingStarted guide)

1. MovieLens User Ratings

First, create a table with tab-delimited text file format:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
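
As a quick sanity check (an extra step, not in the original guide), you can confirm the table was created with the expected columns:

-- Optional check, not part of the original tutorial:
DESCRIBE u_data;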

Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and index of unzipped files):

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

or:

curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note:  If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

Unzip the data files:

unzip ml-100k.zip

And load u.data into the table that was just created:

LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
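
The '<path>' placeholder depends on where you unzipped the archive. As a hedged example, assuming ml-100k.zip was extracted in the current working directory, the file sits under ml-100k/:

-- Assumes ml-100k.zip was unzipped in the current directory,
-- so the extracted ratings file is ml-100k/u.data.
LOAD DATA LOCAL INPATH 'ml-100k/u.data' OVERWRITE INTO TABLE u_data;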

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
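
Ordinary aggregates work the same way. As an extra example, not in the original guide, here are the ten most-rated movies with their average rating:

-- Extra warm-up query (not in the original tutorial):
-- most-rated movies first, with their mean rating.
SELECT movieid, COUNT(*) AS num_ratings, AVG(rating) AS avg_rating
FROM u_data
GROUP BY movieid
ORDER BY num_ratings DESC
LIMIT 10;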

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

import sys
import datetime

# Read tab-separated rows from Hive on stdin, replace the unix
# timestamp with the day of the week (1 = Monday .. 7 = Sunday),
# and write the transformed row back to stdout.
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))

Use the mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

Explanation: the statement above cleans the data in u_data by streaming each row through the Python script:

TRANSFORM (userid, movieid, rating, unixtime)   -- input columns
  USING 'python weekday_mapper.py'              -- script that processes each row
  AS (userid, movieid, rating, weekday)         -- output columns

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
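
As a small variant, not in the original guide, you can sort the histogram so the busiest weekdays come first:

-- Variant of the query above (not in the original tutorial):
-- busiest weekdays first.
SELECT weekday, COUNT(*) AS cnt
FROM u_data_new
GROUP BY weekday
ORDER BY cnt DESC;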

2. Apache Weblog Data

The format of Apache weblogs is customizable, though most webmasters use the default.
For the default Apache weblog format, we can create a table with the following command.

More about RegexSerDe can be found in HIVE-662 and HIVE-1719.

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
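
After log lines are loaded into the table (for example with LOAD DATA, as above), each regex capture group is exposed as an ordinary column. As an illustrative query, not part of the original guide, here is a breakdown of requests by HTTP status:

-- Illustrative query (not in the original tutorial):
-- request counts per HTTP status code.
SELECT status, COUNT(*) AS hits
FROM apachelog
GROUP BY status
ORDER BY hits DESC;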
