
HIVE outer join TIPS

2014-02-08 12:05  briller

OUTER JOIN Gotcha
Before we discuss the other outer joins, let’s discuss a gotcha you should understand.
Recall what we said previously about speeding up queries by adding partition filters in the WHERE clause. To speed up our previous query, we might choose to add predicates that select on the exchange in both tables:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL'
> AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1987-05-11   AAPL   77.0    0.015
1987-08-10   AAPL   48.25   0.015
1987-11-17   AAPL   35.0    0.02
1988-02-12   AAPL   41.0    0.02
1988-05-16   AAPL   41.25   0.02
...
However, the output has changed, even though we thought we were just adding an
optimization! We’re back to having approximately four stock records per year and we
have non-NULL entries for all the dividend values. In other words, we are back to the
original inner join!
This is actually common behavior for all outer joins in most SQL implementations. It
occurs because the JOIN clause is evaluated first, then the results are passed through
the WHERE clause. By the time the WHERE clause is reached, d.exchange is NULL most of the
time, so the “optimization” actually filters out all records except those on the day of
dividend payments.
One solution is straightforward: remove the clauses in the WHERE clause that reference
the dividends table:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL' AND s.exchange = 'NASDAQ';
...
1987-05-07   AAPL   80.25   NULL
1987-05-08   AAPL   79.0    NULL
1987-05-11   AAPL   77.0    0.015
1987-05-12   AAPL   75.5    NULL
1987-05-13   AAPL   78.5    NULL
...
This isn’t very satisfactory. You might wonder if you can move the predicates from the
WHERE clause into the ON clause, at least the partition filters. This does not work for outer
joins, despite documentation on the Hive Wiki that claims it should work (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins).
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
> AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
...
1962-01-02   GE    74.75    NULL
1962-01-02   IBM   572.0    NULL
1962-01-03   GE    74.0     NULL
1962-01-03   IBM   577.0    NULL
1962-01-04   GE    73.12    NULL
1962-01-04   IBM   571.25   NULL
1962-01-05   GE    71.25    NULL
1962-01-05   IBM   560.0    NULL
...
The partition filters are ignored for OUTER JOINs. However, using such filter predicates
in ON clauses for inner joins does work!
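For example, here is a minimal sketch of the inner-join case (same stocks and dividends tables as above), where pushing the exchange filters into the ON clause behaves as expected:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
> AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ'
> WHERE s.symbol = 'AAPL';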
Fortunately, there is a solution that works for all joins: use nested SELECT statements:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM
> (SELECT * FROM stocks WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') s
> LEFT OUTER JOIN
> (SELECT * FROM dividends WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') d
> ON s.ymd = d.ymd;
...
1988-02-10   AAPL   41.0    NULL
1988-02-11   AAPL   40.63   NULL
1988-02-12   AAPL   41.0    0.02
1988-02-16   AAPL   41.25   NULL
1988-02-17   AAPL   41.88   NULL
...
The nested SELECT statement performs the required “push down” to apply the partition
filters before data is joined.
WHERE clauses are evaluated after joins are performed, so WHERE clauses should use predicates that only filter on column values that won't be NULL. Also, contrary to Hive documentation, partition filters don't work in ON clauses for OUTER JOINs, although they do work for INNER JOINs!
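If you really do need a predicate on the dividends side in the WHERE clause, one NULL-aware variant (a sketch, not from the original text) is to explicitly allow the NULL rows that the outer join produces:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL' AND s.exchange = 'NASDAQ'
> AND (d.exchange = 'NASDAQ' OR d.exchange IS NULL);
This keeps the unmatched stock rows while still filtering the matched dividend rows by exchange.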

---------------------------------------------------------

LEFT SEMI-JOIN
A left semi-join returns records from the lefthand table if records are found in the righthand table that satisfy the ON predicates. It's a special, optimized case of the more general
inner join. Most SQL dialects support an IN ... EXISTS construct to do the same thing.
For instance, the following query in Example 6-2 attempts to return stock records only
on the days of dividend payments, but it doesn’t work in Hive.
Example 6-2. Query that will not work in Hive
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
Instead, you use the following LEFT SEMI JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
...
1962-11-05   IBM   361.5
1962-08-07   IBM   373.25
1962-05-08   IBM   459.5
1962-02-06   IBM   551.5
Note that the SELECT and WHERE clauses can't reference columns from the righthand table.
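For example, a variant like the following sketch would be rejected, because d.dividend from the righthand table appears in the SELECT list:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;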

----------------------

The MAPJOIN hint still works, but it's now deprecated as of Hive v0.7. However, you still have to set a property, hive.auto.convert.join, to true before Hive will attempt the optimization. It's false by default:
hive> set hive.auto.convert.join=true;
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
Note that you can also configure the threshold size for table files considered small
enough to use this optimization. Here is the default definition of the property (in bytes):
hive.mapjoin.smalltable.filesize=25000000
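For reference, the older explicit hint mentioned above looks like this (a sketch; MAPJOIN names the small table to be loaded into memory):
hive> SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';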

 

Block Sampling

 

hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;
This sampling is not known to work with all file formats. Also, the
smallest unit of sampling is a single HDFS block. Hence, for tables less
than the typical block size of 128 MB, all rows will be returned.

 

<property>
<name>hive.sample.seednumber</name>
<value>0</value>
<description>A number used for percentage sampling. By changing this
number, user will change the subsets of data sampled.</description>
</property>
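As a quick sketch (reusing the numbersflat table from above), changing this seed and rerunning the percentage sample should pick a different subset of blocks:
hive> set hive.sample.seednumber=3;
hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;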

 

Instead, if we bucket the weblog table and use user_id as the bucketing column, the
value of this column will be hashed by a user-defined number into buckets. Records
with the same user_id will always be stored in the same bucket. Assuming the number
of users is much greater than the number of buckets, each bucket will have many users:
hive> CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (user_id) INTO 96 BUCKETS;
However, it is up to you to insert data correctly into the table! The specification in
CREATE TABLE only defines metadata, but has no effect on commands that actually populate the table.
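One common way to populate it correctly (a sketch, assuming a hypothetical raw_logs staging table with matching columns) is to let Hive enforce the bucketing during the insert:
hive> SET hive.enforce.bucketing = true;
hive> FROM raw_logs   -- raw_logs is a hypothetical staging table with (user_id, url, source_ip, dt)
> INSERT OVERWRITE TABLE weblog PARTITION (dt='2009-02-25')
> SELECT user_id, url, source_ip WHERE dt='2009-02-25';
With hive.enforce.bucketing enabled, Hive sets the number of reducers to match the declared bucket count, so each of the 96 bucket files is written correctly.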