Hive Advanced Queries, Part 1

Hadoop Hive advanced queries

SELECT basics

1.0 Basic queries

1)select * from table_name;

2)select * from table_name where name='....' limit 1;

1.1 CTEs and nested queries

1)with t as (select....) select * from t;

2)select * from (select....) a; (the alias a is required)
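The two forms above can be sketched side by side. This is only an illustration: the table employee and its columns name, dept and salary are hypothetical and do not come from the original notes.

```sql
-- Hypothetical table: employee(name STRING, dept STRING, salary INT)

-- 1) CTE: name the intermediate result once, then query it.
WITH high_paid AS (
  SELECT name, dept, salary
  FROM employee
  WHERE salary > 5000
)
SELECT dept, count(*) AS cnt
FROM high_paid
GROUP BY dept;

-- 2) The same logic as a nested query; the subquery carries the alias a.
SELECT a.dept, count(*) AS cnt
FROM (
  SELECT name, dept, salary
  FROM employee
  WHERE salary > 5000
) a
GROUP BY a.dept;
```

Both statements should return the same rows; the CTE form tends to read better once the intermediate result is reused or the nesting gets deep.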

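1.2 Selecting columns with a regular expression

Before running the query: SET hive.support.quoted.identifiers = none;

After that, columns can be matched by pattern (the pattern must be wrapped in backticks): SELECT `^o.*` FROM offers;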
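A minimal sketch of the regex column selection described above. The offers table comes from the notes; the column name offer_id is an assumption used only to show an exclusion pattern.

```sql
-- Backquoted names are only interpreted as Java regular expressions
-- when quoted identifiers are disabled.
SET hive.support.quoted.identifiers = none;

-- Every column whose name starts with "o".
SELECT `^o.*` FROM offers;

-- Every column except offer_id (hypothetical column name).
SELECT `(offer_id)?+.+` FROM offers;

-- Restore the default behaviour afterwards.
SET hive.support.quoted.identifiers = column;
```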

1.3 Virtual Columns

Input file name: select INPUT__FILE__NAME from emps;

Global file position: select BLOCK__OFFSET__INSIDE__FILE from emps; (the double underscores are part of the fixed column names)
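As a small illustration of how the two virtual columns might be used together on the emps table from the notes (a sketch, not taken from the original post):

```sql
-- Show which file and which offset each row was read from.
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM emps
LIMIT 10;

-- Count rows per underlying file, e.g. to spot unevenly sized files.
SELECT INPUT__FILE__NAME, count(*) AS row_cnt
FROM emps
GROUP BY INPUT__FILE__NAME;
```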

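In a join we put the small table first; the table that follows is called the base table.

In a correlated query, each row of the outer table is checked against the rows of the subquery table. The existence check uses the keywords exists (not exists), comparable to the MySQL keywords in (not in).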

select * from userinfos u where userid not in(select b.userid from bankcards b where u.userid=b.userid group by userid);
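The same "users without a bank card" question can also be written with not exists, or as a left join that keeps only unmatched rows. This is a sketch that assumes only the userid columns already used in the query above.

```sql
-- NOT EXISTS with a correlated predicate.
SELECT u.*
FROM userinfos u
WHERE NOT EXISTS (
  SELECT b.userid
  FROM bankcards b
  WHERE b.userid = u.userid
);

-- Equivalent anti-join: keep the rows of u that found no match in b.
SELECT u.*
FROM userinfos u
LEFT JOIN bankcards b ON u.userid = b.userid
WHERE b.userid IS NULL;
```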

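Hive JOIN and MAPJOIN (inner and outer joins)

First enable automatic map-join conversion: set hive.auto.convert.join=true;

join -> equivalent to inner join

left join -> returns every row of the left table; columns from the right table are NULL where there is no match

right join -> returns every row of the right table; columns from the left table are NULL where there is no match

full join -> returns every row of both tables

MAPJOIN is not supported:

1) after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY

2) before UNION, JOIN, or another MAPJOIN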
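The join types listed above can be sketched on the userinfos and bankcards tables used earlier. Treating bankcards as the small table is an assumption made for the example, and only the userid column is taken from the notes.

```sql
-- Let Hive convert joins with a small table into map joins automatically.
SET hive.auto.convert.join = true;

-- Inner join: only users that have at least one card.
SELECT u.userid, b.userid AS card_userid
FROM userinfos u
JOIN bankcards b ON u.userid = b.userid;

-- Left join: every user; card_userid is NULL for users without a card.
SELECT u.userid, b.userid AS card_userid
FROM userinfos u
LEFT JOIN bankcards b ON u.userid = b.userid;

-- Full join: every row of both tables, matched where possible.
SELECT u.userid, b.userid AS card_userid
FROM userinfos u
FULL JOIN bankcards b ON u.userid = b.userid;

-- Explicit hint form; it is honoured only when hive.ignore.mapjoin.hint is false.
SELECT /*+ MAPJOIN(b) */ u.userid, b.userid AS card_userid
FROM userinfos u
JOIN bankcards b ON u.userid = b.userid;
```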

Hive set operations (UNION)

1) UNION ALL: keeps duplicate rows after merging

2) UNION: removes duplicate rows after merging
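A short sketch of the difference, reusing the userid columns of userinfos and bankcards from earlier and assuming they have compatible types:

```sql
-- UNION ALL: a userid that appears in both tables (or several times
-- in bankcards) shows up once per occurrence.
SELECT userid FROM userinfos
UNION ALL
SELECT userid FROM bankcards;

-- UNION: the same inputs, but each distinct userid appears only once.
-- UNION without ALL requires Hive 1.2.0 or later.
SELECT userid FROM userinfos
UNION
SELECT userid FROM bankcards;
```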

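Loading data: LOAD moves data

1) load data local inpath '......' overwrite into table.....

2) load data local inpath '.......' overwrite into table.... partition (column=value)

!!! Without local, the path is a location in HDFS

!!! LOCAL means the file is on the local filesystem; OVERWRITE means existing data is replaced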
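The variants above might look like this in practice. All paths and the table names employee and employee_partitioned are hypothetical placeholders.

```sql
-- Local file: the file is copied from the client machine into the table
-- directory, replacing whatever the table held before.
LOAD DATA LOCAL INPATH '/tmp/employee.txt'
OVERWRITE INTO TABLE employee;

-- HDFS file (no LOCAL): the file is moved, so it disappears from the
-- source path. Without OVERWRITE the data is appended.
LOAD DATA INPATH '/user/hive/staging/employee.txt'
INTO TABLE employee;

-- Loading into one partition of a partitioned table.
LOAD DATA LOCAL INPATH '/tmp/employee_2018_09.txt'
OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2018, month=9);
```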

Loading data: inserting into tables with INSERT (2)

1) Single insert (insert rows selected from one table into another)

from ctas_employee

insert overwrite table .....select '....'

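!!! The selected columns must match the target table's columns in number and type; otherwise the data is not inserted correctly

2) Multi-insert (each overwrite table clause names a different target table)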

from ctas_employee

insert overwrite table employee select *

insert overwrite table employee_internal select *;

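!!! Do not put a ; at the end of the first insert clause; that way several inserts run as a single statement

3) Insert into partitions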

from ctas_patitioned

insert overwrite table employee PARTITION (year, month)

select *,'2018','09';

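!!! For a static partition insert, the values must be given in the partition clause (year=2018,month=9)

!!! For a dynamic partition insert, no values are given in the partition clause; if the source lacks the partition columns, simply append the values at the end of the select list.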
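The static and dynamic variants can be contrasted as follows. This is a sketch: employee_partitioned is a hypothetical table partitioned by (year, month), and ctas_employee is assumed not to contain those two columns.

```sql
-- Static partition insert: the partition values are fixed in the clause,
-- so the select list contains only the regular columns.
FROM ctas_employee
INSERT OVERWRITE TABLE employee_partitioned PARTITION (year=2018, month=9)
SELECT *;

-- Dynamic partition insert: the partition values come from the last
-- columns of the select list. When every partition column is dynamic,
-- nonstrict mode has to be enabled.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

FROM ctas_employee
INSERT OVERWRITE TABLE employee_partitioned PARTITION (year, month)
SELECT *, '2018', '09';
```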

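Using INSERT to write/export data to files

-- Insert into a local file, an HDFS file, and a table from the same data source (the key point is the shared source)

from ctas_employee (the shared from clause)

Local: insert overwrite local directory '/tmp/out1' select *;

HDFS: insert overwrite directory '/tmp/out1' select *;

Table: insert overwrite table employee_internal select *;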
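Put together, the three targets can be written from one scan of the source. The delimiter in the second statement is an illustrative choice, not something the original notes specify.

```sql
-- One pass over ctas_employee feeding three targets; only the last
-- clause ends with a semicolon.
FROM ctas_employee
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out1' SELECT *
INSERT OVERWRITE DIRECTORY '/tmp/out1' SELECT *
INSERT OVERWRITE TABLE employee_internal SELECT *;

-- Standalone export to a local directory as comma-separated text.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM ctas_employee;
```

Note that OVERWRITE replaces the existing contents of the target directory, so it is safer to point it at a dedicated output path.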

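Hive data exchange: IMPORT/EXPORT

1) Export data with export

export table table_name to 'hdfs_path';

export table table_name partition (year=..., month=...) to 'hdfs_path'; (the year/month partition must contain data, otherwise nothing is produced for the new table)

import table table_name from 'previously exported path';

import table old_table from 'previously exported path'; (when importing into an existing table, the partition being imported must not already exist in it)

Drop a partition: alter table uu drop partition(year=2017,month=12);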
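A possible round trip, shown only as a sketch: the table names and the /tmp/export paths are hypothetical, and the partition values mirror the ones used elsewhere in these notes.

```sql
-- Export a whole table (data plus metadata) to an HDFS directory.
EXPORT TABLE employee TO '/tmp/export/employee';

-- Export a single partition.
EXPORT TABLE employee_partitioned PARTITION (year=2018, month=9)
TO '/tmp/export/employee_2018_09';

-- Import into a new table; it is created from the exported metadata.
IMPORT TABLE employee_imported FROM '/tmp/export/employee';

-- Import a partition back into the existing partitioned table. If the
-- partition is already present, drop it first.
ALTER TABLE employee_partitioned DROP IF EXISTS PARTITION (year=2018, month=9);
IMPORT TABLE employee_partitioned PARTITION (year=2018, month=9)
FROM '/tmp/export/employee_2018_09';
```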

Hive sorting: ORDER BY

select * from table_name order by column_name;

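Hive sorting: SORT BY/DISTRIBUTE BY

!!! Key point: set the number of reducers: set mapred.reduce.tasks = 15 (the sort result depends on the number of reducers)

1) sort by (sorts the data within each reducer)

Set the number of reducers: set mapred.reduce.tasks = 1

Only with a single reducer is the whole result guaranteed to be in sorted order

With 2 reducers the data is sorted in two separate segments (the output consists of two independently sorted runs)

2) distribute by (similar to group by)

It works like grouping first: rows with the same key are sent to the same reducer. It is used together with sort by ... desc, and distribute by is written before sort by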
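A minimal sketch of sort by and distribute by, using the employee_hr table and the name and employee_id columns that appear in the cluster by example of these notes:

```sql
-- Two reducers, so SORT BY alone yields two independently sorted runs.
SET mapred.reduce.tasks = 2;

-- Sorted within each reducer only; not a global order.
SELECT name, employee_id
FROM employee_hr
SORT BY name DESC;

-- Rows with the same employee_id go to the same reducer, and each
-- reducer then sorts its rows by name in descending order.
SELECT name, employee_id
FROM employee_hr
DISTRIBUTE BY employee_id
SORT BY name DESC;
```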

Hive sorting: CLUSTER BY

3) cluster by = distribute by + sort by (on the same column)

SELECT name, employee_id FROM employee_hr CLUSTER BY name;

To make full use of all reducers for a global sort, first use CLUSTER BY and then ORDER BY
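The equivalence and the two-step global sort can be sketched as follows. The nested form is one way to read "CLUSTER BY first, then ORDER BY"; whether it actually speeds things up depends on the Hive version and the data.

```sql
-- CLUSTER BY name ...
SELECT name, employee_id FROM employee_hr CLUSTER BY name;

-- ... is shorthand for distributing and sorting on the same column.
SELECT name, employee_id
FROM employee_hr
DISTRIBUTE BY name
SORT BY name;

-- Pre-sort in parallel across the reducers, then apply the global order.
SELECT t.name, t.employee_id
FROM (
  SELECT name, employee_id
  FROM employee_hr
  CLUSTER BY name
) t
ORDER BY t.name;
```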

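Example 1: (handling data skew)

Set the number of reducers: set mapred.reduce.tasks = 15

1. Joining a large table with a small table

    1. MapReduce

        1. cacheFile

        2. groupCombinapartition

    2. Handling in Hive

        1. map join: set hive.auto.convert.join=true (small tables up to 25M)

2. Part of the data is comparatively very small

    1. MapReduce

        1. groupCombinapartition

    2. Hive

        1. Partitioned table: partition by (year,month)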
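The two Hive-side measures in the outline might be set up like this. It is a sketch under stated assumptions: the table employee_by_month and its columns are hypothetical, and the 25M figure is taken to refer to hive.mapjoin.smalltable.filesize, whose default is 25000000 bytes.

```sql
-- Large table joined with a small table: let Hive load the small side
-- into memory and join on the map side, avoiding a skewed reduce phase.
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;

-- Unevenly distributed data: store it in a table partitioned by
-- year and month so that queries touch only the partitions they need.
CREATE TABLE IF NOT EXISTS employee_by_month (
  name STRING,
  employee_id INT
)
PARTITIONED BY (year INT, month INT);

-- Partition pruning: only the 2018/9 partition is read.
SELECT name, employee_id
FROM employee_by_month
WHERE year = 2018 AND month = 9;
```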

posted on 2019-07-23 23:32 来勒
