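Common Big Data Computing Techniques: Spark Basic Syntax [Notes]

After a few days of self-study, I've found that Scala is a purely object-oriented language (OOP): every operator is a method, and every value is an object.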

=================================================================

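Module 0: Data types

1. Primitive data types

Highlight: time-granularity conversion

The date type can only be explicitly converted (cast) to and from date, timestamp, and string.

2. Complex data types

(1) array example

Create a table “array_test” whose “id” column is defined as “array<int>”, then load the existing text file “array_test.txt” into “array_test”. The steps are as follows:

1. Create the table.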

create table array_test(name string, id array<int>)

row format delimited

fields terminated by ','

collection items terminated by ':';

2. Load the data.

The file “array_test.txt” is located at “/opt/array_test.txt” and contains the following:

100,1:2:3:4

101,5:6

102,7:8:9:10

Run the following command to load the data:

load data local inpath '/opt/array_test.txt' into table array_test;

3. Query the results.

Query all data in the “array_test” table:

select * from array_test;

100 [1,2,3,4]

101 [5,6]

102 [7,8,9,10]

Query element 0 of the id array in the “array_test” table:

select id[0] from array_test;
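To make the two delimiters concrete, here is a minimal Python sketch (a model of the idea, not Spark or Hive itself) of how one line of “array_test.txt” splits under fields terminated by ',' and collection items terminated by ':'. The function name parse_array_row is hypothetical and only used for illustration.

```python
def parse_array_row(line, field_sep=",", item_sep=":"):
    """Split one text row into (name, id), mimicking the table's delimiters."""
    name, raw_ids = line.split(field_sep, 1)
    ids = [int(x) for x in raw_ids.split(item_sep)]
    return name, ids


rows = ["100,1:2:3:4", "101,5:6", "102,7:8:9:10"]
table = [parse_array_row(r) for r in rows]

# Rough equivalent of: select id[0] from array_test;
first_ids = [ids[0] for _, ids in table]
print(first_ids)  # [1, 5, 7]
```

Array indexes start at 0, so id[0] is the first element of each row's array.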

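(2) map example

Create a table “map_test” whose “score” column is defined as “map<string,int>”, then load the existing text file “map_test.txt” into “map_test”. The steps are as follows:

1. Create the table.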

create table map_test(id string, score map<string,int>)

row format delimited

fields terminated by '|'

collection items terminated by ','

map keys terminated by ':';

2. Load the data.

The file “map_test.txt” is located at “/opt/map_test.txt” and contains the following:

1|math:90,english:89,physics:86

2|math:88,english:90,physics:92

3|math:93,english:93,physics:83

Run the following command to load the data:

load data local inpath '/opt/map_test.txt' into table map_test;

3. Query the results.

Query all data in the “map_test” table:

select * from map_test;

1 {"english":89,"math":90,"physics":86}

2 {"english":90,"math":88,"physics":92}

3 {"english":93,"math":93,"physics":83}

Query the math scores in the “map_test” table:

select id, score['math'] from map_test;

1 90

2 88

3 93
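The map table uses three delimiters at once, which is easy to mix up. The sketch below models them in plain Python under the assumption that the text file is split exactly as the DDL says: '|' between fields, ',' between map entries, ':' between key and value. parse_map_row is a made-up helper name, not a Spark API.

```python
def parse_map_row(line, field_sep="|", item_sep=",", key_sep=":"):
    """Split one text row into (id, score) using the table's three delimiters."""
    id_, raw_items = line.split(field_sep, 1)
    score = {}
    for item in raw_items.split(item_sep):
        key, value = item.split(key_sep, 1)
        score[key] = int(value)
    return id_, score


rows = [
    "1|math:90,english:89,physics:86",
    "2|math:88,english:90,physics:92",
    "3|math:93,english:93,physics:83",
]

# Rough equivalent of: select id, score['math'] from map_test;
math_scores = [(id_, score["math"]) for id_, score in map(parse_map_row, rows)]
print(math_scores)  # [('1', 90), ('2', 88), ('3', 93)]
```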

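(3) struct example

Create a table “struct_test” whose “info” column is defined as “struct<name:string, age:int>”, then load the existing text file “struct_test.txt” into “struct_test”. The steps are as follows:

1. Create the table.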

create table struct_test(id int, info struct<name:string,age:int>)

row format delimited

fields terminated by ','

collection items terminated by ':';

2. Load the data.

The file “struct_test.txt” is located at “/opt/struct_test.txt” and contains the following:

1,lily:26

2,sam:28

3,mike:31

4,jack:29

Run the following command to load the data:

load data local inpath '/opt/struct_test.txt' into table struct_test;

3. Query the results.

Query all data in the “struct_test” table:

select * from struct_test;

1 {"name":"lily","age":26}

2 {"name":"sam","age":28}

3 {"name":"mike","age":31}

4 {"name":"jack","age":29}

Query the name and age in the “struct_test” table:

select id,info.name,info.age from struct_test;

1 lily 26

2 sam 28

3 mike 31

4 jack 29

=================================================================

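Module 1: Identifiers

1.1 aggregate_func

Aggregate function.

1.2 alias

Alias. Can be assigned to a column, table, view, or subquery; string type only.

1.3 attr_expr

Attribute expression.

1.4 attrs_value_set_expr

Set of attribute values.

1.5 class_name

Name of the class a function depends on. Note that the class name must include the full package path of the class.

1.6 col

Formal parameter of a function call; usually a column name, same as col_name.

1.7 col_comment

Description of a column (field); string type only.

1.8 col_name

Column name, i.e. field name; string type only. The name generally cannot exceed 128 bytes.

1.9 col_name_list

List of columns, made up of one or more col_name entries separated by commas.

1.10 col_value

Column value, i.e. field value.

1.11 col_value_list

List of column values, made up of one or more col_value entries separated by commas.

1.12 condition

Logical condition.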

=======================

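Relational operators:

> greater than

>= greater than or equal to

< less than

<= less than or equal to

= equal to

<> not equal to

is is (as in “is null”)

is not is not (as in “is not null”)

=======================

const_null Constant: the null value.

like Relational operator used for wildcard matching.

pattern_string Pattern-matching string; supports wildcard matching.

When filtering with where ... like, the SQL wildcards “%” and “_” are supported:

“%” matches zero or more characters,

“_” matches exactly one character.

Wildcards are not supported in create table like.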
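As a way to check my understanding of the two wildcards, here is a small Python sketch that translates a like pattern into a regular expression. It follows standard SQL semantics, where “%” matches any run of characters (including none) and “_” matches exactly one. It ignores escape characters, and the names like_to_regex and sql_like are my own, not part of Spark.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled Python regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


print(sql_like("spark", "sp%"))    # True
print(sql_like("spark", "sp_rk"))  # True
print(sql_like("spark", "sp_k"))   # False
```

Note that a literal “.” in the pattern stays literal, which is why every non-wildcard character is escaped before being handed to the regex engine.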

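attr_expr Attribute expression

attrs_value_set_expr Set of attribute values

in Keyword used to test whether an attribute is in a set

const_string String constant

const_int Integer constant

( Start of a constant set

) End of a constant set

, Comma separator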

=======================

1.13 condition_list

List of logical conditions

=======================

Logical operators:

and logical AND

or logical OR

not logical NOT

=======================

( Start of a sub-condition

) End of a sub-condition

=======================

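1.14 cte_name

Name of the common table expression (CTE)

1.15 data_type

Data type, covering both primitive and complex data types

1.16 db_comment

Description of the database; string type only

1.17 db_name

Database name; string type only. The name generally cannot exceed 128 bytes

1.18 else_result_expression

Result returned after else in a case when statement

1.19 file_format

Data format. The ones I've run into so far:

1) textfile: the default format. Data is not compressed, so both disk overhead and parsing overhead are high;

2) inputformat input_format_classname outputformat output_format_classname

inputformat specifies the input format and outputformat specifies the output format.

Usage: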

create table student (id int, name string)

stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'

outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

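1.20 file_path

File path; may be either a local path or an HDFS path

1.21 function_name

Function name; string type only

1.22 having_condition

Logical condition of the having clause; its common forms are the same as for condition (1.12)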

=======================

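and logical AND

or logical OR

not logical NOT

( Start of a sub-condition

) End of a sub-condition

const_set Set of constants, with elements separated by commas

in Keyword used to test whether an attribute is in a set

attrs_value_set_expr Set of attribute values, as in 1.12 above

attr_expr Attribute expression (see its syntax specification)

Equality and inequality Equality and inequality comparisons

pattern_string As in 1.12 above

like Relational operator used for wildcard matching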

=======================

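1.23 hdfs_path

HDFS path, e.g. “hdfs:///tmp”

1.24 input_expression

Input expression of case when

1.25 join_condition

Logical condition of a join

1.26 number

Number of rows that limit restricts the output to; int type only

1.27 num_buckets

Number of buckets; int type only

1.28 partition_col_name

Partition column name, i.e. the name of the partition field; string type only

1.29 partition_col_value

Partition column value, i.e. the value of the partition field

1.30 partition_specs

List of a table's partitions, written as key=value pairs, where key is partition_col_name and value is partition_col_value. If there are multiple partition fields, the key=value pairs are separated by commas.

Format:

partition_specs :: (partition_col_name = partition_col_value, partition_col_name = partition_col_value, ...);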
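To show what the format above looks like with real values, here is a tiny Python helper that renders a dict as a partition_specs string. The helper name format_partition_specs and the partition columns dt and hour are invented for this example, and the sketch assumes string values contain no quote characters.

```python
def format_partition_specs(specs):
    """Render {'col': value, ...} as (col = value, col = value, ...)."""
    def render(value):
        # String values are quoted; numbers are written as-is.
        if isinstance(value, str):
            return "'" + value + "'"
        return str(value)

    body = ", ".join(f"{name} = {render(value)}" for name, value in specs.items())
    return f"({body})"


spec = format_partition_specs({"dt": "2017-07-26", "hour": 20})
print(spec)  # (dt = '2017-07-26', hour = 20)
```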

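1.31 property_name

Property name; string type only

1.32 property_value

Property value; string type only

1.33 regex_expression

Pattern-matching string; supports wildcard matching. In show functions, wildcards can be used for matching; “.” and “.*” are currently supported, where “.” matches exactly one character and “.*” matches zero or more characters.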

1.34 result_expression

Result returned after then in a case when statement

1.35 role_name

Role name; string type only

1.36 row_format

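1.37 select_statement

Basic select statement, i.e. a query

1.38 separator

Separator; char type only. User-definable, e.g. comma, semicolon, or colon

1.39 sql_containing_cte_name

SQL statement that contains the common table expression defined by cte_name

1.40 table_comment

Description of the table; string type only

1.41 table_name

Table name; string type only. The name generally cannot exceed 128 bytes

1.42 table_properties

List of table properties, written as key=value pairs, where key is property_name and value is property_value. The key=value pairs in the list are separated by commas

1.43 table_reference

Name of a table or view; string type only. May also be a subquery, in which case an alias is required

1.44 user_name

User name; string type only

1.45 view_name

View name; string type only. The name generally cannot exceed 128 bytes

1.46 view_properties

List of view properties, written as key=value pairs, where key is property_name and value is property_value. The key=value pairs in the list are separated by commas

1.47 when_expression

The when expression of a case when statement; it is matched against the input expression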

1.48 where_condition

Logical condition of the where clause

1.49 window_function

Analytic window function

Module 2: Built-in functions

Math functions

round(double a) double

round(double a, int d) double

bround(double a) double

bround(double a, int d) double

floor(double a) bigint

ceil(double a), ceiling(double a) bigint

rand(), rand(int seed) double

exp(double a), exp(decimal a) double

ln(double a), ln(decimal a) double

log10(double a), log10(decimal a) double

log2(double a), log2(decimal a) double

log(double base, double a) double

log(decimal base, decimal a) double

pow(double a, double p), power(double a, double p) double

sqrt(double a), sqrt(decimal a) double

bin(bigint a) string

hex(bigint a), hex(string a), hex(binary a) string

unhex(string a) binary

conv(bigint num, int from_base, int to_base), conv(string num, int from_base, int to_base) string

abs(double a) double

pmod(int a, int b), pmod(double a, double b) int or double

sin(double a), sin(decimal a) double

asin(double a), asin(decimal a) double

cos(double a), cos(decimal a) double

acos(double a), acos(decimal a) double

tan(double a), tan(decimal a) double

atan(double a), atan(decimal a) double

degrees(double a), degrees(decimal a) double

radians(double a), radians(decimal a) double

positive(int a), positive(double a) int or double

negative(int a), negative(double a) int or double

sign(double a), sign(decimal a) double or int

e() double

pi() double

factorial(int a) bigint

cbrt(double a) double

shiftleft(tinyint|smallint|int a, int b) int

shiftleft(bigint a, int b) bigint

shiftright(tinyint|smallint|int a, int b) int

shiftright(bigint a, int b) bigint

shiftrightunsigned(tinyint|smallint|int a, int b) int

shiftrightunsigned(bigint a, int b) bigint

greatest(t v1, t v2, ...) t

least(t v1, t v2, ...) t
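Two pairs in this list are easy to confuse: round versus bround, and the ordinary remainder versus pmod. As far as I understand it, round uses HALF_UP rounding, bround uses HALF_EVEN (banker's rounding), and pmod always returns a non-negative result for a positive divisor. The Python sketch below models those rules with the standard library; the function names sql_round, sql_bround, and pmod are my own stand-ins, not Spark calls.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def _quantize(a, d, rounding):
    step = Decimal(1).scaleb(-d)  # 10 ** -d, e.g. 0.1 for d = 1
    return float(Decimal(str(a)).quantize(step, rounding=rounding))


def sql_round(a, d=0):
    """Model of round(a, d): ties go away from zero (HALF_UP)."""
    return _quantize(a, d, ROUND_HALF_UP)


def sql_bround(a, d=0):
    """Model of bround(a, d): ties go to the even neighbour (HALF_EVEN)."""
    return _quantize(a, d, ROUND_HALF_EVEN)


def pmod(a, b):
    """Model of pmod(a, b): ((a % b) + b) % b with a truncating remainder."""
    r = math.fmod(a, b)  # truncating remainder, sign follows a
    return math.fmod(r + b, b)


print(sql_round(2.5), sql_bround(2.5))          # 3.0 2.0
print(sql_round(1.25, 1), sql_bround(1.25, 1))  # 1.3 1.2
print(math.fmod(-7, 3), pmod(-7, 3))            # -1.0 2.0
```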

Date and time functions

from_unixtime(bigint unixtime[, string format]) string

unix_timestamp() bigint

unix_timestamp(string date) bigint

unix_timestamp(string date, string pattern) bigint

to_date(string timestamp) string

year(string date) int

quarter(date/timestamp/string) int

month(string date) int

day(string date), dayofmonth(date) int

hour(string date) int

minute(string date) int

second(string date) int

weekofyear(string date) int

datediff(string enddate, string startdate) int

date_add(string startdate, int days) string

date_sub(string startdate, int days) string

from_utc_timestamp(timestamp, string timezone) timestamp

to_utc_timestamp(timestamp, string timezone) timestamp

current_date date

current_timestamp timestamp

add_months(string start_date, int num_months) string

last_day(string date) string

next_day(string start_date, string day_of_week) string

trunc(string date, string format) string

months_between(date1, date2) double

date_format(date/timestamp/string ts, string fmt) string
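The argument order of datediff (end date first) trips me up regularly, so here is a small Python model of datediff, date_add, and last_day built on the standard datetime module. It assumes dates are given as 'yyyy-MM-dd' strings and only mirrors the signatures listed above; it is not how Spark implements them.

```python
import calendar
from datetime import date, timedelta


def datediff(enddate, startdate):
    """Number of days from startdate to enddate (end date comes first)."""
    return (date.fromisoformat(enddate) - date.fromisoformat(startdate)).days


def date_add(startdate, days):
    """Date that is `days` days after startdate; negative days go backwards."""
    return (date.fromisoformat(startdate) + timedelta(days=days)).isoformat()


def last_day(day):
    """Last day of the month that `day` falls in."""
    parsed = date.fromisoformat(day)
    days_in_month = calendar.monthrange(parsed.year, parsed.month)[1]
    return parsed.replace(day=days_in_month).isoformat()


print(datediff("2017-07-26", "2017-07-01"))  # 25
print(date_add("2017-07-26", 7))             # 2017-08-02
print(last_day("2016-02-10"))                # 2016-02-29
```

In this model date_sub(startdate, days) is simply date_add(startdate, -days).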

String functions

ascii(string str) int

base64(binary bin) string

concat(string|binary a, string|binary b...) string

concat_ws(string sep, string a, string b...) string

concat_ws(string sep, array<string>) string

decode(binary bin, string charset) string

encode(string src, string charset) binary

find_in_set(string str, string strlist) int

format_number(number x, int d) string

get_json_object(string json_string, string path) string

in_file(string str, string filename) boolean

instr(string str, string substr) int

length(string a) int

locate(string substr, string str[, int pos]) int

lower(string a), lcase(string a) string

lpad(string str, int len, string pad) string

ltrim(string a) string

parse_url(string urlstring, string parttoextract [, string keytoextract]) string

printf(string format, obj... args) string

regexp_extract(string subject, string pattern, int index) string

regexp_replace(string a, string b, string c) string

repeat(string str, int n) string

reverse(string a) string

rpad(string str, int len, string pad) string

rtrim(string a) string

sentences(string str, string lang, string locale) array<array< string >>

space(int n) string

split(string str, string pat) array

str_to_map(text[, delimiter1, delimiter2]) map< string, string >

substr(string|binary a, int start), substring(string|binary a, int start) string

substr(string|binary a, int start, int len), substring(string|binary a, int start, int len) string

substring_index(string a, string delim, int count) string

trim(string a) string

unbase64(string str) binary

upper(string a), ucase(string a) string

initcap(string a) string

levenshtein(string a, string b) int

soundex(string a) string
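A bare signature does not say much about functions like substring_index and find_in_set, so here is a Python sketch of the behaviour as I understand it: substring_index keeps everything before the count-th delimiter (counting from the right when count is negative), and find_in_set returns the 1-based position of a string in a comma-separated list, or 0 if it is absent. These are plain Python models written for illustration, not Spark code.

```python
def substring_index(a, delim, count):
    """Part of `a` before the count-th `delim`; negative count counts from the right."""
    if count == 0 or delim == "":
        return ""
    parts = a.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])


def find_in_set(s, strlist):
    """1-based position of `s` in a comma-separated list, 0 if not found."""
    if "," in s:  # a search string containing a comma can never match
        return 0
    items = strlist.split(",")
    return items.index(s) + 1 if s in items else 0


print(substring_index("www.apache.org", ".", 2))   # www.apache
print(substring_index("www.apache.org", ".", -2))  # apache.org
print(find_in_set("ab", "abc,b,ab,c,def"))         # 3
```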

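Aggregate functions

An aggregate function computes a single result from a set of input values. For example, the COUNT function counts the number of rows returned by a SQL query.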

count(*), count(expr), count(distinct expr[, expr...]) bigint

sum(col), sum(distinct col) double

avg(col), avg(distinct col) double

min(col) double

max(col) double

variance(col), var_pop(col) double

var_samp(col) double

stddev_pop(col) double

stddev_samp(col) double

covar_pop(col1, col2) double

covar_samp(col1, col2) double

corr(col1, col2) double

percentile(bigint col, p) double

percentile(bigint col, array(p1[, p2]...)) array<double>

percentile_approx(double col, p [, b]) double

percentile_approx(double col, array(p1[, p2]...) [, b]) array<double>

histogram_numeric(col, b) array<struct {'x','y'}>

collect_set(col) array

collect_list(col) array

ntile(int x) int
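The _pop and _samp suffixes above differ only in the divisor: the population versions divide by n, the sample versions by n - 1. A quick check with Python's statistics module, on a made-up list of values, shows the difference; the variable names simply mirror the SQL function names.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # sample data invented for this example

var_pop = statistics.pvariance(values)   # divides by n
var_samp = statistics.variance(values)   # divides by n - 1
stddev_pop = statistics.pstdev(values)   # square root of var_pop
stddev_samp = statistics.stdev(values)   # square root of var_samp

print(var_pop, var_samp)
print(stddev_pop, stddev_samp)
```

For these eight values the squared deviations from the mean (5) sum to 32, so var_pop is 32 / 8 and var_samp is 32 / 7.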

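Analytic window functions

Window functions perform a calculation over a set of values related to the current input value. They include the aggregate functions used with GROUP BY, such as sum, max, min, count, and avg, as well as the following:

first_value(col) same data type as the argument

last_value(col) same data type as the argument

lag (col,n,default) same data type as the argument

lead (col,n,default) same data type as the argument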

row_number() over (order by col_1[,col_2 ...]) int

rank() int

dense_rank() int

ntile(int x) int

cume_dist() double

percent_rank() double
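The ranking functions differ only in how they treat ties, which is easiest to see side by side. The Python sketch below computes row_number, rank, dense_rank, percent_rank, and cume_dist for a single partition ordered ascending. It is a simplified model with invented sample values and a hypothetical helper name rank_columns, not Spark's implementation.

```python
def rank_columns(values):
    """Return one dict per row for a single partition ordered ascending by value."""
    ordered = sorted(values)
    n = len(ordered)
    rows = []
    rank = dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank jumps past tied rows
            dense_rank += 1    # dense_rank never leaves gaps
            previous = value
        rows.append({"value": value, "row_number": row_number,
                     "rank": rank, "dense_rank": dense_rank})
    for row in rows:
        # percent_rank = (rank - 1) / (rows in partition - 1)
        row["percent_rank"] = (row["rank"] - 1) / (n - 1) if n > 1 else 0.0
        # cume_dist = share of rows whose value is <= the current value
        row["cume_dist"] = sum(1 for v in ordered if v <= row["value"]) / n
    return rows


result = rank_columns([86, 92, 83, 86])
for row in result:
    print(row)
```

With the two tied values of 86, rank gives 1, 2, 2, 4 while dense_rank gives 1, 2, 2, 3, and row_number stays 1, 2, 3, 4.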

Posted on 2017-07-26 20:58 by weizhang715
