December 2018 Archive

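Big Data from Beginner to Mastery 18--Using sqoop to import a relational database into HDFS and Hive tables
Summary: 1. Choose a database; this post uses the standard MySQL sakila database: mysql -u root -D sakila -p 2. First, import the table data into HDFS files so that Spark can later process it as a DataFrame or an RDD: sqoop import --connect "jdbc:mysql: ...

posted @ 2018-12-26 14:10 David_Zhu Views(626) Comments(0) Recommendations(0)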
Big Data from Beginner to Mastery 17--Using union all and distinct
Summary: 1. Using union all. Use union all or union: select * from rental where rental_id <10 union all select * from rental where rental_id >30 and rental_id <40 2. disc...

posted @ 2018-12-20 11:36 David_Zhu Views(249) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 16--Conditional statements and aggregate functions in Hive
Summary: 1. Conditional expressions: case when ... then when .... then ... when ... then ... end select film_id,rpad(title,20," "),case when rating in ("G","PG","PG-13") then "YOUNG ...

posted @ 2018-12-12 18:33 David_Zhu Views(825) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 15--Handling the date type in Hive
Summary: 1. Basic date handling: // date handling select current_date; select current_timestamp; // to_date(time); to_date(string) select to_date(current_timestamp); select to_date(rent...

posted @ 2018-12-12 16:46 David_Zhu Views(1399) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 14--String operations in Hive
Summary: 1. Basic operations: concat(string,string,string) concat_ws(string,string,string) select customer_id,concat_ws(" ",first_name,last_name),email,address_id from custome...

posted @ 2018-12-12 14:42 David_Zhu Views(228) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 13--Preparing the MySQL database for later chapters
Summary: We will be using the sakila database extensively in the rest of the course, and it would be great if you can follow the installation process below. ...

posted @ 2018-12-11 18:46 David_Zhu Views(127) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 12--Registering a Spark DataFrame as a Hive temporary table
Summary: 1. Get the raw data and build a DataFrame: val ny= sc.textFile("data/new_york/") val header=ny.first val filterNY =ny.filter(listing=>{ listing.split(",").size==14 && listin...

posted @ 2018-12-11 13:52 David_Zhu Views(1070) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 11-Basic Spark DataFrame operations
Summary: // dataframe is the topic 1. Get the base data. First load the data as an RDD: val ny= sc.textFile("data/new_york/") val header=ny.first val filterNY =ny.filter(listing=>{ listing.sp...

posted @ 2018-12-10 12:03 David_Zhu Views(285) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 10--Using groupByKey on Spark RDDs
Summary: //groupbykey 1. Prepare the data: val flights=sc.textFile("data/Flights/flights.csv") val sampleFlights=sc.parallelize(flights.take(1000)) val header=sampleFlights.fir...

posted @ 2018-12-07 17:10 David_Zhu Views(2718) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 9-A real wordcount
Summary: This chapter implements a real wordcount program in Spark. 1. Get a dataset from the local filesystem: val speechRdd= sc.parallelize(scala.io.Source.fromFile("/home/hdfs/Data/WordCount/speech").getLines.toList) ...

posted @ 2018-12-06 14:22 David_Zhu Views(173) Comments(0) Recommendations(0)
Big Data from Beginner to Mastery 8-Map-reduce operations on Spark RDDs with composite keys and composite values
Summary: 1. Prepare the base data; this time we use the flights data. scala> val flights= sc.textFile("/user/hdfs/data/Flights/flights.csv") flights: org.apache.spark.rdd.RDD[String] = /user/hdfs/...

posted @ 2018-12-03 14:47 David_Zhu Views(296) Comments(0) Recommendations(0)