CDH Certification Training Series - Post Category - David_Zhu

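Big Data from Beginner to Mastery 19 -- Importing MySQL Data into Hive

Summary: 1. Standard import by database and table. Earlier posts covered importing into Hive tables through the underlying files, or importing directly into HDFS; this post works from the higher level, using the Hive database and table commands. sqoop import --connect "jdbc:mysql://host03.xyy:3306/sakila" ...

posted @ 2019-01-02 17:01 David_Zhu Views(1661) Comments(0) Recommended(0)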

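Big Data from Beginner to Mastery 18 -- Using sqoop to Import a Relational Database into HDFS and Hive Tables

Summary: 1. Choose a database; the standard MySQL sakila database is used here. mysql -u root -D sakila -p 2. First, try importing the table data into HDFS files, so that it can later be processed with Spark as a DataFrame or an RDD. sqoop import --connect "jdbc:mysql: ...

posted @ 2018-12-26 14:10 David_Zhu Views(641) Comments(0) Recommended(0)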

Big Data from Beginner to Mastery 17 -- Using union all and distinct

Summary: 1. Using union all. Use either union all or union: select * from rental where rental_id <10 union all select * from rental where rental_id >30 and rental_id <40 2. disc...

posted @ 2018-12-20 11:36 David_Zhu Views(261) Comments(0) Recommended(0)
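
The difference between union all and union can be sketched with plain Scala collections standing in for the two query results. This is only an analogy, not Hive itself: the Rental case class and the sample rows are hypothetical, and one row is deliberately repeated so the effect is visible (the two rental_id ranges in the post's query do not overlap).

```scala
object UnionDemo {
  // Hypothetical stand-in for a row of the rental table.
  final case class Rental(rentalId: Int, customerId: Int)

  val first: List[Rental]  = List(Rental(1, 130), Rental(2, 459))
  val second: List[Rental] = List(Rental(2, 459), Rental(31, 369))

  // UNION ALL: concatenate both result sets and keep duplicate rows.
  val unionAll: List[Rental] = first ++ second

  // UNION: same concatenation, then drop duplicate rows.
  val union: List[Rental] = (first ++ second).distinct

  def main(args: Array[String]): Unit = {
    println(s"union all -> ${unionAll.size} rows")
    println(s"union     -> ${union.size} rows")
  }
}
```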

Big Data from Beginner to Mastery 16 -- Hive Conditional Statements and Aggregate Functions

Summary: 1. Conditional expressions: case when ... then when .... then ... when ... then ... end select film_id,rpad(title,20," "),case when rating in ("G","PG","PG-13") then "YOUNG...

posted @ 2018-12-12 18:33 David_Zhu Views(866) Comments(0) Recommended(0)

Big Data from Beginner to Mastery 14 -- String Operations in Hive

Summary: 1. Basic operations: concat(string,string,string) concat_ws(string,string,string) select customer_id,concat_ws(" ",first_name,last_name),email,address_id from custome...

posted @ 2018-12-12 14:42 David_Zhu Views(251) Comments(0) Recommended(0)

Big Data from Beginner to Mastery 13 -- Preparing the MySQL Database for Later Posts

Summary: We will be using the sakila database extensively throughout the rest of the course, so it would be great if you could follow the installation process below.

posted @ 2018-12-11 18:46 David_Zhu Views(146) Comments(0) Recommended(0)

Big Data from Beginner to Mastery 12 -- Registering a Spark DataFrame as a Hive Temporary Table

Summary: 1. Load the initial data and build a DataFrame. val ny= sc.textFile("data/new_york/") val header=ny.first val filterNY =ny.filter(listing=>{ listing.split(",").size==14 && listin...

posted @ 2018-12-11 13:52 David_Zhu Views(1113) Comments(0) Recommended(0)

Big Data from Beginner to Mastery 11 - Basic Spark DataFrame Operations

Summary: // dataframe is the topic 1. Load the base data, first as an RDD. val ny= sc.textFile("data/new_york/") val header=ny.first val filterNY =ny.filter(listing=>{ listing.sp...

posted @ 2018-12-10 12:03 David_Zhu Views(340) Comments(0) Recommended(0)

Big Data from Beginner to Mastery 10 -- Using groupByKey on Spark RDDs

Summary: //groupbykey 1. Prepare the data. val flights=sc.textFile("data/Flights/flights.csv") val sampleFlights=sc.parallelize(flights.take(1000)) val header=sampleFlights.fir...

posted @ 2018-12-07 17:10 David_Zhu Views(2751) Comments(0) Recommended(0)
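
As a rough illustration of what grouping by key produces, the sketch below uses a plain Scala collection in place of an RDD; on a pair RDD the corresponding call would be groupByKey. The (airline, delay) pairs are made-up sample data, not rows from flights.csv.

```scala
object GroupByKeyDemo {
  // Hypothetical (airline, delayMinutes) pairs.
  val pairs: List[(String, Int)] =
    List(("AA", 10), ("DL", 5), ("AA", 20), ("DL", 0), ("UA", 7))

  // Collect all values that share a key, similar to groupByKey on a pair RDD.
  val grouped: Map[String, List[Int]] =
    pairs.groupBy(_._1).map { case (airline, kvs) => airline -> kvs.map(_._2) }

  def main(args: Array[String]): Unit =
    grouped.toSeq.sortBy(_._1).foreach { case (k, vs) => println(s"$k -> $vs") }
}
```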

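Big Data from Beginner to Mastery 9 - A Real wordcount

Summary: This chapter implements a real wordcount program in Spark. 1. Load a dataset from the local file system. val speechRdd= sc.parallelize(scala.io.Source.fromFile("/home/hdfs/Data/WordCount/speech").getLines.toList) ...

posted @ 2018-12-06 14:22 David_Zhu Views(188) Comments(0) Recommended(0)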
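
The counting step of a word count can be sketched without a cluster. The version below is a minimal sketch on a plain Scala List rather than on speechRdd; the input lines are hypothetical, and lowercasing the words is an assumption of this sketch, not something the summary states. On an RDD the usual chain is flatMap, map to (word, 1), then reduceByKey(_ + _).

```scala
object WordCountDemo {
  // Hypothetical input lines; the post reads them from a local speech file.
  val lines: List[String] = List("the quick brown fox", "the lazy dog", "The end")

  // Split into words, normalise case, then count occurrences per word.
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy { case (w, c) => (-c, w) }.foreach(println)
}
```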

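Big Data from Beginner to Mastery 8 - Map and Reduce Operations on Spark RDDs with Composite Keys and Composite Values

Summary: 1. Prepare the base data; this time the flights data is used. scala> val flights= sc.textFile("/user/hdfs/data/Flights/flights.csv") flights: org.apache.spark.rdd.RDD[String] = /user/hdfs/...

posted @ 2018-12-03 14:47 David_Zhu Views(327) Comments(0) Recommended(0)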
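
A composite key and a composite value are both just tuples. The sketch below shows the idea on a plain Scala List instead of the flights RDD; the Flight case class, its fields, and the sample rows are hypothetical and do not reflect the real flights.csv layout.

```scala
object CompositeKeyDemo {
  // Hypothetical flight record; (origin, dest) serves as the composite key.
  final case class Flight(origin: String, dest: String, delay: Int)

  val flights: List[Flight] = List(
    Flight("JFK", "LAX", 10),
    Flight("JFK", "LAX", 30),
    Flight("SFO", "SEA", 5)
  )

  // Map each record to ((origin, dest), (delaySum, count)), then merge the
  // value tuples per key. reduceByKey on a pair RDD takes the same merge function.
  val totals: Map[(String, String), (Int, Int)] =
    flights
      .map(f => ((f.origin, f.dest), (f.delay, 1)))
      .groupBy(_._1)
      .map { case (key, kvs) =>
        key -> kvs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      }

  // Average delay per route, derived from the (sum, count) value.
  val averages: Map[(String, String), Double] =
    totals.map { case (key, (sum, n)) => key -> sum.toDouble / n }
}
```

Carrying (sum, count) as the value is what makes an average computable in a single reduce pass, since averages themselves cannot be merged pairwise.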

Big Data from Beginner to Mastery 7 -- reduceByKey on Composite Values

Summary: Training series 7 -- reduce on composite values. 1. Prepare the base data. val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv") val header= collegesRdd.first val headerlessRdd= colle...

posted @ 2018-11-23 16:48 David_Zhu Views(251) Comments(0) Recommended(0)

Big Data from Beginner to Mastery 6 --- How to Use reduceByKey on Spark RDDs

Summary: 1. Prepare the data (same as in the previous chapters). val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv") val header= collegesRdd.first val headerlessRdd= collegesRdd.filter( l...

posted @ 2018-11-23 11:59 David_Zhu Views(391) Comments(0) Recommended(0)
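
What reduceByKey computes can be mimicked on a local collection. The helper below is a hypothetical local re-implementation for illustration only (Spark's own method lives on pair RDDs), and the (state, enrollment) pairs are invented sample data rather than values from CollegeNavigator.csv.

```scala
object ReduceByKeyDemo {
  // Hypothetical (state, enrollment) pairs.
  val pairs: List[(String, Int)] =
    List(("NY", 1200), ("CA", 800), ("NY", 300), ("CA", 200), ("TX", 500))

  // Local stand-in for reduceByKey: merge all values of a key with f.
  def reduceByKey[K, V](kvs: List[(K, V)])(f: (V, V) => V): Map[K, V] =
    kvs.groupBy(_._1).map { case (k, group) => k -> group.map(_._2).reduce(f) }

  // Sum the values per key.
  val totals: Map[String, Int] = reduceByKey(pairs)(_ + _)
}
```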

Big Data from Beginner to Mastery 5 -- Using the reduce Method on Spark RDDs

Summary: Training series 5 -- using the reduce method on Spark RDDs. 1. Prepare the data in the spark-shell environment. val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv") val header= collegesRdd.first val...

posted @ 2018-11-22 11:24 David_Zhu Views(729) Comments(0) Recommended(0)
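
reduce combines all elements into a single value by applying a two-argument function pairwise. A minimal sketch on a plain Scala List, with made-up enrollment numbers, might look like this; RDD.reduce takes the same kind of function, which should be associative and commutative because partitions are combined in no fixed order.

```scala
object ReduceDemo {
  // Hypothetical per-college enrollment figures.
  val enrollments: List[Int] = List(1200, 800, 300, 500)

  // Sum of all elements.
  val total: Int = enrollments.reduce(_ + _)

  // Maximum, written out as an explicit pairwise comparison.
  val largest: Int = enrollments.reduce((a, b) => if (a > b) a else b)
}
```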

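Big Data from Beginner to Mastery 4 -- Using map on Spark RDDs

Summary: Having covered RDD filter in the previous post, this one covers Spark's map. 1. Load the file. val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv") val header= collegesRdd.first 2. Use filter to obtain the pure ...

posted @ 2018-11-21 10:55 David_Zhu Views(1507) Comments(0) Recommended(0)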
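
map turns each input element into exactly one output element. The sketch below parses CSV-like lines on a plain Scala List; the three-column layout (name, state, enrollment) and the rows are assumptions made for illustration, not the actual CollegeNavigator.csv schema.

```scala
object MapDemo {
  // Hypothetical header-less CSV lines: name, state, enrollment.
  val rows: List[String] =
    List("Alpha College,NY,1200", "Beta University,CA,800")

  // Transform each line into a (name, enrollment) pair.
  val parsed: List[(String, Int)] =
    rows.map { line =>
      val fields = line.split(",")
      (fields(0), fields(2).trim.toInt)
    }
}
```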

Big Data from Beginner to Mastery 3 - Spark RDD filter and filter Functions

Summary: 1. How to filter an RDD. 1. Remove the header line (the first row). scala> val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv") collegesRdd: org.apache.spark.rdd.RDD[String] = /u...

posted @ 2018-11-20 14:21 David_Zhu Views(1553) Comments(0) Recommended(0)
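
Dropping the header with filter can be tried on a local collection first. This is a minimal sketch with hypothetical file contents and a hypothetical inState helper; on the RDD, collegesRdd.first plays the role of lines.head. Note that comparing against the header removes every line equal to it, not only the first one.

```scala
object FilterHeaderDemo {
  // Hypothetical file contents: the first line is the header.
  val lines: List[String] = List(
    "name,state,enrollment",
    "Alpha College,NY,1200",
    "Beta University,CA,800"
  )

  val header: String = lines.head
  // Keep every line that is not identical to the header.
  val headerless: List[String] = lines.filter(_ != header)

  // filter also accepts a named function as its predicate.
  def inState(state: String)(line: String): Boolean =
    line.split(",")(1) == state

  val nyOnly: List[String] = headerless.filter(inState("NY"))
}
```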

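Big Data from Beginner to Mastery 2 -- Three Ways to Load Data into a Spark RDD

Summary: Log in to the operating system as the hdfs or spark user and run spark-shell. spark-shell also accepts arguments, which override the default settings: spark-shell --master yarn --num-executors 2 --executor-memory 2G --driver-memory 15...

posted @ 2018-11-14 15:39 David_Zhu Views(929) Comments(0) Recommended(0)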

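Big Data from Beginner to Mastery 1 -- Basic HDFS File Operations in a Big Data Environment

Summary: 1. Log in as the hdfs or hadoop user. 2. Run the following commands in a Linux shell: hadoop fs -put 'local file name' hadoop fs -put '/home/hdfs/sample/sample.txt' hadoop fs -ls / lists the individual file names. hadoop fs - ...

posted @ 2018-11-14 15:36 David_Zhu Views(471) Comments(0) Recommended(0)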
