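02-01-What Is Big Data
Examples of big data applications:
1. E-commerce recommendation systems
Storage: how to store a large volume of orders
Computation: how to process a large volume of orders
2. Weather forecasting
Storage: how to store a large volume of weather data
Computation: how to process a large volume of weather data
Core problems:
1. Storage: a distributed file system: HDFS (Hadoop Distributed File System)
2. Computation: the issue is not the algorithm but distributed computing: MapReduce, Spark (RDD: Resilient Distributed Dataset)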
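02-02-Data Warehouses and Big Data
A data warehouse is simply a database (Oracle, MySQL, MS SQL Server) that is generally used only for select queries.
Figure: The process of building a data warehouse (搭建数据仓库的过程.png)
A Data Warehouse can be built on a traditional database such as Oracle or MySQL, or on Hadoop or Spark.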
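02-03-OLTP and OLAP
1. OLTP: Online Transaction Processing. It covers insert, update, and delete operations, that is, transactions. This is the problem that traditional relational databases solve.
2. OLAP: Online Analytical Processing. It generally involves only select queries (analysis).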
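The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns, and the values are hypothetical, and SQLite stands in for the Oracle or MySQL databases named in these notes.

```python
import sqlite3

# Hypothetical in-memory table; schema and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP-style work: small write transactions (insert / update / delete).
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('alice', 30.0)")
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 20.0)")
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('alice', 50.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 25.0 WHERE customer = 'bob'")

# OLAP-style work: a read-only aggregate query over all rows.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The writes touch one row at a time inside a transaction, while the final query only reads and summarizes, which is the pattern a data warehouse is built for.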
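02-04-Basic Ideas of Distributed File Systems
- GFS: Google File System ---- HDFS: Hadoop Distributed File System
  - A distributed file system
  - Addresses the storage problem of big data
  - In HDFS, the location information of stored data (the metadata) is recorded -----> using an inverted index (Inverted Index)
    - What is an index?
      (1) Created with create index
      (2) It is essentially a table of contents
      (3) The index lets you locate the corresponding data
      (4) Question: does an index always speed up queries?
    - What is an inverted index?
  - Demo: using a pseudo-distributed environment as the example
- MapReduce: a distributed computing model; the problem it originated from is PageRank (web page ranking)
- BigTable: a big table ------ NoSQL database: HBase
Figure: Basic ideas of distributed file systems (分布式文件系统的基本思想.png)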
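02-05-What Is Rack Awareness
Figure: The basic idea of rack awareness (机架感知的基本思想.png)
02-06-What Is an Inverted Index
Figure: What is an index (什么是索引.png)
Figure: What is an inverted index (什么是倒排索引.png)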
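As a rough illustration of the idea, an inverted index maps each word to the documents that contain it, rather than mapping each document to its words. The sketch below is a minimal in-memory version; the function name build_inverted_index, the document IDs, and the sample sentences are assumptions made for the example, and it is not how HDFS stores its metadata.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical documents keyed by an illustrative ID.
docs = {
    1: "I love Beijing",
    2: "I love China",
    3: "Beijing is the capital of China",
}
index = build_inverted_index(docs)
print(index["Beijing"])
print(index["love"])
```

A lookup such as index["Beijing"] answers "which documents mention this word" directly, without scanning every document.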
02-07-HDFS Architecture and Demo
02-08-What Is PageRank
Figure: Google's vector matrix (Google的向量矩阵.png)
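The notes do not spell out the computation, so the following is only a sketch of the commonly described iterative form of PageRank, not necessarily the formulation shown in the figure. The link graph, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for each page in a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, page C collects links from three other pages and ends up with the highest rank, while D, which nothing links to, keeps only the baseline share.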
02-09-The MR Programming Model
Figure: The MapReduce programming model (MapReduce的编程模型.png)
02-10-Demo-WordCount
[root@demo11 ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-resourcemanager-demo11.out
localhost: starting nodemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-nodemanager-demo11.out
[root@demo11 ~]# jps
16164 ResourceManager
16596 Jps
15976 SecondaryNameNode
15772 DataNode
15661 NameNode
16271 NodeManager
[root@demo11 ~]# hdfs dfs -ls /input
Found 3 items
-rw-r--r-- 1 root supergroup 204 2018-08-14 11:18 /input/a.xml
-rw-r--r-- 1 root supergroup 60 2018-08-13 23:48 /input/data.txt
-rw-r--r-- 1 root supergroup 30826876 2018-08-17 10:19 /input/sales
[root@demo11 ~]# hdfs dfs -cat /input/data.txt
I love Beijing
I love China
Beijing is the capital of China
[root@demo11 ~]# cd training/hadoop-2.7.3/share/hadoop/mapreduce/
[root@demo11 mapreduce]# pwd
/root/training/hadoop-2.7.3/share/hadoop/mapreduce
[root@demo11 mapreduce]# ls hadoop-mapreduce-examples-2.7.3.jar
hadoop-mapreduce-examples-2.7.3.jar
[root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /input/data.txt /output/day0829/wc1
The job's execution can be monitored at http://192.168.157.11:8088/cluster (the YARN web console).
YARN web console
![](0001.大数据课程概述与大数据背景知识.assets/web console.png)
[root@demo11 mapreduce]# hdfs dfs -ls /output/day0829/wc1
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-08-29 20:57 /output/day0829/wc1/_SUCCESS
-rw-r--r-- 1 root supergroup 55 2018-08-29 20:57 /output/day0829/wc1/part-r-00000
[root@demo11 mapreduce]# hdfs dfs -cat /output/day0829/wc1/part-r-00000
Beijing 2
China 2
I 2
capital 1
is 1
love 2
of 1
the 1
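The counts above can be reproduced locally with a small single-process sketch of the map, shuffle, and reduce steps. This is plain Python, not the Hadoop API; the function names map_phase, shuffle_phase, and reduce_phase are hypothetical, and the input lines are copied from /input/data.txt as shown earlier in the demo.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the values collected for one key."""
    return key, sum(values)


# Contents of /input/data.txt from the demo.
lines = [
    "I love Beijing",
    "I love China",
    "Beijing is the capital of China",
]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
# Keys are processed in sorted order, as in the job's output file.
counts = dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))
for word, count in counts.items():
    print(word, count)
```

In the real job the map and reduce steps run as separate tasks across the cluster; here they are ordinary function calls, which is enough to check the logic against part-r-00000.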