Hadoop Study Notes -- Day 1

Glossary

CDH # Cloudera's Distribution including Apache Hadoop
ecosystem projects
Subscription
Volume # the amount of data (one of the "3 Vs" of big data)
Velocity # the speed at which data arrives
Variety # the range of data types and sources
ETL # Extract, Transform, Load
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment
Pattern recognition
ingests data # takes data in
HDFS # Hadoop Distributed File System
inexpensive reliable storage
industry-standard
native file system # the operating system's local file system
Distributed columnar key-value storage
vice versa # the other way around
high throughput
scalable messaging
Distributed, reliable publish-subscribe
large-scale
extensive and mature fault tolerance
Data Analysis and Exploration
Inspired by
low latency
dialect
abstraction layer
Hue # Hadoop User Experience
hypothetical scenario
YARN # Yet Another Resource Negotiator
RDDs # Resilient Distributed Datasets




Spark # a large-scale, general-purpose data processing engine
# Runs on Hadoop clusters; processes data stored in HDFS
# Supports a wide range of workloads:
# machine learning
# business intelligence
# stream processing
# batch processing
# querying structured data
Hadoop is a framework for distributed storage and processing
Hadoop Core includes HDFS for storage and YARN for cluster resource management
The Hadoop ecosystem includes many components for
# Ingesting data (Flume for logs, Sqoop for databases, Kafka for messaging)
# Storing data (HDFS for files, Kudu for columnar storage)
# Processing data (Spark, Hadoop MapReduce, Pig)
# Modeling data as tables for SQL access (Impala for low-latency queries, Hive)
# Exploring data (Hue, Search)



Chapter 1: How Hadoop Stores and Accesses Data

▪ How the Apache Hadoop Distributed File System (HDFS) stores data across a cluster
Key concepts:
#A cluster is a group of computers working together
#A node is an individual computer in the cluster
#A daemon is a program running in the background on a node
#HDFS is a file system written in Java
#Sits on top of a native file system (such as ext3, ext4, or xfs)
#Provides redundant storage for massive amounts of data
#HDFS performs best with a "modest" number of large files (roughly 100MB and up)
#Files in HDFS are "write once": no random writes or in-place updates
#HDFS is optimized for large, streaming reads of files


There are three main components of a cluster:
#Storage
#Resource Management
#Processing

How storage works:
#Data files are split into blocks (default 128MB), which are distributed across the cluster at load time
#The blocks are stored on cluster worker nodes running the HDFS DataNode service
#Each block is replicated on multiple DataNodes (default 3x), as the fsck example below shows
#A cluster master node runs the HDFS NameNode service, which stores file metadata
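
You can inspect how a stored file was actually split and replicated with the fsck tool; for example (the path here is just an illustration):

$ hdfs fsck /user/fred/bar.txt -files -blocks -locations

This prints each block of the file, its replication factor, and which DataNodes hold the replicas.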

Command-line operations:
Put a file into HDFS:
$ hdfs dfs -put foo.txt foo.txt

List files:
$ hdfs dfs -ls        # the user's home directory
$ hdfs dfs -ls /      # the root directory

Display the contents of a file:
$ hdfs dfs -cat /user/fred/bar.txt

Get a file out of HDFS:
$ hdfs dfs -get /user/fred/bar.txt baz.txt

Create a directory:
$ hdfs dfs -mkdir input

Delete a file or directory:
$ hdfs dfs -rm input_old/myfile    # delete a file
$ hdfs dfs -rm input_old/*         # delete files matching a wildcard
$ hdfs dfs -rm -r input_old        # delete a directory
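
A few more hdfs dfs subcommands that come up often (the paths here are made-up examples):

$ hdfs dfs -mv input/myfile input_old/myfile   # move or rename within HDFS
$ hdfs dfs -cp input/myfile backup/myfile      # copy within HDFS
$ hdfs dfs -du -h input                        # disk usage, human-readable
$ hdfs dfs -help                               # list all subcommands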


▪ How to use HDFS using the Hue File Browser or the hdfs command
#Works much like a personal cloud drive: create, move, rename, upload, download, and delete directories and files

Key points:
▪ The Hadoop Distributed File System (HDFS) is the main storage layer for Hadoop
▪ HDFS chunks data into blocks and distributes them across the cluster when data is stored
▪ HDFS clusters are managed by a single NameNode running on a master node
▪ Access HDFS using Hue, the hdfs command, or the HDFS API


Chapter 2: How Hadoop Does Distributed Processing
▪ How Hadoop YARN provides cluster resource management for distributed data processing
YARN is the Hadoop processing layer; it provides resource management and job scheduling

▪ ResourceManager (RM)
# Runs on the master node
# Global resource scheduler
# Arbitrates system resources between competing applications
# Has a pluggable scheduler to support different algorithms (such as the Capacity or Fair Scheduler)
▪ NodeManager
# Runs on worker nodes
# Communicates with the RM
# Manages node resources
# Launches containers
▪ Containers
# A container allocates a certain amount of resources (memory, CPU cores) on a worker node
# Applications run in one or more containers
# Applications request containers from the RM
▪ ApplicationMaster (AM)
# One per application
# Framework/application specific
# Runs in a container
# Requests more containers to run application tasks


▪ Each application consists of one or more containers
# The ApplicationMaster runs in one container
# The application’s distributed processes (JVMs) run in other containers
# The processes run in parallel, and are managed by the AM
# The processes are called executors in Apache Spark and tasks in Hadoop MapReduce (see the sample submission below)
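
To make this concrete, here is a hypothetical Spark submission to YARN (the script name and resource sizes are invented); YARN starts one container for the ApplicationMaster, which then requests two more containers for the executors:

$ spark-submit --master yarn --num-executors 2 --executor-memory 2G --executor-cores 2 wordcount.py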

▪ Developers need to be able to
# Submit jobs (applications) to run on the YARN cluster
# Monitor and manage jobs

▪ There are three major YARN tools for developers
# The Hue Job Browser
# The YARN ResourceManager web UI
# The YARN command line

▪ YARN administrators can use Cloudera Manager
# May also be helpful for developers
# Included in Cloudera Express and Cloudera Enterprise





▪ How to use Hue, the YARN web UI, or the yarn command to monitor your cluster
List applications:
$ yarn application -list

Kill an application:
$ yarn application -kill app-id

View application logs:
$ yarn logs -applicationId app-id

Get help:
$ yarn -help
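
One more subcommand worth knowing: check the state of a single application by ID:

$ yarn application -status app-id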

Key points:
▪ YARN manages resources in a Hadoop cluster and schedules applications;
▪ Worker nodes run NodeManager daemons, managed by a ResourceManager on a master node;
▪ Applications running on YARN consist of an ApplicationMaster and one or more executors;
▪ Use Hue, the YARN ResourceManager web UI, the yarn command, or Cloudera Manager to monitor applications;


Chapter 3: Spark Basics

▪ How Spark SQL fits into the Spark stack
▪ How to start and use the Python and Scala Spark shells
▪ What a DataFrame is and how to perform simple queries
▪ Spark provides a stack of libraries built on core Spark
# Core Spark provides the fundamental Spark abstraction: Resilient Distributed Datasets (RDDs)
# Spark SQL works with structured data
# MLlib supports scalable machine learning
# Spark Streaming applications process data in real time
# GraphX works with graphs and graph-parallel computation


Spark SQL
▪ The DataFrame and Dataset API
# The primary entry point for developing Spark applications
# DataFrames and Datasets are abstractions for representing structured data
▪ Catalyst Optimizer: an extensible optimization framework
▪ A SQL engine and command line interface (a minimal example follows)
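
In a standalone application (outside the shell), the entry point to all of this is a SparkSession; a minimal Python sketch, reusing the users.json file from the exercise later in these notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UsersDemo").getOrCreate()
usersDF = spark.read.json("/usr/jacksun/users.json")  # DataFrame from a JSON file
usersDF.createOrReplaceTempView("users")              # register as a SQL table
spark.sql("SELECT name, age FROM users WHERE age > 20").show()
spark.stop()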



Spark Shell
▪ The Spark shell provides an interactive Spark environment
# Often called a REPL, or Read/Evaluate/Print Loop
# For learning, testing, data exploration, or ad hoc analytics
# You can run the Spark shell using either Python or Scala

▪ You typically run the Spark shell on a gateway node

$ pyspark --master yarn
▪ The possible values for the master option include (see the example after this list)
# yarn
# spark://masternode:port (Spark Standalone)
# mesos://masternode:port (Mesos)
# local[*] runs locally with as many threads as cores (default)
# local[n] runs locally with n threads
# local runs locally with a single thread
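
For example, to start the Python shell with an explicit master and confirm what it connected to:

$ pyspark --master local[2]
>>> sc.master         # prints the master URL the shell is using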

▪ DataFrames and Datasets are the primary representations of data in Spark
▪ DataFrames represent structured data in a tabular form
# DataFrames model data similar to tables in an RDBMS
# DataFrames consist of a collection of loosely typed Row objects
# Rows are organized into columns described by a schema

▪ Datasets represent data as a collection of objects of a specified type
# Datasets are strongly-typed—type checking is enforced at compile time rather than run time
# An associated schema maps object properties to a table-like structure of rows and columns
# Datasets are only defined in Scala and Java
# DataFrame is an alias for Dataset[Row]—Datasets containing Row objects

▪ DataFrames contain an ordered collection of Row objects (illustrated after this list)
# Rows contain an ordered collection of values
# Row values can be basic types (such as integers, strings, and floats) or collections of those types (such as arrays and lists)
# A schema maps column names and types to the values in a row
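
A small illustration of rows, columns, and an inferred schema, built directly in the shell (the names and values are made up):

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(name="Alice", age=33), Row(name="Bob", age=28)])
>>> df.printSchema()   # schema inferred from the Row fields
>>> df.show()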


Exercise:
$ mkdir /usr/jacksun
$ cd /usr/jacksun
$ vi users.json
$ hdfs dfs -mkdir /usr/jacksun
$ hdfs dfs -put users.json /usr/jacksun/users.json
$ pyspark
>>> usersDF = spark.read.json("/usr/jacksun/users.json")
>>> usersDF.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- pcode: string (nullable = true)
>>> usersDF.show()
+----+-------+-----+
| age|   name|pcode|
+----+-------+-----+
|null|  Alice|94304|
|  30|Brayden|94304|
|  19|  Carla|10036|
|  46|  Diana| null|
|null|Etienne|94104|
+----+-------+-----+
▪ There are two main types of DataFrame operations
# Transformations create a new DataFrame based on existing one(s)
# Transformations are executed in parallel by the application's executors
# Actions output data values from the DataFrame
# Output is typically returned from the executors to the main Spark program (called the driver) or saved to a file

What is the difference between actions and transformations on a DataFrame?
# Transformations (such as selecting or filtering) describe how to derive new data; the results stay distributed in memory on the executors
# Actions return data from the executors to the driver, or save it to a file

▪ Some common DataFrame actions include (demonstrated after this list)
# count: returns the number of rows
# first: returns the first row (synonym for head())
# take(n): returns the first n rows as an array (synonym for head(n))
# show(n): display the first n rows in tabular form (default is 20 rows)
# collect: returns all the rows in the DataFrame as an array
# write: save the data to a file or other data source
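
Using the usersDF from the exercise above, a quick run through these actions (the output path is just an example):

>>> usersDF.count()                  # number of rows
>>> usersDF.first()                  # first Row
>>> usersDF.take(2)                  # first two Rows as a list
>>> usersDF.show(2)                  # display the first two rows
>>> usersDF.write.json("users_out")  # save to a directory (fails if it already exists)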


▪ Transformations create a new DataFrame based on an existing one
# The new DataFrame may have the same schema or a different one
▪ Transformations do not return any values or data to the driver
# Data remains distributed across the application's executors
▪ DataFrames are immutable (see the example below)
# Data in a DataFrame is never modified
# Use transformations to create a new DataFrame with the data you need
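
Immutability in practice: a transformation returns a new DataFrame and leaves the original untouched:

>>> namesDF = usersDF.select("name")   # new DataFrame with one column
>>> usersDF.columns                    # the original still has all its columns
['age', 'name', 'pcode']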

▪ Common transformations include
─ select: only the specified columns are included
─ where: only rows where the specified expression is true are included(synonym for filter)
─ orderBy: rows are sorted by the specified column(s) (synonym for sort)
─ join: joins two DataFrames on the specified column(s)
─ limit(n): creates a new DataFrame with only the first n rows

>>> usersDF.select("age", "name").where("age > 20").orderBy("age").limit(1).show()
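
The one transformation from the list above without an example is join; a hypothetical sketch, building a second DataFrame of postal codes to join against:

>>> pcodesDF = spark.createDataFrame([("94304", "Palo Alto"), ("10036", "New York")], ["pcode", "city"])
>>> usersDF.join(pcodesDF, "pcode").show()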




▪ Apache Spark is a framework for analyzing and processing big data
▪ The Python and Scala Spark shells are command line REPLs for executing Spark interactively
─ Spark applications run in batch mode outside the shell
▪ DataFrames represent structured data in tabular form by applying a schema
▪ Types of DataFrame operations
─ Transformations create new DataFrames by transforming data in existing ones
─ Actions collect values in a DataFrame and either save them or return them to the Spark driver
▪ A query consists of a sequence of transformations followed by an action











