4.RDD操作
1. RDD创建
- 从本地文件系统中加载数据创建RDD
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401110855468-808713412.png)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401111158561-340932319.png)
- 从HDFS加载数据创建RDD
# 启动HDFS
start-all.sh
# 查看HDFS文件
hdfs dfs -ls 查看的文件目录
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401110111198-1454703800.png)
# 上传文件到HDFS
hdfs dfs -put 本地文件路径 HDFS目的路径
# 查看HDFS文件
hdfs dfs -cat 文件名称
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401105951651-1464610403.png)
# HDFS加载数据创建RDD
lines=sc.textFile("hdfs://localhost:9000/user/llc.txt").foreach(print)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401111612564-880042603.png)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401110405471-955503567.png)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401110430048-1274154691.png)
# 停止hdfs
stop-all.sh
![](https://img2022.cnblogs.com/blog/2764973/202203/2764973-20220318191535293-1908274252.png)
- 通过并行集合(列表)创建RDD
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401112007624-133942379.png)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401160815435-421229284.png)
2. RDD操作
转换操作
- filter(func)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401142445449-1826650239.png)
- map(func)
- 字符串分词
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401143041018-2001590875.png)
- 数字加100
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401143218975-1004237362.png)
- 字符串加固定前缀
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401143601238-1519870773.png)
- flatMap(func)
- 分词
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401144105545-1577958895.png)
- 单词映射成键值对
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401144131898-1743085024.png)
- reduceByKey()
- 统计词频,累加
- 乘法法则
- groupByKey()
- 单词分组
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401150043278-1903257917.png)
- sortByKey()
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401150519489-1525532792.png)
- sortBy()
- 词频统计按词频排序
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401150644081-705482590.png)
- RDD写入文本文件
- 写入本地文件系统,并查看结果
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401151832869-1478854058.png)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401152458859-716987402.png)
行动操作
- foreach(print)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401154829404-538086175.png)
- foreach(lambda a:print(a.upper()))
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401155127109-1441368837.png)
- collect()
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401155227179-1007121021.png)
- count()
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401155338968-641056381.png)
- first()
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401155526052-1223497883.png)
- take(n)
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401155551172-1134659223.png)
- reduce()
- 数值型的rdd元素做累加
![](https://img2022.cnblogs.com/blog/2764973/202204/2764973-20220401155700507-1222597737.png)