0026. Spark Basics

Posted by 锦喵卫指挥使


20-05-Single-node recovery for Spark based on a file system directory

root@bigdata00:~# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [192.168.16.143]
192.168.16.143: starting namenode, logging to /root/training/hadoop-2.7.3/logs/hadoop-root-namenode-bigdata00.out
localhost: starting datanode, logging to /root/training/hadoop-2.7.3/logs/hadoop-root-datanode-bigdata00.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /root/training/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-bigdata00.out
starting yarn daemons
starting resourcemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-resourcemanager-bigdata00.out
localhost: starting nodemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-nodemanager-bigdata00.out
root@bigdata00:~# jps
2964 ResourceManager
2807 SecondaryNameNode
2247 NameNode
2507 DataNode
3515 Jps
3199 NodeManager
root@bigdata00:~# cd /root/training/spark-2.1.0-bin-hadoop2.7
root@bigdata00:~/training/spark-2.1.0-bin-hadoop2.7# sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /root/training/spark-2.1.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-bigdata00.out
192.168.16.143: starting org.apache.spark.deploy.worker.Worker, logging to /root/training/spark-2.1.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-bigdata00.out
root@bigdata00:~/training/spark-2.1.0-bin-hadoop2.7# bin/spark-shell --master spark://192.168.16.143:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/10/27 18:17:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/10/27 18:17:08 WARN Utils: Your hostname, bigdata00 resolves to a loopback address: 127.0.1.1; using 192.168.16.143 instead (on interface eth0)
20/10/27 18:17:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/10/27 18:17:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.16.143:4040
Spark context available as 'sc' (master = spark://192.168.16.143:7077, app id = app-20201027181710-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.


sbin/start-all.sh
bin/spark-shell --master spark://192.168.16.143:7077


20-06-Standby Master based on ZooKeeper


20-07-Using spark-submit


20-08-Using spark-shell

Estimating Pi with the Monte Carlo method

![](0026.Spark 基础.assets/蒙特卡罗求PI(圆周率).png)
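
The figure above covers estimating Pi with the Monte Carlo method. As a rough illustration of the idea only (not the spark-shell code from the screenshot, which is not reproduced in the text), the sketch below uses plain Java without Spark. It samples random points in the unit square and counts how many fall inside the quarter circle of radius 1; that fraction approaches π/4. The class name `MonteCarloPiSketch`, the method `estimate`, the sample count, and the seed are all assumptions made for this example.

```java
import java.util.Random;

// Hypothetical helper, not part of the original notes: plain-Java Monte Carlo estimate of Pi.
final class MonteCarloPiSketch {

    // Samples `samples` random points in the unit square [0,1) x [0,1) and returns
    // 4 * (fraction of points that fall inside the quarter circle x^2 + y^2 <= 1).
    static double estimate(long samples, long seed) {
        Random random = new Random(seed); // a fixed seed keeps runs reproducible
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        // Sample count and seed are arbitrary choices for this illustration.
        System.out.println("Pi is roughly " + estimate(1_000_000L, 42L));
    }
}
```

On a cluster, the same sampling loop would typically be spread across partitions and the per-partition hit counts summed; the `SparkPi` example shipped with Spark follows that pattern.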

Running WordCount step by step

![](0026.Spark 基础.assets/单步运行WordCount.png)
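
To make the individual WordCount steps concrete without a cluster, here is a small sketch using plain Java collections (no Spark). It mirrors the three stages used by the WordCount programs in these notes: split lines into words (`flatMap`), pair each word with 1 (`map` / `mapToPair`), and sum the counts per word (`reduceByKey`). The class name `WordCountStepsSketch`, its method names, and the sample lines are assumptions for illustration; the real job runs these stages on RDDs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not from the original notes: the WordCount stages on in-memory lists.
final class WordCountStepsSketch {

    // Stage 1 (like flatMap): split every line on a single space and flatten into one word list.
    static List<String> splitIntoWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            words.addAll(Arrays.asList(line.split(" ")));
        }
        return words;
    }

    // Stage 2 (like map / mapToPair): turn each word into a (word, 1) pair.
    static List<Map.Entry<String, Integer>> pairWithOne(List<String> words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Stage 3 (like reduceByKey): sum the values of all pairs that share a key.
    // A TreeMap is used only to get a deterministic, sorted order when printing.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input; print the result of each stage in turn.
        List<String> lines = Arrays.asList("I love Beijing", "I love China");
        List<String> words = splitIntoWords(lines);
        System.out.println(words);
        List<Map.Entry<String, Integer>> pairs = pairWithOne(words);
        System.out.println(pairs);
        System.out.println(sumByKey(pairs));
    }
}
```

Printing the result of each stage separately is the in-memory analogue of evaluating one transformation at a time in spark-shell and inspecting the intermediate RDD.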


20-09-Developing the Scala version of WordCount in an IDE


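package day1025 // matches --class day1025.MyWordCount in the submit command below

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

/*
 * Submit with spark-submit:
 * bin/spark-submit --master spark://bigdata111:7077 --class day1025.MyWordCount /root/temp/demo1.jar hdfs://bigdata111:9000/input/data.txt hdfs://bigdata111:9000/output/1025/demo1
 */

object MyWordCount {
  def main(args: Array[String]): Unit = {
    // Create the job configuration.
    // Setting the master to "local" runs the job in local mode.
    // When running on a cluster, do not set the master here; pass it to spark-submit with --master.
    //val conf = new SparkConf().setAppName("MyWordCount").setMaster("local")
    val conf = new SparkConf().setAppName("MyWordCount")

    // Create the SparkContext.
    val sc = new SparkContext(conf)

    // Run WordCount; args(0) is the input path.
    val result = sc.textFile(args(0))
      .flatMap(_.split(" "))  // split each line into words
      .map((_, 1))            // pair each word with a count of 1
      .reduceByKey(_ + _)     // sum the counts for each word

    // Print the result to the console.
    //result.foreach(println)

    // Write the result to HDFS; args(1) is the output directory, which must not exist yet.
    result.saveAsTextFile(args(1))

    // Stop the SparkContext.
    sc.stop()
  }
}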

20-10-Developing the Java version of WordCount in an IDE

package demo; // matches --class demo.JavaWordCount in the submit command below

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

/*
 * Submit with spark-submit:
 * bin/spark-submit --master spark://bigdata111:7077 --class demo.JavaWordCount /root/temp/demo2.jar hdfs://bigdata111:9000/input/data.txt
 */

public class JavaWordCount {

	public static void main(String[] args) {
		// Local mode: convenient for setting breakpoints in the IDE.
		// Note: a master set in code takes precedence over the --master option of spark-submit.
		SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");

		// Cluster mode: use this line instead and pass the master with --master.
		//SparkConf conf = new SparkConf().setAppName("JavaWordCount");

		// Create the SparkContext; in the Java API this is a JavaSparkContext.
		JavaSparkContext sc = new JavaSparkContext(conf);

		// Read the input data from HDFS; args[0] is the input path.
		JavaRDD<String> rdd1 = sc.textFile(args[0]);

		/*
		 * Split each line into words.
		 * FlatMapFunction is the interface that performs the split.
		 * First type parameter (String):  each input line
		 * Second type parameter (String): each output word
		 */
		JavaRDD<String> rdd2 = rdd1.flatMap(new FlatMapFunction<String, String>() {

			@Override
			public Iterator<String> call(String input) throws Exception {
				// Sample input: I love Beijing
				// Split on spaces.
				return Arrays.asList(input.split(" ")).iterator();
			}
		});

		/*
		 * Count each word once, producing a (k2, v2) pair:
		 * Beijing ---> (Beijing, 1)
		 * Type parameters of PairFunction:
		 * String:          the input word
		 * String, Integer: the types of the output key k2 and value v2
		 */
		JavaPairRDD<String, Integer> rdd3 = rdd2.mapToPair(new PairFunction<String, String, Integer>() {

			@Override
			public Tuple2<String, Integer> call(String word) throws Exception {
				return new Tuple2<String, Integer>(word, 1);
			}

		});

		// Reduce step: combine the values that share a key.
		JavaPairRDD<String, Integer> rdd4 = rdd3.reduceByKey(new Function2<Integer, Integer, Integer>() {

			@Override
			public Integer call(Integer a, Integer b) throws Exception {
				// Add the two partial counts.
				return a + b;
			}
		});

		// Trigger the computation (an action) and print the result to the console.
		List<Tuple2<String, Integer>> result = rdd4.collect();

		for (Tuple2<String, Integer> tuple : result) {
			System.out.println(tuple._1 + "\t" + tuple._2);
		}

		// Stop the JavaSparkContext.
		sc.stop();
	}
}