Spark - Broadcast Variables
- When a job spawns hundreds or thousands of tasks that all need the same piece of data, say 1 GB, shipping a separate copy to every task would cost hundreds or thousands of times 1 GB of memory, which is extremely wasteful. Spark therefore provides a data-sharing mechanism called the broadcast variable: the shared data is sent from the Driver to every worker node that participates in the computation, each worker keeps a single read-only (immutable) copy, and all the tasks running on that worker share that one copy. If, for example, 2 workers participate in the computation, the data is shipped only 2 times, which greatly reduces the memory overhead.
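As a minimal sketch of the mechanism just described (the lookup table and its values here are illustrative, not taken from the example below): the table is broadcast once from the Driver, and each task reads the worker-local copy through .value.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]"))
    // Shipped from the Driver once per worker, not once per task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(key => lookup.value.getOrElse(key, 0)) // read-only access to the shared copy
      .sum()
    println(total) // 4.0
    sc.stop()
  }
}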
1. Implementing IP address lookup with Spark
package cn.wc

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object ip_ocation {

  // Convert a dotted-quad IP string (e.g. "192.168.1.1") to its Long representation.
  def ip2Long(ip: String): Long = {
    val ips: Array[String] = ip.split("\\.")
    var ipNum: Long = 0L
    for (i <- ips) {
      ipNum = i.toLong | ipNum << 8L
    }
    ipNum
  }

  // Binary-search the sorted array of (startIpNum, endIpNum, longitude, latitude)
  // ranges; return the index of the range containing ipNum, or -1 if none matches.
  def binarySearch(ipNum: Long, city_ip_Array: Array[(String, String, String, String)]): Int = {
    var start = 0
    var end = city_ip_Array.length - 1
    while (start <= end) {
      val middle = (start + end) / 2
      if (ipNum >= city_ip_Array(middle)._1.toLong && ipNum <= city_ip_Array(middle)._2.toLong) {
        return middle
      }
      if (ipNum < city_ip_Array(middle)._1.toLong) {
        end = middle - 1
      } else {
        start = middle + 1
      }
    }
    -1
  }

  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("IpOcation").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")

    // Load the "|"-delimited IP rule file; fields 2 and 3 hold the numeric start/end
    // of each IP range, and the last two fields hold the longitude and latitude.
    val city_id_rdd: RDD[(String, String, String, String)] = sc.textFile("J:\\ips.txt")
      .map(x => x.split("\\|"))
      .map(x => (x(2), x(3), x(x.length - 2), x(x.length - 1)))

    // Broadcast the (small) rule table so each worker holds a single read-only copy
    // instead of receiving it with every task.
    val cityTpBroadcase: Broadcast[Array[(String, String, String, String)]] =
      sc.broadcast(city_id_rdd.collect())

    // Extract the IP field from each log line.
    val ipsRDD: RDD[String] = sc.textFile("J:\\flow.format").map(x => x.split("\\|")(1))

    // For every IP, look up its (longitude, latitude) and emit a count of 1.
    // IPs that match no range (index -1) are skipped instead of crashing the task.
    val result: RDD[((String, String), Int)] = ipsRDD.mapPartitions(iter => {
      val city_ip_Array: Array[(String, String, String, String)] = cityTpBroadcase.value
      iter.flatMap(ip => {
        val ipNum: Long = ip2Long(ip)
        val index: Int = binarySearch(ipNum, city_ip_Array)
        if (index == -1) {
          None
        } else {
          val value: (String, String, String, String) = city_ip_Array(index)
          Some(((value._3, value._4), 1))
        }
      })
    })

    // Aggregate hit counts per (longitude, latitude).
    val finalResult: RDD[((String, String), Int)] = result.reduceByKey(_ + _)
    finalResult.foreach(println)

    // Write the results to MySQL, opening one connection per partition.
    finalResult.foreachPartition(iter => {
      val connection: Connection = DriverManager.getConnection(
        "jdbc:mysql://127.0.0.1:3306/spark", "root", "123")
      val sql = "insert into flow(longitude, latitude, total) values (?,?,?)"
      var ps: PreparedStatement = null
      try {
        ps = connection.prepareStatement(sql)
        iter.foreach(line => {
          ps.setString(1, line._1._1)
          ps.setString(2, line._1._2)
          ps.setInt(3, line._2)
          ps.execute()
        })
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        if (ps != null) {
          ps.close()
        }
        if (connection != null) {
          connection.close()
        }
      }
    })

    sc.stop()
  }
}
2. Reading file data with Spark and saving it to HBase
First add the HBase client dependency to the project:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.1</version>
</dependency>
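The original post stops at the dependency, so what follows is only a hedged sketch of what the section title describes, using the hbase-client 1.2.1 API and the same one-connection-per-partition pattern as the MySQL writer above. The input path, ZooKeeper quorum, table name t_data, and column family f1 are illustrative assumptions, not values from the original.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object File2HBase {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("File2HBase").setMaster("local[2]"))
    // Assumed input: "|"-delimited lines whose first field serves as the row key.
    val lines = sc.textFile("J:\\data.txt")
    lines.foreachPartition(iter => {
      // One HBase connection per partition, mirroring the JDBC pattern above.
      val conf = HBaseConfiguration.create()
      conf.set("hbase.zookeeper.quorum", "node1:2181") // hypothetical ZooKeeper address
      val connection = ConnectionFactory.createConnection(conf)
      val table = connection.getTable(TableName.valueOf("t_data")) // hypothetical table
      try {
        iter.foreach(line => {
          val fields = line.split("\\|")
          val put = new Put(Bytes.toBytes(fields(0)))
          // Store the whole line under column family f1, qualifier "content".
          put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("content"), Bytes.toBytes(line))
          table.put(put)
        })
      } finally {
        table.close()
        connection.close()
      }
    })
    sc.stop()
  }
}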