Summary: The common Spark problems we diagnose almost all come down to OOM. Start with Spark's memory model: inside an Executor, memory is divided into three regions: execution memory, storage memory, and other memory. Execution memory is where, according to the docs, joins and aggregates run; sh…
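The three-way split described above can be sketched numerically. A minimal plain-Python sketch, assuming the post refers to Spark's unified memory model with its commonly documented defaults (300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5); exact defaults vary by Spark version, so check your version's tuning guide:

```python
# Sketch of the memory split inside one executor, under assumed defaults:
# 300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5.
RESERVED_MB = 300

def executor_memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    usable = heap_mb - RESERVED_MB           # memory Spark manages
    unified = usable * memory_fraction       # execution + storage pool
    storage = unified * storage_fraction     # cached blocks (evictable)
    execution = unified - storage            # joins, aggregates, shuffles
    other = usable - unified                 # user data structures, metadata
    return execution, storage, other

execution, storage, other = executor_memory_split(4096)
print(execution, storage, other)
```

With a 4 GB heap this gives roughly 1138.8 MB each for execution and storage and about 1518.4 MB for "other", which is why OOMs often mean the "other" region (user objects) grew, not the unified pool.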
Summary: Taking the 4 quantile cuts (quartiles): df1 = spark.createDataFrame([(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10),(2,1),(2,10),(2,100)], ['id','cnt']) cnt_med_1 = F.…
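The truncated line presumably continues with something like `F.expr(...)` over `percentile_approx`, or `DataFrame.approxQuantile`. As a plain-Python illustration of the quartiles themselves for the id=1 rows of the sample data (using the standard library's exclusive-interpolation method; other conventions give slightly different cut points):

```python
import statistics

# cnt values for id == 1 in the sample DataFrame from the summary
cnt = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# n=4 yields the three cut points separating the four quartiles
q1, median, q3 = statistics.quantiles(cnt, n=4)
print(q1, median, q3)  # 2.75 5.5 8.25
```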
Summary: select id, cnt, sum(cnt) over w as sum_cnt from ( select 'a' as id, 1 as cnt union all select 'a' as id, 9 as cnt union all select 'a' as id, 4 as cnt uni…
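The window definition `w` itself is cut off, so its frame is unknown; assuming a running-total frame (partitioned by id, ordered, ROWS UNBOUNDED PRECEDING), the same logic in plain Python is just a per-group cumulative sum:

```python
from itertools import accumulate

# Hypothetical per-id rows echoing the summary's UNION ALL values
rows = {"a": [1, 9, 4]}

# sum(cnt) OVER (PARTITION BY id ORDER BY ... ROWS UNBOUNDED PRECEDING)
running = {gid: list(accumulate(cnts)) for gid, cnts in rows.items()}
print(running)  # {'a': [1, 10, 14]}
```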
Summary: # Example 1 import matplotlib.pyplot as plt data = [[1,2,3,4],[6,5,4,3],[1,3,5,1]] table = plt.table(cellText=data, colLabels=['A', 'B', 'C', 'D'], loc='cen…
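A runnable completion of what the snippet appears to do; the cut-off argument is presumably loc='center', and the figure handling here is a guess, not the original post's code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

data = [[1, 2, 3, 4], [6, 5, 4, 3], [1, 3, 5, 1]]

fig, ax = plt.subplots()
ax.axis("off")  # hide the axes so only the table is drawn
table = ax.table(cellText=data, colLabels=["A", "B", "C", "D"], loc="center")
fig.savefig("table_demo.png")
```

With colLabels, the table holds one header row plus the three data rows, i.e. 16 cells in total.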
Summary: spark = SparkSession.builder. \ appName(app_name). \ enableHiveSupport(). \ config("spark.debug.maxToStringFields", "100"). \ config("spark.executor.m…
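The builder chain is cut off at "spark.executor.m…"; a hedged completion as a config sketch, where the continuation to spark.executor.memory and its value are assumptions, not the original post's settings:

```python
from pyspark.sql import SparkSession

# Placeholder values: the original chain is truncated, so the
# spark.executor.memory key and "4g" below are assumptions.
spark = (
    SparkSession.builder
    .appName("app_name")
    .enableHiveSupport()
    .config("spark.debug.maxToStringFields", "100")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```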
Summary: Method 1: ALTER TABLE kuming.tableName DELETE WHERE toDate(insert_at_timestamp)='2020-07-21'; Method 2: ALTER TABLE kuming.tableName DELETE WHERE insert_at_time…
Summary: (A solid, practical write-up by someone else; reposted with credit.) Source: http://www.crazyant.net/1197.html Hive's INSERT statement can take its data from a query and load it into a target table at the same time. Suppose there is an existing, populated table staged_employees (a full employee-information table) with country cnty and state st…
Summary: The approach taken is to write the updated dimension table into the latest partition. # coding=utf-8 from pyspark.sql.types import IntegerType, StructType from pyspark.sql import SparkSession import datetime from…
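The "latest partition" idea boils down to deriving a partition key from a date when writing, and always reading the maximum key. A minimal plain-Python sketch; the yyyymmdd key format and the sample partition values are assumptions for illustration:

```python
import datetime

# Hypothetical partition values of a Hive-style dimension table,
# e.g. dt=20200718 / dt=20200719 / dt=20200720
partitions = ["20200718", "20200719", "20200720"]

# Writing: a snapshot taken on a given day goes into that day's partition
today_key = datetime.date(2020, 7, 21).strftime("%Y%m%d")

# Reading: consumers always take the newest partition available
latest = max(partitions)
print(today_key, latest)  # 20200721 20200720
```

Because yyyymmdd strings sort the same way as the dates they encode, `max()` on the partition values is enough to find the freshest snapshot.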
Summary: aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "0*" --include "1*" --include "2*" --in…
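aws s3 sync applies --exclude/--include filters in command order, with the last matching filter winning, so the pattern above excludes everything and then re-includes keys by prefix (the full include list is truncated). A plain-Python sketch of that last-match-wins rule, using only the three includes that are visible:

```python
import fnmatch

# Filters in command order; only the patterns visible in the truncated
# summary are listed here, the original command has more.
filters = [("exclude", "*"), ("include", "0*"),
           ("include", "1*"), ("include", "2*")]

def is_synced(key):
    decision = "include"  # aws s3 sync includes everything by default
    for action, pattern in filters:
        if fnmatch.fnmatch(key, pattern):
            decision = action  # the last matching filter wins
    return decision == "include"

print([k for k in ["0ab", "19x", "2-z", "9zz"] if is_synced(k)])
# -> ['0ab', '19x', '2-z']
```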
Summary: #CSV mySchema = StructType().add("id", IntegerType(), True).add("name", StringType(), True) df = spark.readStream.option("sep",",").option("header","fal…