Summary: The common Spark problems we diagnose almost all come down to OOM. Start with Spark's memory model: inside an Executor, memory is divided into three regions: execution memory, storage memory, and other memory. Execution memory is where, according to the docs, joins and aggregates run; sh…
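The three-way split described above can be sketched numerically. A minimal plain-Python sketch, assuming the post refers to Spark's unified memory model with its commonly documented defaults (300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5); exact defaults vary by Spark version, so check your version's tuning guide:

```python
# Sketch of the memory split inside one executor, under assumed defaults:
# 300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5.
RESERVED_MB = 300

def executor_memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    usable = heap_mb - RESERVED_MB           # memory Spark manages
    unified = usable * memory_fraction       # execution + storage pool
    storage = unified * storage_fraction     # cached blocks (evictable)
    execution = unified - storage            # joins, aggregates, shuffles
    other = usable - unified                 # user data structures, metadata
    return execution, storage, other

execution, storage, other = executor_memory_split(4096)
print(execution, storage, other)
```

With a 4 GB heap this gives roughly 1138.8 MB each for execution and storage and about 1518.4 MB for "other", which is why OOMs often mean the "other" region (user objects) grew, not the unified pool.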
Summary: Taking the 4 quantile cuts (quartiles): df1 = spark.createDataFrame([(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10),(2,1),(2,10),(2,100)], ['id','cnt']) cnt_med_1 = F.…
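The truncated line presumably continues with something like `F.expr(...)` over `percentile_approx`, or `DataFrame.approxQuantile`. As a plain-Python illustration of the quartiles themselves for the id=1 rows of the sample data (using the standard library's exclusive-interpolation method; other conventions give slightly different cut points):

```python
import statistics

# cnt values for id == 1 in the sample DataFrame from the summary
cnt = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# n=4 yields the three cut points separating the four quartiles
q1, median, q3 = statistics.quantiles(cnt, n=4)
print(q1, median, q3)  # 2.75 5.5 8.25
```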
Summary: select id, cnt, sum(cnt) over w as sum_cnt from ( select 'a' as id, 1 as cnt union all select 'a' as id, 9 as cnt union all select 'a' as id, 4 as cnt uni…
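The window definition `w` itself is cut off, so its frame is unknown; assuming a running-total frame (partitioned by id, ordered, ROWS UNBOUNDED PRECEDING), the same logic in plain Python is just a per-group cumulative sum:

```python
from itertools import accumulate

# Hypothetical per-id rows echoing the summary's UNION ALL values
rows = {"a": [1, 9, 4]}

# sum(cnt) OVER (PARTITION BY id ORDER BY ... ROWS UNBOUNDED PRECEDING)
running = {gid: list(accumulate(cnts)) for gid, cnts in rows.items()}
print(running)  # {'a': [1, 10, 14]}
```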
Summary: # Example 1 import matplotlib.pyplot as plt data = [[1,2,3,4],[6,5,4,3],[1,3,5,1]] table = plt.table(cellText=data, colLabels=['A', 'B', 'C', 'D'], loc='cen…
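A runnable completion of what the snippet appears to do; the cut-off argument is presumably loc='center', and the figure handling here is a guess, not the original post's code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

data = [[1, 2, 3, 4], [6, 5, 4, 3], [1, 3, 5, 1]]

fig, ax = plt.subplots()
ax.axis("off")  # hide the axes so only the table is drawn
table = ax.table(cellText=data, colLabels=["A", "B", "C", "D"], loc="center")
fig.savefig("table_demo.png")
```

With colLabels, the table holds one header row plus the three data rows, i.e. 16 cells in total.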
Summary: spark = SparkSession.builder. \ appName(app_name). \ enableHiveSupport(). \ config("spark.debug.maxToStringFields", "100"). \ config("spark.executor.m…
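The builder chain is cut off at "spark.executor.m…"; a hedged completion as a config sketch, where the continuation to spark.executor.memory and its value are assumptions, not the original post's settings:

```python
from pyspark.sql import SparkSession

# Placeholder values: the original chain is truncated, so the
# spark.executor.memory key and "4g" below are assumptions.
spark = (
    SparkSession.builder
    .appName("app_name")
    .enableHiveSupport()
    .config("spark.debug.maxToStringFields", "100")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```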
Summary: Method 1: ALTER TABLE kuming.tableName DELETE WHERE toDate(insert_at_timestamp)='2020-07-21'; Method 2: ALTER TABLE kuming.tableName DELETE WHERE insert_at_time…
Summary: (A solid, practical write-up by someone else; reposted with credit.) Source: http://www.crazyant.net/1197.html Hive's INSERT statement can take its data from a query and load it into a target table at the same time. Suppose there is an existing, populated table staged_employees (a full employee-information table) with country cnty and state st…
Summary: The approach taken is to write the updated dimension table into the latest partition. # coding=utf-8 from pyspark.sql.types import IntegerType, StructType from pyspark.sql import SparkSession import datetime from…
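The "latest partition" idea boils down to deriving a partition key from a date when writing, and always reading the maximum key. A minimal plain-Python sketch; the yyyymmdd key format and the sample partition values are assumptions for illustration:

```python
import datetime

# Hypothetical partition values of a Hive-style dimension table,
# e.g. dt=20200718 / dt=20200719 / dt=20200720
partitions = ["20200718", "20200719", "20200720"]

# Writing: a snapshot taken on a given day goes into that day's partition
today_key = datetime.date(2020, 7, 21).strftime("%Y%m%d")

# Reading: consumers always take the newest partition available
latest = max(partitions)
print(today_key, latest)  # 20200721 20200720
```

Because yyyymmdd strings sort the same way as the dates they encode, `max()` on the partition values is enough to find the freshest snapshot.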
Summary: aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "0*" --include "1*" --include "2*" --in…
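aws s3 sync applies --exclude/--include filters in command order, with the last matching filter winning, so the pattern above excludes everything and then re-includes keys by prefix (the full include list is truncated). A plain-Python sketch of that last-match-wins rule, using only the three includes that are visible:

```python
import fnmatch

# Filters in command order; only the patterns visible in the truncated
# summary are listed here, the original command has more.
filters = [("exclude", "*"), ("include", "0*"),
           ("include", "1*"), ("include", "2*")]

def is_synced(key):
    decision = "include"  # aws s3 sync includes everything by default
    for action, pattern in filters:
        if fnmatch.fnmatch(key, pattern):
            decision = action  # the last matching filter wins
    return decision == "include"

print([k for k in ["0ab", "19x", "2-z", "9zz"] if is_synced(k)])
# -> ['0ab', '19x', '2-z']
```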
Summary: #CSV mySchema = StructType().add("id", IntegerType(), True).add("name", StringType(), True) df = spark.readStream.option("sep",",").option("header","fal…