March 15, 2021
Summary: from pyspark.sql import HiveContext; from pyspark import SparkContext, SparkConf; import pyspark.sql.functions as F; from pyspark.sql import Spark… Read more
posted @ 2021-03-15 23:50 boye169
Summary: pathA = [('a',1),('b',1),('c',2),('d',3)]; pathB = [('c',1),('d',3),('e',3),('f',4)]; a = sc.parallelize(pathA); b = sc.parallelize(pathB); a.join(b).col… Read more
posted @ 2021-03-15 23:45 boye169
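The post above pairs two keyed RDDs with `a.join(b)`. As a minimal sketch of what that inner join computes, here is a pure-Python emulation on the same `pathA`/`pathB` data (the helper `rdd_join` is my own illustrative name, not a PySpark API; no Spark installation is assumed):

```python
from collections import defaultdict

def rdd_join(left, right):
    """Inner join of two lists of (key, value) pairs, like rdd.join():
    emits (key, (left_value, right_value)) for keys present in both."""
    right_index = defaultdict(list)
    for k, v in right:
        right_index[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]

pathA = [('a', 1), ('b', 1), ('c', 2), ('d', 3)]
pathB = [('c', 1), ('d', 3), ('e', 3), ('f', 4)]
print(rdd_join(pathA, pathB))  # only 'c' and 'd' appear in both inputs
```

Keys that occur only on one side ('a', 'b', 'e', 'f') are dropped, which is exactly why an inner join is used here; a `leftOuterJoin` would instead keep every key from `pathA`.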
Summary: union, intersection, subtract, cartesian. rdd1 = sc.parallelize([1,2,4,5,2,3]); rdd2 = sc.parallelize([4,6,5,7,8,6]); rdd1.union(rdd2).collect(): all elements of rdd1 and rdd… Read more
posted @ 2021-03-15 23:41 boye169
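The four set-style transformations named above can be sketched in plain Python on the post's sample data. This is an illustrative emulation, not Spark itself; it assumes the usual RDD semantics: `union` keeps duplicates, `intersection` deduplicates, `subtract` keeps left-side duplicates not present on the right, and `cartesian` emits every pair:

```python
from itertools import product

rdd1 = [1, 2, 4, 5, 2, 3]
rdd2 = [4, 6, 5, 7, 8, 6]

# union: simple concatenation, duplicates preserved
union = rdd1 + rdd2

# intersection: values present in both, deduplicated
intersection = sorted(set(rdd1) & set(rdd2))

# subtract: elements of rdd1 not found in rdd2 (duplicates on the left kept)
rdd2_set = set(rdd2)
subtract = [x for x in rdd1 if x not in rdd2_set]

# cartesian: every (x, y) pair, so len(rdd1) * len(rdd2) results
cartesian = list(product(rdd1, rdd2))
```

Note the asymmetry: `union` does not deduplicate (use `distinct()` afterwards in Spark if needed), while `intersection` does.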
Summary: 【Example】 from pyspark.sql import SparkSession; def split_line(line): try: return line.split(b"\t") except: pass; def map_partitions(partitions): for li… Read more
posted @ 2021-03-15 23:31 boye169
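The truncated snippet above outlines a `mapPartitions`-style pattern: a per-line splitter guarded by try/except, plus a generator that walks one partition at a time. A pure-Python sketch of that pattern follows (no Spark required; returning `None` on failure rather than a bare `except: pass` is my own choice so bad records can be filtered explicitly):

```python
def split_line(line):
    """Split one tab-separated bytes record; return None on malformed input."""
    try:
        return line.split(b"\t")
    except Exception:
        return None

def map_partitions(partition):
    """Generator over one partition, yielding only successfully split records.
    In Spark this would be passed to rdd.mapPartitions(map_partitions)."""
    for line in partition:
        fields = split_line(line)
        if fields is not None:
            yield fields

# 42 simulates a corrupt record: split_line returns None and it is skipped
partition = [b"a\t1", b"b\t2", 42]
print(list(map_partitions(partition)))
```

Using a generator keeps memory bounded: each partition is streamed record by record instead of being materialized as a list.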