PySpark environment setup and related operations with Redis and Elasticsearch
PySpark documentation: https://spark.apache.org/docs/latest/api/python/index.html
I. Environment setup
Reference (official docs): https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
1. Create a virtual environment with the Python version you need.
2. Activate the virtual environment and install the Python packages your job depends on.
3. Pack the entire virtual environment into a single archive (.zip or .tar.gz).
4. Upload the archive to HDFS.
5. Point spark-submit at the packaged Python environment (a command sketch follows below).
See also: https://dandelioncloud.cn/article/details/1589470996832964609
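A minimal end-to-end sketch of steps 1-5, following the packaging guide linked above and using venv-pack to build a relocatable archive. The archive name pyspark_venv.tar.gz, the HDFS path, and app.py are placeholders, not values from this article:

# steps 1-2: create the virtual environment and install dependencies
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip3 install pyspark venv-pack redis-py-cluster
# step 3: pack the whole environment into one archive
venv-pack -o pyspark_venv.tar.gz
# step 4: upload the archive to HDFS
hdfs dfs -put pyspark_venv.tar.gz /user/spark/envs/
# step 5: ship the archive with the job and point the executors' Python at it
export PYSPARK_DRIVER_PYTHON=python   # do not set this in cluster deploy mode
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives hdfs:///user/spark/envs/pyspark_venv.tar.gz#environment app.py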
II. Writing data from PySpark to Redis
First, install the Redis client packages into the virtual environment created above, then run the job:
pip3 install pyspark
pip3 install redis-py-cluster==2.1.3 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

from pyspark.sql import SparkSession
from rediscluster import RedisCluster

# startup nodes of the Redis cluster
redis_hosts = [
    {"host": "192.168.2.150", "port": 6379},
    {"host": "192.168.2.150", "port": 6380},
    {"host": "192.168.2.150", "port": 6381},
    {"host": "192.168.2.150", "port": 6382},
]

def write_hive2redis(df):
    # turn each Row into a plain dict so the partition function only handles dicts
    json_rdd = df.rdd.map(lambda row: row.asDict())

    def write2redis(partition_data_list):
        # open one connection per partition, on the executor side
        redis_conn = RedisCluster(startup_nodes=redis_hosts, password="", decode_responses=True)
        for dic in partition_data_list:
            id = dic.get("id")
            name = dic.get("name")
            # write the record to Redis
            redis_conn.set(id, name)

    json_rdd.foreachPartition(write2redis)

if __name__ == '__main__':
    # create a SparkSession with Hive support
    spark = SparkSession.builder.appName("HiveExample").enableHiveSupport().getOrCreate()
    sql = "select * from table1"
    df = spark.sql(sql)
    write_hive2redis(df)
    spark.stop()
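Note that the RedisCluster connection is created inside write2redis rather than on the driver: a Redis connection cannot be serialized and shipped to executors, so each partition opens and uses its own connection.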
Alternatively, the Redis writes can be batched like this:
import functools
import redis

def to_redis(part, batch=500):
    # one standalone-Redis connection per partition
    redis_pool = redis.ConnectionPool(host='127.0.0.1', port=26379, db=10, password='password')
    redis_cli = redis.StrictRedis(connection_pool=redis_pool)
    cnt = 0
    pipeline = redis_cli.pipeline()
    for row in part:
        pipeline.set(row.name, "\t".join([row.name, row.source, row.end_format]))
        cnt += 1
        # flush the pipeline every `batch` commands
        if cnt > 0 and cnt % batch == 0:
            pipeline.execute()
    # flush whatever is left over
    if cnt % batch != 0:
        pipeline.execute()
    pipeline.close()
    redis_cli.close()

# data_schema: a StructType describing the name/source/end_format columns (defined elsewhere)
sdf = spark.read.csv("/home/testuser/data/csv/", schema=data_schema, header=False, sep="\t")
sdf.show()
# write each partition to Redis with the custom batched writer and format
sdf.foreachPartition(functools.partial(to_redis, batch=500))
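Batching through a pipeline groups many SET commands into far fewer network round-trips, which is usually much faster than issuing one set() per row when partitions are large; the batch size of 500 is just the default used here and can be tuned.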
III. Writing data from PySpark to Elasticsearch
Reference: https://www.jianshu.com/p/3ccd902f0a03
First download the connector jar: elasticsearch-spark-20_2.11-7.6.2.jar
When submitting the job, pass it along: spark-submit --jars elasticsearch-spark-20_2.11-7.6.2.jar
Or, if the cluster has internet access, let Spark resolve it from Maven with --packages org.elasticsearch:elasticsearch-spark-20_2.11:7.6.2 (note that --packages takes Maven coordinates, not a jar file name). For example:
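Both submit variants, sketched; es_job.py is a placeholder script name:

# ship a locally downloaded copy of the connector jar
spark-submit --jars elasticsearch-spark-20_2.11-7.6.2.jar es_job.py
# or let Spark pull the connector from Maven Central (requires network access)
spark-submit --packages org.elasticsearch:elasticsearch-spark-20_2.11:7.6.2 es_job.py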
from pyspark.sql import SparkSession

# es_url, es_port, es_user, es_pass hold your cluster's connection settings
spark = SparkSession.builder.getOrCreate()

# read from ES
df = spark.read.format('org.elasticsearch.spark.sql')\
    .option("spark.es.nodes", es_url)\
    .option("spark.es.port", es_port)\
    .option("es.net.http.auth.user", es_user)\
    .option("es.net.http.auth.pass", es_pass)\
    .option("es.nodes.wan.only", "true")\
    .option('es.resource', 'cancer_example/_doc')\
    .load()

# write to ES, upserting on the "id" column
df.write.format('org.elasticsearch.spark.sql')\
    .option("spark.es.nodes", es_url)\
    .option("spark.es.port", es_port)\
    .option("es.net.http.auth.user", es_user)\
    .option("es.net.http.auth.pass", es_pass)\
    .option("es.mapping.id", "id")\
    .option("es.nodes.wan.only", "true")\
    .option("es.write.operation", "upsert")\
    .option('es.resource', 'cancer_example/_doc')\
    .mode("append")\
    .save()
Original article: https://blog.csdn.net/nanfeizhenkuangou/article/details/121894010
IV. spark-submit parameters in detail
Reference: https://blog.csdn.net/XnCSD/article/details/100586224
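For quick reference, a typical submit command with the most commonly tuned parameters; the resource numbers below are illustrative only, not recommendations:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my_app \
  --driver-memory 2g \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.sql.shuffle.partitions=200 \
  main.py arg1 arg2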
Questions are welcome; you can reach the author on WeChat (18179641802) to discuss.