spark-submit with Python eggs: solving third-party dependency problems
Suppose a Spark job uses the third-party package purl (https://github.com/ultrabluewolf/p.url), which in turn depends on the third-party package future (it also needs six, but that already ships with Anaconda2).
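If you are unsure what a package pulls in, its declared dependencies can be read from the pip metadata (a quick sanity check, assuming p.url is installed in the local environment):

pip show p.url    # the "Requires:" line lists the direct dependencies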
The PySpark code is as follows (note that Purl is imported inside the function, so the import runs in the executor processes when the task executes):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My test App")
sc = SparkContext(conf=conf)

#from purl import Purl

def get_purl(x):
    # import inside the function so it also happens on the executors
    from purl import Purl
    url = Purl('https://github.com/search?q={}'.format(x))
    return str(url.add_query('name', 'dog'))

int_rdd = sc.parallelize([1, 2, 3, 4])
r = int_rdd.map(lambda x: get_purl(x))
print(r.collect())
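Running the script without shipping the dependencies makes the problem visible (a sketch; this assumes purl is not installed on the machine, and the exact traceback text may differ):

spark-submit main_dep.py    # expected to fail with an ImportError for purl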
The following shows how to build and package the eggs.
Download the source from https://pypi.org/project/p.url/#files and extract it. Then, inside the extracted directory, run:
python setup.py bdist_egg
An egg file is then generated under the dist directory.
Similarly, download the source of future from https://pypi.org/project/future/#files, extract it, and generate its egg file the same way.
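Putting both builds together, a minimal sketch (the tarball names are assumptions inferred from the egg versions used in the spark-submit command below):

tar xzf p.url-0.1.0a4.tar.gz && cd p.url-0.1.0a4
python setup.py bdist_egg               # produces dist/p.url-0.1.0a4-py2.7.egg
cd ..
tar xzf future-0.17.1.tar.gz && cd future-0.17.1
python setup.py bdist_egg               # produces dist/future-0.17.1-py2.7.egg
cd ..
unzip -l p.url-0.1.0a4/dist/p.url-0.1.0a4-py2.7.egg    # an egg is a zip archive, so its contents can be inspected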
Finally, run the job (multiple dependencies are passed to --py-files as a comma-separated list):
spark-submit --py-files p.url-0.1.0a4-py2.7.egg,future-0.17.1-py2.7.egg main_dep.py
The output is:
['https://github.com/search?q=1&name=dog', 'https://github.com/search?q=2&name=dog', 'https://github.com/search?q=3&name=dog', 'https://github.com/search?q=4&name=dog']
To supplement this, here is the official documentation, which is rather frustrating in that it does not describe the concrete steps:
Complex Dependencies
Some operations rely on complex packages that also have many dependencies. For example, the following code snippet imports the Python pandas data analysis library:
def import_pandas(x):
    import pandas
    return x

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x))
int_rdd.collect()
pandas depends on NumPy, SciPy, and many other packages. Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors.
Limitations of Distributing Egg Files
In both self-contained and complex dependency scenarios, sending egg files is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
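The documentation stops short of showing how to "specify the path to the Python binaries". A minimal sketch, assuming the required packages are pre-installed under the same interpreter path on every worker host (the /opt/anaconda2 path is an assumption):

export PYSPARK_PYTHON=/opt/anaconda2/bin/python    # interpreter the workers should use (assumed path)
spark-submit main_dep.py

# on Spark 2.1+ the same thing can be set per job:
spark-submit --conf spark.pyspark.python=/opt/anaconda2/bin/python main_dep.py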