spark-submit python egg 解决三方件依赖问题

bonelee 2024-10-21 20:23:35 原文

假设spark里用到了purl这个三方件，https://github.com/ultrabluewolf/p.url，他还额外依赖futures这个三方件（six的话，anaconda2自带）。

pyspark 代码如下：

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My test App")

sc = SparkContext(conf=conf)

#from purl import Purl

def get_purl(x):

    from purl import Purl

    url = Purl('https://github.com/search?q={}'.format(x))

    return str(url.add_query('name', 'dog'))

int_rdd = sc.parallelize([1, 2, 3, 4])

r =int_rdd.map(lambda x: get_purl(x))

print(r.collect())

下面说明如何编译打包egg。

通过https://pypi.org/project/p.url/#files 下载源码。然后解压：

python setup.py bdist_egg

在dist目录下可以看到有egg文件生成。

同理，下载https://pypi.org/project/future/#files futures的源码，然后解压生成egg文件。

最终运行：

spark-submit --py-files p.url-0.1.0a4-py2.7.egg,future-0.17.1-py2.7.egg main_dep.py

结果输出：

['https://github.com/search?q=1&name=dog', 'https://github.com/search?q=2&name=dog', 'https://github.com/search?q=3&name=dog', 'https://github.com/search?q=4&name=dog']

补充官方文档，比较蛋疼，没有说具体操作：

Complex Dependencies

Some operations rely on complex packages that also have many dependencies. For example, the following code snippet imports the Python pandas data analysis library:

def import_pandas(x):

 import pandas

 return x

int_rdd = sc.parallelize([1, 2, 3, 4])

int_rdd.map(lambda x: import_pandas(x))

int_rdd.collect()

pandas depends on NumPy, SciPy, and many other packages. Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors.

Limitations of Distributing Egg Files

In both self-contained and complex dependency scenarios, sending egg files is problematic because packages that contain native code must be compiled for the specific host on which it will run. When doing distributed computing with industry-standard hardware, you must assume is that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.

spark-submit python egg 解决三方件依赖问题的更多相关文章

[Dynamic Language] pyspark Python3.7环境设置及py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe解决!
pyspark Python3.7环境设置及py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spa ...
spark submit参数及调优(转载)
spark submit参数介绍你可以通过spark-submit --help或者spark-shell --help来查看这些参数. 使用格式: ./bin/spark-submit \ -- ...
windows命令行模式下无法打开python程序解决方法
今天刚开始学Python,首先编写一个简单地hello world程序,想在命令行模式运行,结果出现下面: 经过一番思考,发现用cd命令可以解决这件事,看下图: 这样就解决了.
【原创】大数据基础之Spark（1）Spark Submit即Spark任务提交过程
Spark2.1.1 一 Spark Submit本地解析 1.1 现象提交命令: spark-submit --master local[10] --driver-memory 30g --cla ...
spark编程python实例
spark编程python实例 ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PyS ...
HarmonyOS三方件开发指南（12）——cropper图片裁剪
鸿蒙入门指南,小白速来!0基础学习路线分享,高效学习方法,重点答疑解惑--->[课程入口] 目录:1. cropper组件功能介绍2. cropper使用方法3. cropper组件开发实现4. ...
HarmonyOS三方件开发指南(13)-SwipeLayout侧滑删除
鸿蒙入门指南,小白速来!0基础学习路线分享,高效学习方法,重点答疑解惑--->[课程入口] 目录:1. SwipeLayout组件功能介绍2. SwipeLayout使用方法3. SwipeLa ...
HarmonyOS三方件开发指南(14)-Glide组件功能介绍
<HarmonyOS三方件开发指南>系列文章合集引言在实际应用开发中,会用到大量图片处理,如:网络图片.本地图片.应用资源.二进制流.Uri对象等,虽然官方提供了PixelMap进行图 ...
HarmonyOS三方件开发指南(15)-LoadingView功能介绍
目录: 1. LoadingView组件功能介绍2. Lottie使用方法3. Lottie开发实现4.<HarmonyOS三方件开发指南>系列文章合集 1. LoadingView组件功 ...

随机推荐

【Python开发】PyInstaller打包Python程序
PyInstaller是一个能将Python程序转换成单个可执行文件的程序, 操作系统支持Windows, Linux, Mac OS X, Solaris和AIX.并且很多包都支持开箱即用,不依赖环 ...
为文献管理软件Mendeley设置代理
Mendeley由于某些原因无法在线同步,需要fq,在tools->option->connection中可以设置http代理或者sock5代理, sock5可以使用shadowsocks ...
IDEA 自定义代码模板
IDEA 自定义代码模板操作步骤:
Nginx 配置 HTTP 跳转 HTTPS-Linux运维日志
本文介绍 Nginx 访问 HTTP 跳转 HTTPS 的 4 种配置方式. rewrite Nginx rewrite 有四种 flag: break:在一个请求处理过程中将原来的 url 改写之后 ...
mysql疑问
翻译 API
Request http://fy.iciba.com/ajax.php?a=fy&f=auto&t=auto&w=love Pre 英译汉 Request http://fy ...
docker-compose的一些服务一直是restarting
1.查看日志 docker logs jenkins(镜像名字) 1.1 可能权限问题 1.2可能内存问题
【layui】layer.photos 相册层动态生成Img 中出现的问题的解决方案
layui版本:2.5.5 参照文档:https://www.jianshu.com/p/c594811fa882 他的3.8的解决方案有一些调整因为发现他的解决方式有些繁琐而最新的2.5.5版本中有 ...
drf安装与APIView初步分析
drf安装 1. pip install djangorestframework 2. 在settings文件中注册app : INSTALLED_APPS = [..., 'rest_framewo ...
a标签中target属性为“_blank”时存在安全问题
今天看到一个比较有意思的洞,虽然不够严重,但是却普遍存在各大src中熟悉js的朋友都应该知道当我们在调用window下的open方法创建一个新窗口的同时,我们可以获得一个创建窗口的opener句柄, ...