Verifying that PySpark submit parameters specifying the Python environment take effect
1. Background: users need to be able to use their own Python environment on our built-in workflow-based submission platform.
2. The submit command our mid-platform page executes by default is as follows:
/opt/apps/ali/spark-3.5.2-bin-hadoop3-scala2.13/bin/spark-submit
--master yarn --deploy-mode cluster --name print.py_6 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python3.6/python3.6/bin/python --archives hdfs:///ali/ai/python3.6.zip#python3.6 --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python3.6/python3.6/bin/python --executor-cores 2 --executor-memory 8g file:/opt/apps/ali/print.py
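To check which interpreter a job actually runs with, a script along the following lines can be submitted. This is a minimal sketch of an assumed stand-in for print.py (the real script's contents are not shown here); it reports the interpreter used by the driver and by each executor:

    # verify_env.py - assumed stand-in for print.py
    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("verify_python_env").getOrCreate()
    sc = spark.sparkContext

    # In cluster mode the driver runs inside the YARN ApplicationMaster container,
    # so this reflects spark.yarn.appMasterEnv.PYSPARK_PYTHON / spark.pyspark.python.
    print("driver python: %s (%s)" % (sys.executable, sys.version.split()[0]))

    def report(_):
        import sys
        yield "%s (%s)" % (sys.executable, sys.version.split()[0])

    # Run one tiny task per partition and collect each executor's answer.
    for line in set(sc.parallelize([0, 1], 2).mapPartitions(report).collect()):
        print("executor python: " + line)

    spark.stop()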
3. Parameters added by the user at submission:
spark.yarn.dist.archives="hdfs://ali/testpysaprk/dns_fenxi.tar.gz#pyenv";spark.executorEnv.PYTHONPATH=pyenv/lib/python3.10/site-packages; spark.pyspark.python=pyenv/python3.10/bin/python3.10
By default, our platform appends these user-configured parameters to the submit command:
/opt/apps/ali/spark-3.5.2-bin-hadoop3-scala2.13/bin/spark-submit
--master yarn --deploy-mode cluster --name print.py_6 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python3.6/python3.6/bin/python --archives hdfs:///ali/ai/python3.6.zip#python3.6 --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python3.6/python3.6/bin/python --executor-cores 2 --executor-memory 8g file:/opt/apps/ali/print.py spark.yarn.dist.archives="hdfs://ali/testpysaprk/dns_fenxi.tar.gz#pyenv";spark.executorEnv.PYTHONPATH=pyenv/lib/python3.10/site-packages; spark.pyspark.python=pyenv/python3.10/bin/python3.10
The job fails with the following error:
submit-spark: Exception in thread "main" java.io.FileNotFoundException: File file:/apps/"/opt/apps/dns_fenxi.tar.gz#pyenv" does not exist
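The quotes are the culprit: the platform assembles the spark-submit arguments programmatically, so no interactive shell ever strips the double quotes, and they survive as literal characters in the archive path. A sketch of safer handling, assuming the user's settings arrive as the semicolon-separated string shown above (build_conf_args is a hypothetical helper, not part of any real API):

    # Hypothetical helper: split the user's "k=v;k=v" string into --conf flags
    # and strip any surrounding quotes from the values.
    def build_conf_args(user_confs):
        args = []
        for item in user_confs.split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition("=")
            # Quotes that an interactive shell would remove stay verbatim in a
            # programmatic argv, so drop them here.
            args += ["--conf", "%s=%s" % (key.strip(), value.strip().strip('"\''))]
        return args

    # These flags must be placed before the application .py file: everything
    # after it is passed to the application as ordinary arguments, not parsed
    # by spark-submit.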
4. Problem found: change the submit command by removing the double quotes ("") from it
spark.yarn.dist.archives=hdfs://everdc/mzqtestpysaprk/dns_det.tar.gz#pyenv;spark.executorEnv.PYTHONPATH=./pyenv/dns_det/bin/python3.10/site-packages; spark.pyspark.python=./pyenv/dns_det/bin/python3.10
The submission succeeds and the job runs normally. (On YARN, --archives and spark.yarn.dist.archives feed the same distribution mechanism: each archive is unpacked into the container's working directory under the alias after the "#", which is where the relative ./pyenv/... and ./python3.6/... paths resolve.) The full working command:
opt/apps/spark_ali/bin/spark-submit --master yarn --deploy-mode cluster --name testprint.py_237 --conf spark.yarn.submit.waitAppCompletion=false --principal hdfs/ali14@ali.COM --keytab /opt/apps/ali_cluster_file/tickets/215/keytab --conf spark.pyspark.python=./pyenv/dns_det/bin/python3.10 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python3.6/python3.6/bin/python --conf spark.executorEnv.PYTHONPATH=./pyenv/dns_det/bin/python3.10/site-packages --archives /opt/apps/python3.6.zip#python3.6 --driver-memory 8g --conf spark.default.parallelism=10 --num-executors 1 --conf spark.yarn.dist.archives="hdfs://ali/mzqtestpysaprk/dns_det.tar.gz#pyenv" --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python3.6/python3.6/bin/python --executor-cores 2 --executor-memory 8g --queue root.default file:/opt/apps/resource/testprint.py
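Since this command now carries overlapping Python settings, the effective configuration can be dumped from inside the job to see which values actually reached the application (a minimal sketch):

    # Print every effective Spark setting that mentions python, e.g.
    # spark.pyspark.python, spark.yarn.appMasterEnv.PYSPARK_PYTHON, ...
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        if "python" in key.lower():
            print(key, "=", value)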
5. Priority among the Spark parameters that specify the Python environment
My submit command contained both the platform's built-in Python 3.6 environment and the user-submitted Python 3.10 environment; running an interpreter-check script (like the one sketched in section 2) showed that the user's environment was the one actually in effect.
Comparing the settings shows that the spark.pyspark.python configuration has the highest priority.
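This matches the resolution order in Spark's own source: on the driver, PythonRunner tries spark.pyspark.driver.python, then spark.pyspark.python, then the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON environment variables (the latter is what spark.yarn.appMasterEnv.PYSPARK_PYTHON ends up setting in the AM container), and the executors then follow the interpreter the driver resolved — which is why the user's spark.pyspark.python won over both the appMasterEnv and executorEnv settings here. Paraphrased as a sketch, not a verbatim copy of the Scala source:

    # Driver-side interpreter resolution, paraphrased from Spark's PythonRunner:
    def resolve_driver_python(conf, env):
        return (conf.get("spark.pyspark.driver.python")
                or conf.get("spark.pyspark.python")
                or env.get("PYSPARK_DRIVER_PYTHON")
                or env.get("PYSPARK_PYTHON")
                or "python3")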