DarkMatter in Cyberspace

PySpark Environment on Ubuntu 16.04


In this environment, Anaconda packages (pandas, NumPy, etc.), IPython, and PySpark packages (SparkSession, SparkContext, etc.) are all available, with the IPython REPL serving as the development environment.

Build the Environment

  1. Download and install Miniconda and create an Anaconda environment;

  2. Download and extract Spark 2.3.3 into $HOME/apps;

  3. Add the environment variables to ~/.zshenv:

export SPARK_HOME="$HOME/apps/spark-2.3.3-bin-hadoop2.7"
export PATH=$PATH:$SPARK_HOME/bin

To start an IPython REPL with PySpark and Anaconda, you have the following 3 options.

Option 1: Run PySpark with Anaconda (recommended):

export PYSPARK_PYTHON="$HOME/apps/miniconda3/envs/anaconda/bin/ipython"
conda activate anaconda
pyspark
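
The pyspark launcher builds a SparkSession and SparkContext for you and exposes them in the REPL as spark and sc, so no imports are needed. A sketch of what such an IPython-flavored session looks like:

```
In [1]: spark.range(5).count()
Out[1]: 5

In [2]: sc.parallelize([1, 2, 3]).sum()
Out[2]: 6
```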

Option 2: Run IPython with PySpark:

export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
conda activate anaconda
ipython
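
With PYTHONPATH set this way, pyspark becomes importable from a plain ipython session. A quick sanity check, runnable with the standard library alone (it only reports whether the package resolves, without starting Spark):

```python
import importlib.util

# Report whether the pyspark package is resolvable on the current
# PYTHONPATH; print a diagnostic instead of raising.
spec = importlib.util.find_spec("pyspark")
if spec is not None:
    print("pyspark importable from:", spec.origin)
else:
    print("pyspark NOT found; check SPARK_HOME and PYTHONPATH")
```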

Option 3: Put all dependencies in a conda env:

conda create -n portablePySpark python=3.5 ipython
conda install -n portablePySpark -c conda-forge pyspark
conda install -n portablePySpark -c conda-forge pudb
conda install -n portablePySpark -c cyclus java-jdk

Verify

cat << EOF > demo.py
from pyspark.sql.session import SparkSession
from pyspark.sql.types import FloatType

# Build a local SparkSession using all available cores
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# A single-column DataFrame; the column name defaults to "value"
mylist = [1.0, 2.3, 3.4]
df = spark.createDataFrame(mylist, FloatType())
df.show()
spark.stop()
EOF

conda activate portablePySpark
python demo.py
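
If the environment is wired up correctly, df.show() should print the single-column DataFrame, roughly like this:

```
+-----+
|value|
+-----+
|  1.0|
|  2.3|
|  3.4|
+-----+
```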

Start Local PySpark Environment

For option 1:

cat << 'EOF' > $HOME/.local/bin/pyspark_in_anaconda
#!/bin/zsh
export PYSPARK_PYTHON="$HOME/apps/miniconda3/envs/anaconda/bin/ipython"
conda activate anaconda
pyspark
EOF

Start a REPL with . pyspark_in_anaconda.

For option 2:

cat << 'EOF' > $HOME/.local/bin/startPySparkEnv
#!/bin/zsh
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
conda activate anaconda
ipython
EOF
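
A subtlety with wrapper scripts written via heredocs: an unquoted delimiter (cat << EOF) expands variables like $PYTHONPATH when the file is written, freezing their creation-time values into the script, while a quoted delimiter (cat << 'EOF') writes the text literally so expansion happens when the script is sourced. A minimal demonstration, using a throwaway GREETING variable and /tmp paths purely for illustration:

```shell
#!/bin/sh
GREETING="hello"

# Unquoted delimiter: $GREETING is expanded while writing the file
cat << EOF > /tmp/expanded.txt
$GREETING
EOF

# Quoted delimiter: the text is written out literally
cat << 'EOF' > /tmp/literal.txt
$GREETING
EOF

cat /tmp/expanded.txt   # hello
cat /tmp/literal.txt    # $GREETING
```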

Start a REPL with . startPySparkEnv.



Published

Jan 3, 2018

Last Updated

Feb 21, 2019

Category

Tech

Tags

  • pyspark
  • ubuntu
