In this environment, the Anaconda packages (pandas, numpy, etc.), IPython, and the PySpark packages (SparkSession, SparkContext, etc.) are all available, with the IPython REPL serving as the development environment.
Build the environment
- Download and install Miniconda, then create an Anaconda environment;
- Download and extract Spark 2.3.3 into $HOME/apps;
- Add the environment variables to ~/.zshenv:

export SPARK_HOME="$HOME/apps/spark-2.3.3-bin-hadoop2.7"
export PATH=$PATH:$SPARK_HOME/bin
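For reference, on a fresh machine these steps might look roughly like the following; the Miniconda install prefix $HOME/apps/miniconda3, the env name anaconda, and the anaconda metapackage are assumptions chosen to match the paths used later in this post:

# Install Miniconda (prefix assumed; adjust as needed)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/apps/miniconda3

# Create the Anaconda environment used below
conda create -n anaconda anaconda

# Download and extract Spark 2.3.3 into $HOME/apps
mkdir -p $HOME/apps
wget https://archive.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
tar -xzf spark-2.3.3-bin-hadoop2.7.tgz -C $HOME/apps

# After adding the exports above to ~/.zshenv and opening a new shell:
spark-submit --version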
To start an IPython REPL with PySpark and Anaconda, you have the following three options:
Option 1: Run PySpark with Anaconda (recommended). The pyspark launcher pre-creates the SparkSession (spark) and SparkContext (sc) for you:
export PYSPARK_PYTHON="$HOME/apps/miniconda3/envs/anaconda/bin/ipython"
conda activate anaconda
pyspark
Option 2: Run IPython with PySpark on the PYTHONPATH; here you create the SparkSession yourself:
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"
conda activate anaconda
ipython
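Unlike the pyspark launcher in option 1, plain ipython does not pre-create a SparkSession, so you build it yourself. A minimal smoke test for the PYTHONPATH wiring, run from the activated env (the app name is arbitrary):

python - <<'EOF'
from pyspark.sql import SparkSession

# The pyspark shell normally creates this for you; here we do it by hand.
spark = SparkSession.builder.master("local[*]").appName("pythonpath-check").getOrCreate()
print(spark.range(3).count())  # expect 3
spark.stop()
EOF

The same lines typed into the ipython prompt give you the familiar spark object for interactive work.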
Option 3: Put all dependencies, including PySpark itself and a JDK, into a self-contained conda env, so it does not depend on the Spark tarball above:
conda create -n portablePySpark python=3.5 ipython
conda install -n portablePySpark -c conda-forge pyspark
conda install -n portablePySpark -c conda-forge pudb
conda install -n portablePySpark -c cyclus java-jdk
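Before running the full demo below, a quick sanity check that the env-local JDK and the conda-forge PySpark are the ones being picked up (exact version strings will differ):

conda activate portablePySpark
which java        # should resolve inside .../envs/portablePySpark
python -c "import pyspark; print(pyspark.__version__)"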
Verify
Write a small PySpark script and run it inside the portablePySpark env:
cat << EOF > demo.py
from pyspark.sql.session import SparkSession
from pyspark.sql.types import FloatType
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
mylist = [1.0, 2.3, 3.4]
df = spark.createDataFrame(mylist, FloatType())
df.show()
EOF
conda activate portablePySpark
python demo.py
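If everything is wired up, the script prints a single value column with the three floats. Since pudb is installed in this env, the same script can also be stepped through interactively; recent pudb releases support python -m pudb (older ones use python -m pudb.run):

python -m pudb demo.py   # open demo.py in the pudb debugger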
Start Local PySpark Environment
To make startup repeatable, wrap options 1 and 2 in small helper scripts under $HOME/.local/bin.
For option 1:
cat << EOF > $HOME/.local/bin/pyspark_in_anaconda
#!/bin/zsh
export PYSPARK_PYTHON="$HOME/apps/miniconda3/envs/anaconda/bin/ipython"
conda activate anaconda
pyspark
EOF
Start a REPL with: . pyspark_in_anaconda
For option 2:
cat << EOF > $HOME/.local/bin/startPySparkEnv
#!/bin/zsh
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
conda activate anaconda
ipython
EOF
Start a REPL with: . startPySparkEnv