Set up the environment on a VM and run SparkR scripts:
- Start an Ubuntu 16.04 VM with Vagrant (the vagrant commands follow this list);
- Copy the RStudio deb installer, the JDK and the Spark tarball (jdk-8u161-linux-x64.tar.gz and spark-2.2.0-bin-hadoop2.7.tgz) to the shared folder;
- Install r-base and gdebi via apt, install RStudio with gdebi, and extract the JDK and Spark;
- Create a new user (for example, leo) as the RStudio user (root is not permitted to log in to RStudio);
- Log in to the VM as leo and add JAVA_HOME and SPARK_HOME to ~/.Renviron;
- Add JAVA_HOME, SPARK_HOME, $JAVA_HOME/bin and $SPARK_HOME/bin to PATH in ~/.profile;
- Source ~/.profile and run spark-shell to verify the steps above;
- Log in to RStudio as leo and check that SPARK_HOME is set correctly with Sys.getenv();
- Load the SparkR library and convert an existing R data frame to a Spark DataFrame.
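
The checklist assumes the VM already exists. A minimal way to bring it up from the host directory holding the installers, assuming the standard ubuntu/xenial64 box (that directory is shared as /vagrant inside the VM by default):
vagrant init ubuntu/xenial64
# to reach RStudio Server from the host browser later, add this line to the
# generated Vagrantfile: config.vm.network "forwarded_port", guest: 8787, host: 8787
vagrant up
vagrant ssh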
Operations on the VM:
sudo su
apt update
apt install r-base gdebi-core
gdebi /vagrant/rstudio-server-1.1.383-amd64.deb
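# (extra check, not in the original notes) RStudio Server is started by the
# installer and listens on port 8787 by default
rstudio-server status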
useradd -m leo -p $(openssl passwd -1 mypwd)
su - leo
mkdir ~/apps && cd ~/apps
tar xf /vagrant/jdk-8u161-linux-x64.tar.gz
tar xf /vagrant/spark-2.2.0-bin-hadoop2.7.tgz
# Write ~/.Renviron explicitly: the cwd is still ~/apps, and RStudio's R
# sessions read ~/.Renviron (they do not source ~/.profile)
cat << EOF > ~/.Renviron
JAVA_HOME="/home/leo/apps/jdk1.8.0_161"
SPARK_HOME="/home/leo/apps/spark-2.2.0-bin-hadoop2.7"
EOF
# Quote the delimiter so $HOME, $JAVA_HOME and $SPARK_HOME land in ~/.profile
# literally and are expanded when the file is sourced; an unquoted heredoc
# would expand them now, while JAVA_HOME and SPARK_HOME are still empty
cat << 'EOF' >> ~/.profile
JAVA_HOME="$HOME/apps/jdk1.8.0_161"
SPARK_HOME="$HOME/apps/spark-2.2.0-bin-hadoop2.7"
PATH="$HOME/bin:$HOME/.local/bin:$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
export JAVA_HOME SPARK_HOME PATH
EOF
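
Then, as the checklist says, source the profile and make sure Java and Spark resolve:
source ~/.profile
java -version     # should report build 1.8.0_161
spark-shell       # a Spark 2.2.0 banner and scala> prompt confirm the setup; exit with :quit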
Log in to RStudio as leo in the browser (port 8787 by default) and run:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
spark <- sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
df <- as.DataFrame(faithful)
head(df)
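
Sys.getenv("SPARK_HOME") in the same session should echo the path set in ~/.Renviron. As an extra smoke test beyond the original steps, run a grouped count on the Spark DataFrame (SparkR's groupBy, summarize and n) and stop the session when done:
# Confirm the environment reached the R session
Sys.getenv("SPARK_HOME")

# A grouped count executes as a real Spark job, not just a local conversion
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

# Shut the Spark session down
sparkR.session.stop()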