DarkMatter in Cyberspace

Run SparkR in RStudio


Set up the environment on a VM and run SparkR scripts:

  1. Start an Ubuntu 16.04 VM with Vagrant;

  2. Copy the RStudio Server deb installer, the JDK tarball and the Spark tarball (rstudio-server-1.1.383-amd64.deb, jdk-8u161-linux-x64.tar.gz and spark-2.2.0-bin-hadoop2.7.tgz) to the shared folder (/vagrant inside the VM);

  3. Install r-base and gdebi-core via apt, install RStudio Server with gdebi, then extract the JDK and Spark;

  4. Create a new user (for example, leo) to act as the RStudio user (root is not permitted to log in to RStudio);

  5. Log in to the VM as leo. Add JAVA_HOME and SPARK_HOME to ~/.Renviron; R reads this file at startup, so RStudio sessions see both variables;

  6. In ~/.profile, set JAVA_HOME and SPARK_HOME, and add $JAVA_HOME/bin and $SPARK_HOME/bin to PATH;

  7. Source ~/.profile and run spark-shell to verify the setup from the steps above;

  8. Log in to RStudio as leo and check with Sys.getenv() that SPARK_HOME is set correctly;

  9. Load the SparkR library and convert an existing R data frame to a Spark DataFrame.
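
The check in step 7 can be made a little more explicit before launching spark-shell. The following is a minimal sketch, assuming a POSIX shell; check_home is a hypothetical helper written for this post, not something shipped with the JDK or Spark. It only confirms that a variable is set and that the expected executable exists under its bin directory.

```shell
# check_home is a hypothetical helper (an assumption, not part of Spark or the JDK).
# Usage: check_home VARIABLE_NAME EXECUTABLE
check_home() {
    name=$1
    exe=$2
    # Read the value of the variable whose name is stored in $name.
    eval "dir=\${$name:-}"
    if [ -z "$dir" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -x "$dir/bin/$exe" ]; then
        echo "$name=$dir has no executable bin/$exe"
        return 1
    fi
    echo "$name OK: $dir"
}
```

Once ~/.profile has been sourced, running `check_home JAVA_HOME java && check_home SPARK_HOME spark-shell` should help narrow down which variable is wrong if spark-shell fails to start.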

Operations on the VM:

sudo su
apt update
apt install r-base gdebi-core
gdebi /vagrant/rstudio-server-1.1.383-amd64.deb
# create the RStudio user; quote the hash so the shell passes it as one argument
useradd -m -p "$(openssl passwd -1 mypwd)" leo

su - leo
mkdir ~/apps; cd ~/apps;
tar xf /vagrant/jdk-8u161-linux-x64.tar.gz
tar xf /vagrant/spark-2.2.0-bin-hadoop2.7.tgz

# the current directory is ~/apps, so name the target in the home directory explicitly
cat << EOF > ~/.Renviron
JAVA_HOME="/home/leo/apps/jdk1.8.0_161"
SPARK_HOME="/home/leo/apps/spark-2.2.0-bin-hadoop2.7"
EOF

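# quote the delimiter so $HOME, $JAVA_HOME and $PATH are expanded at login rather than now
cat << 'EOF' >> ~/.profile
export JAVA_HOME="$HOME/apps/jdk1.8.0_161"
export SPARK_HOME="$HOME/apps/spark-2.2.0-bin-hadoop2.7"
export PATH="$HOME/bin:$HOME/.local/bin:$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
EOF

# step 7: load the new variables and check that Spark starts
. ~/.profile
spark-shell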
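
One detail of the ~/.profile step is worth a closer look: whether the here-document delimiter is quoted decides when the variables inside it are expanded. With an unquoted delimiter the shell expands them while writing the file, at which point JAVA_HOME and SPARK_HOME are most likely still empty, so quoting the delimiter is the safer form. A small sketch of the difference, where DEMO_HOME and the temporary file are illustrative names only:

```shell
# Demo of here-document quoting; DEMO_HOME and demo_file are illustrative only.
DEMO_HOME=""                # stands in for a variable that is still empty
demo_file=$(mktemp)

# Unquoted delimiter: the shell expands $DEMO_HOME while writing the file.
cat << EOF > "$demo_file"
PATH="$DEMO_HOME/bin"
EOF
unquoted=$(cat "$demo_file")

# Quoted delimiter: the text is written literally and only expanded later,
# when the file is sourced.
cat << 'EOF' > "$demo_file"
PATH="$DEMO_HOME/bin"
EOF
quoted=$(cat "$demo_file")

rm -f "$demo_file"
echo "unquoted: $unquoted"   # prints: unquoted: PATH="/bin"
echo "quoted:   $quoted"     # prints: quoted:   PATH="$DEMO_HOME/bin"
```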

Log in to RStudio Server as leo in a browser (it listens on port 8787 by default) and run:

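Sys.getenv("SPARK_HOME")  # step 8: should print the path set in ~/.Renviron
# load SparkR from the Spark distribution
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# start a local session on all cores with 2 GB of driver memory
spark <- sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
# convert the built-in faithful data frame to a Spark DataFrame
df <- as.DataFrame(faithful)
head(df)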


Published

Mar 14, 2018

Last Updated

Mar 14, 2018

Category

Tech

Tags

  • RStudio
  • SparkR
  • Ubuntu
