
Interactive Programming for Spark


Interactive programming, IAP for short, is an upgraded version of the REPL workflow, one that satisfies the Immediate Feedback Principle.

Synonyms: on-the-fly programming, just-in-time programming, conversational programming.

With IAP, every time the programmer saves the file being worked on, the toolchain provides on-the-fly feedback that helps the ongoing development.

Each language's IAP is implemented on a specific toolchain (usually written in the same language): for example, ghcid for Haskell, RStudio for R, Spyder for Python, and sbt for Scala.

Unit Test

with FunSuite

Based on FunSuite from ScalaTest and spark-testing-base, we can write a unit test for Spark with a SparkContext:

$ take sparkdemo
$ mkdir -p src/{main/{resources,scala},test/{resources,scala}}
$ cat << EOF > src/test/scala/Test1.scala
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext

class Test1 extends FunSuite with SharedSparkContext {
  test("test with SparkContext") {
    val list = List(1, 2, 3, 4)
    val rdd = sc.parallelize(list)

    assert(rdd.count === list.length)
  }
}
EOF

$ cat << 'EOF' > build.sbt
import scala.sys.process._
name := "Simple Project"
version := "0.0.1"
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test"

fork in Test := true
parallelExecution in Test := false
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")

lazy val submit = taskKey[Unit]("submit to Spark")
submit := {"spark-submit --class SimpleApp target/scala-2.11/simple-project_2.11-0.0.1.jar" !}

addCommandAlias("rs", "; clean; package; submit")
EOF

$ sbt
sbt:Simple Project> ~ test

spark-testing-base provides SharedSparkContext, so there's no need to initialize and destroy the SparkContext manually.
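
For comparison, without SharedSparkContext the suite would have to manage the context itself. A minimal sketch using ScalaTest's BeforeAndAfterAll (class name, master setting and app name here are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ManualContextTest extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    // one local context shared by all tests in this suite
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("manual-context-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()  // release resources before the next suite starts
    super.afterAll()
  }

  test("test with a manually managed SparkContext") {
    val list = List(1, 2, 3, 4)
    assert(sc.parallelize(list).count === list.length)
  }
}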

spark-testing-base also has many other useful traits, such as DataFrameGenerator. With it and ScalaCheck, you can implement property-based tests, like QuickCheck in Haskell.

See its wiki for the full list.
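
As a rough illustration of the idea, here is a minimal property-based test that uses plain ScalaCheck generators together with the shared SparkContext; the DataFrame generators from spark-testing-base follow the same pattern. This sketch assumes ScalaTest's Checkers trait and ScalaCheck are on the test classpath (spark-testing-base normally pulls ScalaCheck in; otherwise add it to libraryDependencies):

import org.scalacheck.Gen
import org.scalacheck.Prop.forAll
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers
import com.holdenkarau.spark.testing.SharedSparkContext

class PropTest extends FunSuite with SharedSparkContext with Checkers {
  test("counting a parallelized list preserves its size") {
    // for any generated list of ints, parallelizing it must not change the element count
    check(forAll(Gen.listOf(Gen.choose(-100, 100))) { xs =>
      sc.parallelize(xs).count() == xs.length.toLong
    })
  }
}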

Integration Test

You can run a Spark application line by line in spark-shell or the pyspark shell. The following setup is file based instead: in each iteration of the loop, a complete Spark job (a jar file) is submitted to the Spark cluster, which returns a result.

With the scaffold above, add a source file and run the integration test with the user-defined command rs (run spark):

$ cat << 'EOF' > src/main/scala/SimpleApp.scala
import org.apache.spark.sql.SparkSession
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/leo/apps/spark-2.2.0-bin-hadoop2.7/README.md"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
EOF

$ sbt
sbt:Simple Project> ~ rs

Now change the text in SimpleApp.scala and save it; sbt will clean the build, regenerate the jar file, and submit it to Spark.

P.S.: if the contents of build.sbt change, run reload in the sbt console.

Ref:

  • FunSuite


