
Run MapReduce Jobs on Spark


Interactive Mode

Follow the Quick Start guide for Spark 1.0.2 on the Apache Spark website.

# hadoop fs -put alarm_data_for_explore-0501-0505.txt alarm_data_for_explore-0501-0505.txt 

# hadoop fs -ls
...
-rw-r--r--   3 root root  260780698 2014-08-15 16:45 alarm_data_for_explore-0501-0505.txt

# wc -l alarm_data_for_explore-0501-0505.txt
1362005

# head -1 alarm_data_for_explore-0501-0505.txt 
-2117657102|102|1000012276|License...|3|4|102|

# grep 007-002-00-000592 alarm_data_for_explore-0501-0505.txt | wc -l
12

# mv alarm_data_for_explore-0501-0505.txt aaa.txt       // this local file is no longer needed

# spark-shell

scala> val textFile = sc.textFile("alarm_data_for_explore-0501-0505.txt")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> textFile.count()
...
res5: Long = 1362005

scala> textFile.first()
...
res8: String = -2117657102|102|1000012276|License...|3|4|102|

scala> textFile.filter(line => line.contains("007-002-00-000592")).count()
...
res12: Long = 12

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split("|").size).reduce((a, b) => Math.max(a, b))
...
res13: Int = 372
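
Note that Scala's String.split takes a regular expression, and a bare "|" is the regex alternation operator, which matches the empty string between every pair of characters. The result above (372) is therefore roughly the character length of the longest line, not its field count. To split on the literal pipe delimiter, escape it:

scala> textFile.map(line => line.split("\\|").size).reduce((a, b) => Math.max(a, b))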

Run Spark Script

$ cat wfp-spark
val MIN_SUP = 0.0003               // minimum support threshold
val MIN_CONF = 0                   // minimum confidence threshold
val MAX_RELATION_ORDER = 3         // maximum order of relations to consider
val DATA_FILE = "input"

val textFile = sc.textFile(DATA_FILE)
// pair each record with its weight (the 6th comma-separated field) and cache it
val weight = textFile.map(x => x -> x.split(",")(5).toFloat).cache
// group records by their first comma-separated field
val data = textFile.groupBy(x => x.split(",")(0))

$ spark-shell -i wfp-spark
...

Now you are in a Spark shell, and all variables defined in the script, such as MIN_SUP and MIN_CONF, are accessible.
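
For example, you can keep exploring the preloaded RDDs interactively (a minimal sketch; the actual output depends on your input file):

scala> data.count()          // number of distinct groups, keyed by the first field
scala> weight.take(2)        // peek at a couple of (record, weight) pairs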

Batch Mode

Follow "Running Spark Applications" in the CDH 5 Installation Guide.

# locate spark-defaults.conf       // find out where $SPARK_HOME is
/etc/spark/conf.cloudera.spark/spark-defaults.conf
...

# ll /etc/spark/conf.cloudera.spark
...
-rw-r--r-- 1 root root 883 Aug 14 10:12 spark-env.sh

# cat /etc/spark/conf.cloudera.spark/spark-env.sh
...
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark
...

# source /etc/spark/conf.cloudera.spark/spark-env.sh     // load $SPARK_HOME, etc

# spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn $SPARK_HOME/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0.jar 10
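
To submit your own application the same way, package it as a jar and pass it to spark-submit. A minimal sketch (the object name and argument handling below are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

object AlarmCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Alarm Count")
    val sc = new SparkContext(conf)
    // args(0): input file on HDFS, args(1): alarm ID to look for
    val n = sc.textFile(args(0)).filter(_.contains(args(1))).count()
    println("matching lines: " + n)
    sc.stop()
  }
}

Build it with sbt package, then submit the resulting jar with --class AlarmCount and the same --master/--deploy-mode flags as above.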

