DarkMatter in Cyberspace

Spark Streaming Hello World


From "Spark Cookbook" by Rishi Yadav. Chapter 5: "Word count using Streaming".

Step 1: Start the input server

On CentOS 6.8, install nmap (which provides ncat) and start a listening server on port 8585:

sudo yum install nmap
ncat -l 8585

Step 2: Start the Spark job

Run the following code in the Spark shell:

import org.apache.spark.storage.StorageLevel.MEMORY_ONLY
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Needed only on older Spark releases for pair operations such as reduceByKey
import org.apache.spark.streaming.StreamingContext._

// Reuse the shell's SparkContext (sc) with a 2-second batch interval
val ssc = new StreamingContext(sc, Seconds(2))
// Connect as a client to the server started in Step 1
val lines = ssc.socketTextStream("localhost", 8585, MEMORY_ONLY)
// Split each line into words
val wordsFlatMap = lines.flatMap(_.split(" "))
// Pair each word with an initial count of 1
val wordsMap = wordsFlatMap.map(w => (w, 1))
// Sum the counts per word within each batch
val wordCount = wordsMap.reduceByKey(_ + _)
// Print the first elements of each batch to the console
wordCount.print()
ssc.start()

Now type some text in the server (ncat) window. After you press Enter, the word counts are printed in the Spark shell window.
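
To make the per-batch result easier to picture, here is a minimal sketch in plain Scala collections, with no Spark involved. It mirrors the flatMap, map, and reduce-by-key steps on an in-memory batch. The object name BatchWordCount and the sample lines are hypothetical, and groupBy only stands in for what reduceByKey does across a cluster.

```scala
object BatchWordCount {
  // Mirrors the flatMap -> map -> reduceByKey pipeline on one in-memory batch.
  def count(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))   // split each line into words
      .map(w => (w, 1))        // pair each word with a count of 1
      .groupBy(_._1)           // group the pairs by word
      .map { case (w, pairs) => (w, pairs.map(_._2).reduce(_ + _)) } // sum per word

  def main(args: Array[String]): Unit = {
    // Hypothetical lines typed into the ncat window within one 2-second batch.
    val batch = Seq("hello spark", "hello streaming")
    count(batch).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a shell session, BatchWordCount.main(Array.empty) runs the sample. Each batch is counted independently, so a word typed in a later batch starts again from 1.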

Note

  • socketTextStream above creates a SocketInputDStream, which uses java.net.Socket, a client socket. So if you run the Spark Streaming job without first starting a listening server, you will get a connection refused error. See "'Connection Refused' error while running Spark Streaming on local machine" for details.

  • The netcat package for CentOS 6.8 has a bug that prevents it from being used as a listening server. Step 1 uses ncat from the nmap package instead.
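
The client-socket behaviour in the first note can be reproduced without Spark. The sketch below is an illustration under stated assumptions, not Spark's own receiver code: SocketDemo and canConnect are hypothetical names, and a java.net.ServerSocket on an OS-chosen port stands in for the ncat listener.

```scala
import java.net.{ConnectException, ServerSocket, Socket}
import scala.util.Try

object SocketDemo {
  // Open a client connection the way a socket receiver would.
  // Returns true if a listener accepted it, false if the connection was refused.
  def canConnect(host: String, port: Int): Boolean =
    Try(new Socket(host, port))
      .map { s => s.close(); true }
      .recover { case _: ConnectException => false }
      .get

  def main(args: Array[String]): Unit = {
    // Port 0 asks the OS for any free port, so the demo does not collide with 8585.
    val server = new ServerSocket(0)
    val port = server.getLocalPort
    println(s"listener up:   ${canConnect("localhost", port)}")
    server.close()
    println(s"listener down: ${canConnect("localhost", port)}")
  }
}
```

On a typical machine the first check should succeed and the second should fail with a refused connection, which is the same failure the streaming job reports when nothing is listening on port 8585.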



Published

Sep 14, 2017

Last Updated

Sep 14, 2017

Category

Tech

Tags

  • centos
  • spark
  • streaming
