From "Spark Cookbook" by Rishi Yadav. Chapter 5: "Word count using Streaming".

Step 1: Starting Input Server

On CentOS 6.8, start a listening server as follows:

sudo yum install nmap
ncat -l 8585

Step 2: Start Spark job

Run the following codes in Spark shell:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import StorageLevel._
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 8585, MEMORY_ONLY)
val wordsFlatMap = lines.flatMap(_.split(" "))
val wordsMap = wordsFlatMap.map(w => (w, 1))
val wordCount = wordsMap.reduceByKey((a, b) => (a + b))
wordCount.print
ssc.start

Now input some text in server window. After press enter, you can see the word counting result is printed in the Spark shell window.

Note

socketTextStream above creates an instance of SocketInputDStream which uses java.net.Socket, a client socket. So if you run Spark streaming job without starting a listening server, you will get a connection refused error. See 'Connection Refused' error while running Spark Streaming on local machine for details.
netcat for CentOS 6.8 has a bug, make it can't used as a listening server.

Spark Streaming Hello World

Step 1: Starting Input Server

Step 2: Start Spark job

Note

Published

Last Updated

Category

Tags

Contact