From "Spark Cookbook" by Rishi Yadav. Chapter 5: "Word count using Streaming".
Step 1: Start the input server
On CentOS 6.8, start a listening server as follows:
sudo yum install nmap
ncat -l 8585
Step 2: Start the Spark job
Run the following code in the Spark shell:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Needed for reduceByKey on older Spark releases; harmless on newer ones
import org.apache.spark.streaming.StreamingContext._

// Reuse the shell's SparkContext (sc) and process input in 2-second batches
val ssc = new StreamingContext(sc, Seconds(2))
// Connect as a client to the server started in Step 1
val lines = ssc.socketTextStream("localhost", 8585, StorageLevel.MEMORY_ONLY)
// Split each line into words
val wordsFlatMap = lines.flatMap(_.split(" "))
// Pair each word with a count of 1
val wordsMap = wordsFlatMap.map(w => (w, 1))
// Sum the counts per word within each batch
val wordCount = wordsMap.reduceByKey((a, b) => a + b)
// Print the leading results of each batch to the console
wordCount.print()
ssc.start()
Now type some text in the server window. After you press Enter, the word count result is printed in the Spark shell window.
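To see what each 2-second batch computes, the same three steps can be tried on an ordinary Scala collection. This is only a sketch: the sample lines are made up, and groupBy plus a sum stands in for reduceByKey, which is not available on plain collections.

```scala
// Hypothetical batch: the lines typed into the ncat window during one 2-second interval
val batch = Seq("to be or not to be", "to stream")

// Same steps as the streaming job, applied to a local collection
val words = batch.flatMap(_.split(" "))
val pairs = words.map(w => (w, 1))

// Local stand-in for reduceByKey: group the pairs by word and sum their counts
val counts = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

// counts contains: to -> 3, be -> 2, or -> 1, not -> 1, stream -> 1 (order may vary)
counts.foreach(println)
```

In the streaming job the counts are computed per batch, so a word typed in two different intervals is counted separately in each.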
Note
- socketTextStream above creates an instance of SocketInputDStream, which uses java.net.Socket, a client socket. If you run the Spark Streaming job without first starting a listening server, you will get a "connection refused" error. See "'Connection Refused' error while running Spark Streaming on local machine" for details.
- netcat on CentOS 6.8 has a bug that prevents it from being used as a listening server, hence the use of ncat in Step 1.
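Because the receiver is a plain client socket, the "connection refused" case can be reproduced without Spark. The following is a minimal sketch; isListening is a hypothetical helper, not part of Spark, and the 1-second timeout is an arbitrary choice. Note that the probe opens a real connection, so it is only suitable when the server accepts more than one connection (with ncat, that assumes it was started with --keep-open); otherwise the server may exit once the probe disconnects.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper (not part of Spark): true if a server accepts TCP connections on host:port
def isListening(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
  val socket = new Socket()
  try {
    // connect throws (for example ConnectException) when nothing is listening
    Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
  } finally {
    socket.close()
  }
}

// Check the port used in Step 2 before calling ssc.start
if (!isListening("localhost", 8585)) {
  println("Nothing is listening on localhost:8585; start the input server first")
}
```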