We have two log files: one contains the logs created on 2014.11.22, the other those created on 2014.11.23. Each holds about 240 million log entries and is roughly 30 GB in size.
The log lines in these files look like this:
"460015482006002","深圳市","广东省","2014-12-22 16:46:04","2014-12-22 16:46:04","42705","16111","460014270516111","MAP"
The number in the first column, "460015482006002", identifies a unique user; it is called the "IMSI". We need to find all IMSI numbers that appear only in the 2014.11.22 log.
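Each field is wrapped in double quotes, so a simple way to pull out the IMSI is to split a line on the quote character. A quick REPL check (using a shortened three-field sample of the line above) shows the resulting token layout, which the script below relies on:

val line = """"460015482006002","深圳市","广东省""""
val tokens = line.split("\"").toList
// tokens: List("", "460015482006002", ",", "深圳市", ",", "广东省")
// The IMSI sits at index 1; a complete 9-field log line yields 18 tokens.
println(tokens(1))  // prints 460015482006002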
Given the huge size of the log files, we do the computation with Apache Spark.
Create only22.scala:
// only22.scala -- run with spark-shell, which provides the SparkContext as `sc`.
val SIG_DATA1 = "datamine/drawdata_20141122_cs.utf8.csv"
val SIG_DATA2 = "datamine/drawdata_20141123_cs.utf8.csv"
// After splitting a line on the quote character, the IMSI is token 1.
val USER_ID_POS = 1
// Load one log file and return the distinct set of IMSIs it contains.
def get_users(data: String): Set[String] = {
  val raw_data = sc.textFile(data).map(_.split("\"").toList)
  // A complete 9-field line splits into 18 tokens; drop truncated lines.
  val raw_sig_map = raw_data.filter(_.size > 17)
  val users = raw_sig_map.map(_(USER_ID_POS)).distinct
  users.collect.toSet  // pull the distinct IDs back to the driver
}
val u1 = get_users(SIG_DATA1)
val u2 = get_users(SIG_DATA2)
val only22 = u1 -- u2  // IMSIs seen on 11.22 but not on 11.23
scala.tools.nsc.io.File("only22.txt").writeAll(only22.toString)
Run it: spark-shell --master spark://cloud142:7077 --driver-memory 6g --driver-cores 5 --total-executor-cores 28 --executor-memory 20g -i only22.scala
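Collecting both ID sets to the driver is what the large --driver-memory is for, but as a sketch of an alternative (untested here; the output path only22_out and the helper name get_user_rdd are made up for illustration), RDD.subtract would keep the set difference on the cluster and write one IMSI per line without any post-processing:

// Sketch: distributed set difference, assuming the same `sc`,
// SIG_DATA1/SIG_DATA2, and USER_ID_POS as in only22.scala.
def get_user_rdd(data: String) =
  sc.textFile(data)
    .map(_.split("\"").toList)
    .filter(_.size > 17)
    .map(_(USER_ID_POS))
    .distinct
val only22_rdd = get_user_rdd(SIG_DATA1).subtract(get_user_rdd(SIG_DATA2))
only22_rdd.saveAsTextFile("only22_out")  // writes a directory of part-* files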
In only22.txt, the whole set is written on a single line, since Scala's Set.toString produces Set(elem1, elem2, ...).
So we replace every "," with a newline character: sed -i 's/,/\n/g' only22.txt
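Alternatively, a one-line tweak to the script writes one IMSI per line in the first place, skipping the sed pass and the Set( ... ) wrapper and leading spaces it leaves behind:

scala.tools.nsc.io.File("only22.txt").writeAll(only22.mkString("\n"))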