Based on Writing an Hadoop MapReduce Program in Python.
- Create the mapper and reducer scripts and make them executable with chmod 755 *.py:
mapper.py:
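A minimal sketch of what a word-count mapper for Hadoop Streaming could look like. It is an illustration, not necessarily the original listing: the function name map_words is an assumption, and the code is written to run under both Python 2.6 and Python 3. It reads lines from standard input and emits one "word<TAB>1" line per whitespace-separated token.

```python
#!/usr/bin/env python
"""mapper.py (sketch): emit "word<TAB>1" for every token read from stdin."""
import sys


def map_words(lines):
    """Yield (word, 1) for each whitespace-separated token in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1


if __name__ == "__main__":
    for word, count in map_words(sys.stdin):
        # Hadoop Streaming treats the text before the first tab as the key.
        sys.stdout.write("%s\t%d\n" % (word, count))
```

The mapper does no aggregation itself; summing the counts per word is left to the reducer.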
reducer.py:
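A minimal sketch of a matching reducer, again an illustration rather than the original listing. The helper names parse and reduce_counts are assumptions. It relies on Hadoop Streaming delivering the mapper output sorted by key, so all lines for the same word arrive consecutively and can be summed with itertools.groupby.

```python
#!/usr/bin/env python
"""reducer.py (sketch): sum the counts of consecutive identical words on stdin."""
import sys
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep:
            continue  # no tab separator: not a mapper record
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number: ignore the line


def reduce_counts(lines):
    """Yield (word, total) for each run of identical words in sorted input."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reduce_counts(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, total))
```

Because groupby only merges adjacent keys, this reducer gives correct totals only on sorted input, which is what the shuffle phase provides.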
- Get the input text files and put them into HDFS: download the plain-text versions of The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson, The Notebooks of Leonardo Da Vinci, and Ulysses by James Joyce. Then upload them to HDFS:
$ hadoop fs -mkdir gutenberg
$ hadoop fs -put pg.txt gutenberg/
- Run the MapReduce job with Hadoop Streaming:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar \
    -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/gutenberg/ -output /user/hduser/gutenberg-output
Make sure the "gutenberg-output" directory does not already exist, otherwise the job will fail. When the job has finished, you can view the result with:
$ hadoop fs -ls gutenberg-output
$ hadoop fs -cat gutenberg-output/part-00000
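The raw output is one long list of words. A small sketch for picking out the most frequent ones, assuming the reducer wrote "word<TAB>count" lines. The script name top_words.py and the function top_words are hypothetical, introduced only for this illustration.

```python
#!/usr/bin/env python
"""top_words.py (hypothetical): print the most frequent words from job output."""
import heapq
import sys


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = []
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition("\t")
        # Keep only well-formed "word<TAB>count" records.
        if sep and count.isdigit():
            pairs.append((word, int(count)))
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    for word, count in top_words(sys.stdin):
        print("%s\t%d" % (word, count))
```

It could be fed directly from HDFS, for example: hadoop fs -cat gutenberg-output/part-00000 | python top_words.py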
Verified on CDH 4.3, running on eight 64-bit CentOS 6.3 hosts, with Python 2.6.6, on 2014-8-7.