Step1:
/home/training/map.py
#!/usr/bin/python
import sys

# Read lines from standard input and emit one "word<TAB>1" pair per word.
for myline in sys.stdin:
    myline = myline.strip()
    words = myline.split()
    for myword in words:
        print('%s\t%s' % (myword, 1))
Step2:
/home/training/red.py
#!/usr/bin/python
from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Hadoop sorts the mapper output by key, so equal words arrive on adjacent lines.
for myline in sys.stdin:
    myline = myline.strip()
    word, count = myline.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
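
The reducer above tracks the current word by hand. As a rough alternative sketch (not the script used in this post), the same per-word summing can be written with itertools.groupby, which also relies on the input being sorted by word. The names parse and reduce_counts are made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, _, count = line.strip().partition('\t')
        try:
            yield word, int(count)
        except ValueError:
            # Count was not a number, so ignore this line.
            continue


def reduce_counts(lines):
    """Sum the counts of each run of equal words (assumes sorted input)."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Hypothetical sample of sorted mapper output.
sample = ['dd\t1', 'rr\t1', 'rr\t1', 'tt\t1', 'ww\t1', 'ww\t1']
result = list(reduce_counts(sample))
for word, total in result:
    print('%s\t%s' % (word, total))
```

groupby only merges adjacent equal keys, so if the lines were not sorted the same word would be reported more than once. That is the same assumption the hand-written loop makes.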
Step3:
Create input.txt and copy it to HDFS (the job in Step4 reads it from /user/training/web/input.txt).
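
One possible way to do this step from Python is sketched below. It writes the sample text shown at the end of this post and builds a `hadoop fs -put` command for it. The helper names write_input, put_command and upload are invented for this sketch, and it assumes the hadoop command is on the PATH and that the target HDFS directory already exists.

```python
import subprocess

# Sample text taken from the input.txt listing at the end of this post.
SAMPLE_TEXT = "dd rr\nrr tt\nww\nww\n"

# HDFS path used by the streaming job in Step4.
HDFS_INPUT = "/user/training/web/input.txt"


def write_input(path="input.txt", text=SAMPLE_TEXT):
    """Create the local input file and return its path."""
    with open(path, "w") as handle:
        handle.write(text)
    return path


def put_command(local_path, hdfs_path=HDFS_INPUT):
    """Build the argument list for copying a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def upload(local_path, hdfs_path=HDFS_INPUT):
    """Run the copy; raises CalledProcessError if hadoop reports a failure."""
    subprocess.check_call(put_command(local_path, hdfs_path))
```

Calling upload(write_input()) on the training machine would create the file and copy it. The same thing can be done by hand by typing the command that put_command returns.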
Step4:
[training@localhost ~]$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-*streaming*.jar \
    -file /home/training/map.py -mapper /home/training/map.py \
    -file /home/training/red.py -reducer /home/training/red.py \
    -input /user/training/web/input.txt -output /user/training/web/stout
Step5:
[training@localhost ~]$ hadoop fs -cat web/stout/part-00000
dd 1
rr 2
tt 1
ww 2
Input we have given:
[training@localhost ~]$ cat input.txt
dd rr
rr tt
ww
ww
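
To check the scripts without a cluster, the whole job can be imitated locally. The sketch below is only an approximation of what Hadoop Streaming does: it runs the map logic, sorts the intermediate lines (standing in for the shuffle and sort phase), then runs the reduce logic. The function names map_words and reduce_sorted are made up here, and the logic is a restatement of map.py and red.py rather than the files themselves.

```python
def map_words(lines):
    """Emit one "word<TAB>1" line per word, like map.py."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


def reduce_sorted(lines):
    """Sum counts of adjacent equal words, like red.py (needs sorted input)."""
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # Count was not a number, so ignore this line.
            continue
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)


# The four lines of input.txt shown above.
input_lines = ['dd rr', 'rr tt', 'ww', 'ww']

# sorted() stands in for Hadoop's sort-by-key between map and reduce.
shuffled = sorted(map_words(input_lines))
output = list(reduce_sorted(shuffled))
for line in output:
    print(line)
```

For this input the printed lines are dd 1, rr 2, tt 1 and ww 2 (tab-separated), matching the part-00000 listing in Step5.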