Sunday 13 August 2017

HADOOP STREAMING USING PYTHON

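Step1:
Create the mapper script /home/training/map.py

#!/usr/bin/python
import sys

# Emit one "word<TAB>1" record for every word read from standard input.
# The parenthesised print works under both Python 2 and Python 3.
for myline in sys.stdin:
    myline = myline.strip()
    words = myline.split()
    for myword in words:
        print('%s\t%s' % (myword, 1))
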
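To make the mapper's output format concrete, here is a minimal sketch of the same logic wrapped in a function. The name map_line is only an illustration and is not part of map.py.

```python
def map_line(line):
    """Return the 'word<TAB>1' records the mapper would print for one line."""
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]


if __name__ == '__main__':
    # One record per word; a blank line produces no records at all.
    for record in map_line('dd rr'):
        print(record)
```

Every word becomes its own record with a count of 1, so a word that appears twice is emitted twice. Adding the counts up is left to the reducer.
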
Step2:
Create the reducer script /home/training/red.py

#!/usr/bin/python
import sys

current_word = ""
current_count = 0

# Input arrives sorted by word, so all counts for a word are on adjacent lines.
for myline in sys.stdin:
    myline = myline.strip()
    try:
        word, count = myline.split('\t', 1)
        count = int(count)
    except ValueError:
        # Line had no tab or the count was not a number, so silently ignore it.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word:
    print('%s\t%s' % (current_word, current_count))
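The reducer only compares each word with the previous one, so it depends on Hadoop Streaming sorting the mapper output by key before the reducer reads it. The sketch below repeats the same running-total idea in a function (reduce_records is a hypothetical name, not part of red.py) and feeds it sorted and unsorted records to show the difference.

```python
def reduce_records(records):
    """Apply the reducer's running-total logic to 'word<TAB>count' records."""
    totals = []
    current_word, current_count = None, 0
    for record in records:
        word, count = record.split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            # A new word starts: flush the total of the previous word.
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals


if __name__ == '__main__':
    sorted_records = ['dd\t1', 'rr\t1', 'rr\t1', 'tt\t1', 'ww\t1', 'ww\t1']
    unsorted_records = ['rr\t1', 'dd\t1', 'rr\t1']
    print(reduce_records(sorted_records))    # one total per word
    print(reduce_records(unsorted_records))  # 'rr' is split into two entries
```

With unsorted input the same word shows up more than once in the result, which is why the script should not be run on raw mapper output without a sort step in between.
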

         
Step3:
Create input.txt and copy it into HDFS at /user/training/web/input.txt, the path passed to -input in the next step.
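
As a rough sketch of this step, the snippet below writes the sample text used in this post to a local file and builds the hadoop fs -put command for copying it to HDFS. The helper names write_input and put_command are made up for illustration, the file is written to the system temp directory by assumption, and the command is only printed, not run.

```python
import os
import tempfile

# Sample text taken from the input.txt shown at the end of this post.
SAMPLE_TEXT = "dd rr\nrr tt\nww\n\nww\n"


def write_input(path, text=SAMPLE_TEXT):
    """Write the sample text to a local file and return its path."""
    with open(path, 'w') as handle:
        handle.write(text)
    return path


def put_command(local_path, hdfs_path):
    """Build (but do not run) the upload command as an argument list."""
    return ['hadoop', 'fs', '-put', local_path, hdfs_path]


if __name__ == '__main__':
    local = write_input(os.path.join(tempfile.gettempdir(), 'input.txt'))
    print(' '.join(put_command(local, '/user/training/web/input.txt')))
```

On a machine with the Hadoop client installed, the printed command can be typed at the shell, or the list can be passed to subprocess.check_call.
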

Step4:
Run the streaming job. The output directory must not already exist.

[training@localhost ~]$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-*streaming*.jar \
    -file /home/training/map.py -mapper /home/training/map.py \
    -file /home/training/red.py -reducer /home/training/red.py \
    -input /user/training/web/input.txt \
    -output /user/training/web/stout

Step5:
View the result:

[training@localhost ~]$ hadoop fs -cat web/stout/part-00000
dd      1
rr      2
tt      1
ww      2

Input used for this run:

[training@localhost ~]$ cat input.txt
dd rr
rr tt
ww

ww
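
The counts from Step5 can be cross-checked without Hadoop. The sketch below uses collections.Counter from the Python standard library instead of the mapper and reducer scripts, so it only confirms the expected numbers and does not exercise the streaming job itself. INPUT_TEXT is the input.txt content shown above.

```python
from collections import Counter

# Content of input.txt as shown above, including the blank line.
INPUT_TEXT = "dd rr\nrr tt\nww\n\nww\n"


def word_counts(text):
    """Count whitespace-separated words and return (word, count) pairs sorted by word."""
    return sorted(Counter(text.split()).items())


if __name__ == '__main__':
    for word, total in word_counts(INPUT_TEXT):
        print('%s\t%s' % (word, total))
```

The printed pairs should match the part-00000 listing: dd 1, rr 2, tt 1, ww 2.
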
