Hadoop learning pot: HADOOP STREAMING USING PYTHON

Sunday, 13 August 2017

HADOOP STREAMING USING PYTHON

Step1:

/home/training/map.py

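#!/usr/bin/python
import sys

# Emit one "<word><TAB>1" record for every whitespace-separated word on stdin.
for myline in sys.stdin:
    myline = myline.strip()
    words = myline.split()
    for myword in words:
        # print() with a single string argument runs on both Python 2 and 3.
        print('%s\t%s' % (myword, 1))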
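The mapper logic can be tried without a cluster. The sketch below wraps the same split-and-emit step in a function so it can be checked on a sample line. The name map_line is introduced here for illustration only and is not part of the original script.

```python
def map_line(line):
    """Return the tab-separated (word, 1) records the mapper emits for one line."""
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]


if __name__ == '__main__':
    # Sample line taken from the input.txt listing at the end of this post.
    for record in map_line('dd rr\n'):
        print(record)
```

Keeping the per-line work in a function like this makes it easy to test the mapper before shipping it to Hadoop.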

Step2:

/home/training/red.py

#!/usr/bin/python
import sys

current_word = ""
current_count = 0

# Hadoop sorts the mapper output by key, so equal words arrive on adjacent lines.
for myline in sys.stdin:
    myline = myline.strip()
    try:
        word, count = myline.split('\t', 1)
        count = int(count)
    except ValueError:
        # Missing tab or non-numeric count: silently ignore this line.
        continue
    if current_word == word:
        current_count += count
    else:
        # A new word starts, so emit the total for the previous one.
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word:
    print('%s\t%s' % (current_word, current_count))
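The same reduction can also be written with itertools.groupby, under the same assumption the script above makes: the records arrive sorted by word. This is only a sketch of an alternative, not the script used in this post, and parse_records and reduce_counts are hypothetical helper names.

```python
from itertools import groupby
from operator import itemgetter


def parse_records(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping bad ones."""
    for line in lines:
        try:
            word, count = line.strip().split('\t', 1)
            count = int(count)
        except ValueError:
            # Missing tab or non-numeric count: skip the line.
            continue
        yield word, count


def reduce_counts(lines):
    """Sum the counts per word; the input must already be sorted by word."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(parse_records(lines), key=itemgetter(0))]


if __name__ == '__main__':
    sample = ['dd\t1', 'rr\t1', 'rr\t1', 'tt\t1']
    for word, total in reduce_counts(sample):
        print('%s\t%s' % (word, total))
```

groupby only merges adjacent equal keys, which is why the sorted-input assumption matters here just as it does in red.py.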

Step3:

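Create input.txt locally and copy it into HDFS at /user/training/web/input.txt (the input path used in Step4), for example with: hadoop fs -put input.txt /user/training/web/input.txt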

Step4:

[training@localhost ~]$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-*streaming*.jar \
    -file /home/training/map.py -mapper /home/training/map.py \
    -file /home/training/red.py -reducer /home/training/red.py \
    -input /user/training/web/input.txt \
    -output /user/training/web/stout

Step5:

[training@localhost ~]$ hadoop fs -cat web/stout/part-00000
dd	1
rr	2
tt	1
ww	2

Input we have given (the listing does not show the lines containing ww, which the output above counts twice):

[training@localhost ~]$ cat input.txt
dd rr
rr tt
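To sanity-check the expected counts before submitting a job, the map, sort, and reduce phases can be imitated in a single process. This is a rough local stand-in for what Hadoop Streaming does, with the shuffle approximated by a plain sort; run_local is a made-up name used only for this sketch.

```python
from itertools import groupby


def run_local(lines):
    """Imitate map -> sort -> reduce for word count on a list of text lines."""
    # Map: one (word, 1) pair per whitespace-separated token.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle/sort: Hadoop groups records by key; sorting approximates that here.
    pairs.sort(key=lambda pair: pair[0])
    # Reduce: sum the counts for each run of identical words.
    return [(word, sum(n for _, n in group))
            for word, group in groupby(pairs, key=lambda pair: pair[0])]


if __name__ == '__main__':
    # The two input lines shown in the listing above.
    for word, total in run_local(['dd rr', 'rr tt']):
        print('%s\t%s' % (word, total))
```

For the two lines shown, this gives dd 1, rr 2, and tt 1, which agrees with the first three rows of the job output.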
