
Sunday, 1 March 2026

MapReduce Word Count on Ubuntu with Hadoop (Python, Hadoop Streaming)

 

Step 1: Create Input File

Open a terminal on Ubuntu and create the input file:

nano input.txt

Add sample content:

hello world
hello hadoop
hello world
big data hadoop

Save and exit (in nano: Ctrl+O, Enter, then Ctrl+X).
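If you prefer a non-interactive command, the same file can be created with printf instead of nano (an equivalent alternative, producing identical content):

```shell
# Create the sample input file without opening an editor
printf 'hello world\nhello hadoop\nhello world\nbig data hadoop\n' > input.txt
cat input.txt
```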


Step 2: Create Mapper Script

nano mapper.py

mapper.py:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()

    for word in words:
        print(f"{word}\t1")

Make it executable:

chmod +x mapper.py
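Before wiring it into Hadoop, the mapper logic can be sanity-checked in plain Python. This is a quick sketch over two sample lines, separate from mapper.py itself:

```python
# Simulate the map phase: every word in the input becomes a "word\t1" record.
sample_lines = ["hello world", "hello hadoop"]

pairs = []
for line in sample_lines:
    for word in line.strip().split():
        pairs.append(f"{word}\t1")

print(pairs)  # one tab-separated (word, 1) record per word
```

Each occurrence of a word produces its own record; the counting happens later, in the reducer.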

Step 3: Create Reducer Script

nano reducer.py

reducer.py:

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")

Make it executable:

chmod +x reducer.py
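The reducer depends on its input arriving sorted by key, which Hadoop's shuffle phase (and the sort command in the local test of Step 4) guarantees. Here is a small in-memory sketch of the same grouping logic, run over made-up mapper output:

```python
# Simulate shuffle + reduce: sort the records, then sum counts
# over each run of equal keys, exactly like reducer.py does.
mapped = sorted(["hello\t1", "world\t1", "hello\t1", "hadoop\t1"])

results = []
current_word, current_count = None, 0
for record in mapped:
    word, count = record.split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            results.append((current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    results.append((current_word, current_count))  # flush the last key

print(results)
```

Without the sort, identical words would not be adjacent and the run-based grouping would emit a word more than once.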

Step 4: Test Locally (Without Hadoop)

cat input.txt | ./mapper.py | sort | ./reducer.py

Expected output (word and count separated by a tab):

big 1
data 1
hadoop 2
hello 3
world 2

✔ Works correctly.
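The local result can also be cross-checked against Python's collections.Counter, which counts the same whitespace-separated words in one call:

```python
from collections import Counter

# Same content as input.txt from Step 1.
text = "hello world\nhello hadoop\nhello world\nbig data hadoop"

counts = Counter(text.split())
print(sorted(counts.items()))
```

The counts match the pipeline output above: hello 3, world 2, hadoop 2, big 1, data 1.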


Step 5: Run Using Hadoop (Hadoop Streaming)

Create an HDFS directory:

hdfs dfs -mkdir /wordcount

Upload the input file:

hdfs dfs -put input.txt /wordcount

Run the Hadoop Streaming job:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /wordcount/input.txt \
-output /wordcount_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

Note: the output directory must not already exist; before re-running the job, delete it with hdfs dfs -rm -r /wordcount_output. On newer Hadoop versions -file is deprecated in favour of the generic -files mapper.py,reducer.py option, which must be placed before the streaming-specific options.

Step 6: View Output

hdfs dfs -cat /wordcount_output/part-00000

(With more than one reducer there will be several part files; use /wordcount_output/part-* to see them all.)
