Step 1: Create Input File
Open an Ubuntu terminal and create the input file with nano:
nano input.txt
Add sample content:
hello world
hello hadoop
hello world
big data hadoop
Save and exit (in nano: Ctrl+O, Enter, then Ctrl+X).
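Alternatively, the same file can be created non-interactively with a here-document (a convenience sketch; nano works just as well):

```shell
# Write the four sample lines to input.txt in one command
cat > input.txt <<'EOF'
hello world
hello hadoop
hello world
big data hadoop
EOF
```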
Step 2: Create Mapper Script
nano mapper.py
mapper.py
#!/usr/bin/env python3
import sys

# Read lines from standard input and emit "word<TAB>1" for each token.
# Hadoop Streaming treats everything before the first tab as the key.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")
Make it executable:
chmod +x mapper.py
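To see what the mapper emits before wiring it into a pipeline, the same logic can be exercised in plain Python. This is an illustrative sketch: io.StringIO stands in for sys.stdin, and the pairs list exists only so the result is easy to inspect.

```python
import io

# io.StringIO simulates the lines the mapper would read from stdin
fake_stdin = io.StringIO("hello world\nhello hadoop\n")

pairs = []
for line in fake_stdin:
    for word in line.strip().split():
        # Same "word<TAB>1" format the real mapper prints
        pairs.append(f"{word}\t1")

print("\n".join(pairs))
```

Note that the mapper emits one pair per occurrence; it does no counting itself. The counting happens in the reducer, after the pairs have been sorted by key.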
Step 3: Create Reducer Script
nano reducer.py
reducer.py
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Input arrives sorted by word, so all pairs for a given word are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Flush the total for the last word
if current_word is not None:
    print(f"{current_word}\t{current_count}")
Make it executable:
chmod +x reducer.py
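The reducer only works because its input is sorted by key, so equal words arrive on consecutive lines; that ordering is supplied by the sort step locally and by Hadoop's shuffle-and-sort phase on the cluster. A small sketch of the same accumulation logic on pre-sorted pairs (the sorted_pairs list is made up for illustration):

```python
# Pre-sorted (word, count) pairs, as they would arrive after shuffle/sort
sorted_pairs = [("hadoop", 1), ("hello", 1), ("hello", 1), ("world", 1)]

totals = []
current_word, current_count = None, 0
for word, count in sorted_pairs:
    if word == current_word:
        # Same key as the previous line: keep accumulating
        current_count += count
    else:
        # New key: emit the finished total, then start a fresh count
        if current_word is not None:
            totals.append((current_word, current_count))
        current_word, current_count = word, count
if current_word is not None:
    totals.append((current_word, current_count))

print(totals)
```

If the input were not sorted, the two "hello" pairs could be separated and would each be emitted as a partial count, which is why the sort (or shuffle) step is not optional.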
Step 4: Test Locally (Without Hadoop)
cat input.txt | ./mapper.py | sort | ./reducer.py
The sort between the mapper and the reducer mimics Hadoop's shuffle-and-sort phase, which groups identical keys together before they reach the reducer.
Output:
big 1
data 1
hadoop 2
hello 3
world 2
✔ Works correctly.
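As a sanity check, the same counts can be reproduced in pure Python with collections.Counter (a cross-check sketch only, not part of the Hadoop pipeline):

```python
from collections import Counter

text = """hello world
hello hadoop
hello world
big data hadoop"""

# Count every whitespace-separated token, mirroring mapper + sort + reducer
counts = Counter(text.split())
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```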
Step 5: Run Using Hadoop (Hadoop Streaming)
Create HDFS directory
hdfs dfs -mkdir /wordcount
Upload input file
hdfs dfs -put input.txt /wordcount
Run the Hadoop Streaming job. If a previous run left /wordcount_output behind, delete it first (hdfs dfs -rm -r /wordcount_output), since Hadoop refuses to write to an existing output directory.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /wordcount/input.txt \
-output /wordcount_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
Step 6: View Output
hdfs dfs -cat /wordcount_output/part-00000
With more than one reducer the results are split across several part files; hdfs dfs -cat /wordcount_output/part-* prints them all.