PART 1: Word Count in Hadoop (Python – Hadoop Streaming)
Step 1: Create Input File
nano input.txt
Example content:
hello world
hello hadoop
big data hadoop
hello spark
Step 2: Create Mapper
nano mapper.py
mapper.py
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")
Make executable:
chmod +x mapper.py
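To see what the mapper will emit before touching Hadoop, the same logic can be run on the sample input in plain Python; the sketch below mirrors mapper.py's loop:

```python
# Mirrors mapper.py: emit "word\t1" for every word in the sample input.
sample = ["hello world", "hello hadoop", "big data hadoop", "hello spark"]

pairs = []
for line in sample:
    for word in line.strip().split():
        pairs.append(f"{word}\t1")

print(pairs[:3])  # → ['hello\t1', 'world\t1', 'hello\t1']
```

Each word becomes one tab-separated "word, 1" record; Hadoop's shuffle phase will later group these records by key before they reach the reducer.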
Step 3: Create Reducer
nano reducer.py
reducer.py
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")
Make executable:
chmod +x reducer.py
🧪 Test Locally
cat input.txt | ./mapper.py | sort | ./reducer.py
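The shell pipeline above can also be simulated in pure Python to check the expected result; here sorted() stands in for the sort step and a dict accumulates counts the way the reducer does:

```python
# Simulate `mapper | sort | reducer` on the sample input from Step 1.
sample = ["hello world", "hello hadoop", "big data hadoop", "hello spark"]

# Map: emit (word, 1) pairs, then sort by key, as `sort` would.
pairs = sorted((word, 1) for line in sample for word in line.split())

# Reduce: sum the counts for consecutive identical keys.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # → {'big': 1, 'data': 1, 'hadoop': 2, 'hello': 3, 'spark': 1, 'world': 1}
```

The sort step matters: reducer.py only compares each key to the previous one, so it is correct only when identical keys arrive consecutively, which is exactly what `sort` (and Hadoop's shuffle) guarantees.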
Run in Hadoop
Upload to HDFS
hdfs dfs -mkdir /wordcount
hdfs dfs -put input.txt /wordcount
Run Streaming Job
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -files mapper.py,reducer.py \
  -input /wordcount/input.txt \
  -output /wordcount_output \
  -mapper mapper.py \
  -reducer reducer.py
(The generic -files option ships both scripts to every node; on older Hadoop releases use the deprecated -file mapper.py -file reducer.py form instead.)
View Output
hdfs dfs -cat /wordcount_output/part-00000
For the sample input, the output should resemble:
big	1
data	1
hadoop	2
hello	3
spark	1
world	1
PART 2: Install Apache Spark on Ubuntu
Step 1: Download Spark
Go to the official Apache Spark website and download the latest release (pre-built for Hadoop).
Or use terminal:
wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
(Adjust version if needed.)
Step 2: Extract Spark
tar -xvzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark
Step 3: Set Environment Variables
Open:
nano ~/.bashrc
Add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Apply:
source ~/.bashrc
Step 4: Verify Spark
spark-shell
If Spark starts successfully → Installation successful ✅
Exit:
:quit
PART 3: Word Count in Spark (PySpark)
Method 1: Using PySpark Script
Create file:
nano spark_wordcount.py
spark_wordcount.py
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

sc = spark.sparkContext

# Read input file
text_file = sc.textFile("input.txt")

# Perform word count
word_counts = (text_file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))

# Save output
word_counts.saveAsTextFile("spark_output")

spark.stop()
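For readers new to the RDD API, the three transformations can be mimicked with ordinary Python on the sample data; this is a sketch of the semantics, not Spark itself:

```python
# Pure-Python analogue of the flatMap / map / reduceByKey chain above.
lines = ["hello world", "hello hadoop", "big data hadoop", "hello spark"]

# flatMap: split each line into words, flattening everything into one sequence.
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: combine values that share a key using `a + b`.
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(sorted(counts.items()))
# → [('big', 1), ('data', 1), ('hadoop', 2), ('hello', 3), ('spark', 1), ('world', 1)]
```

The difference in Spark is that each step runs in parallel across partitions, and reduceByKey combines counts locally on each partition before shuffling, so far less data crosses the network than in the naive Hadoop Streaming version.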
Run Spark Word Count
spark-submit spark_wordcount.py
View Output
cat spark_output/part-*
(Spark writes one part file per partition, so there may be more than just part-00000; the glob catches them all.)