
Sunday, 1 March 2026

Word Count on Ubuntu: Hadoop and Spark

 

PART 1: Word Count in Hadoop (Python – Hadoop Streaming)


 Step 1: Create Input File

nano input.txt

Example content:

hello world
hello hadoop
big data hadoop
hello spark
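For reference, these four lines should produce: hello 3, hadoop 2, and 1 each for world, big, data, and spark. A quick sanity check with Python's collections.Counter (not part of the Hadoop job, just a local check):

```python
from collections import Counter

# The example input file contents
text = """hello world
hello hadoop
big data hadoop
hello spark"""

# Count whitespace-separated words, exactly as the MapReduce job will
counts = Counter(text.split())
print(counts)
# Counter({'hello': 3, 'hadoop': 2, 'world': 1, 'big': 1, 'data': 1, 'spark': 1})
```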

Step 2: Create Mapper

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")

Make executable:

chmod +x mapper.py
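You can exercise the mapper's logic without Hadoop by applying it to a line directly. A minimal sketch (the map_line helper is just for illustration, not part of the job):

```python
def map_line(line):
    """Mimic mapper.py: emit one 'word\t1' record per word in the line."""
    return [f"{word}\t1" for word in line.strip().split()]

print(map_line("hello world"))
# ['hello\t1', 'world\t1']
```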

 Step 3: Create Reducer

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")

Make executable:

chmod +x reducer.py
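The reducer assumes its input is sorted by key: Hadoop's shuffle phase guarantees this, and locally `sort` provides it. A sketch of the same running-total logic on sorted records (reduce_sorted is a hypothetical helper for illustration):

```python
def reduce_sorted(records):
    """Mimic reducer.py: sum counts for runs of consecutive identical keys."""
    totals = []
    current_word, current_count = None, 0
    for rec in records:
        word, count = rec.split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals

print(reduce_sorted(["hadoop\t1", "hadoop\t1", "hello\t1"]))
# [('hadoop', 2), ('hello', 1)]
```

Note that the logic only works on sorted input: if the two "hadoop" records were not adjacent, they would be counted separately.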

🧪 Test Locally

cat input.txt | ./mapper.py | sort | ./reducer.py

Expected output:

big	1
data	1
hadoop	2
hello	3
spark	1
world	1

 Run in Hadoop

Upload to HDFS

hdfs dfs -mkdir /wordcount
hdfs dfs -put input.txt /wordcount

Run Streaming Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
    -files mapper.py,reducer.py \
    -input /wordcount/input.txt \
    -output /wordcount_output \
    -mapper mapper.py \
    -reducer reducer.py

(The generic -files option replaces the deprecated -file and must come before the streaming options.)

View Output

hdfs dfs -cat /wordcount_output/part-00000

 PART 2: Install Apache Spark on Ubuntu


 Step 1: Download Spark

Go to the official Apache Spark website and download the latest release, pre-built for Hadoop.

Or use terminal:

wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

(Adjust version if needed.)


 Step 2: Extract Spark

tar -xvzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

 Step 3: Set Environment Variables

Open:

nano ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply:

source ~/.bashrc

 Step 4: Verify Spark

spark-shell

If the Spark shell starts and reaches the scala> prompt → Installation successful ✅

Exit:

:quit

 PART 3: Word Count in Spark (PySpark)


Using a PySpark Script

Create file:

nano spark_wordcount.py

 spark_wordcount.py

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

sc = spark.sparkContext

# Read input file
text_file = sc.textFile("input.txt")

# Perform word count
word_counts = (text_file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))

# Save output
word_counts.saveAsTextFile("spark_output")

spark.stop()

 Run Spark Word Count

spark-submit spark_wordcount.py

 View Output

cat spark_output/part-00000
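Note that the output format differs from the Hadoop job: saveAsTextFile on an RDD of tuples writes each pair's Python repr, e.g. ('hello', 3), rather than tab-separated fields. If you need the counts back in Python, ast.literal_eval can parse such lines. A sketch assuming that format (the sample lines here are illustrative, not read from a real output file):

```python
import ast

# Example lines in the format Spark's saveAsTextFile writes for tuples
lines = ["('hello', 3)", "('hadoop', 2)"]

# Safely evaluate each tuple literal and collect into a dict
counts = dict(ast.literal_eval(line) for line in lines)
print(counts)
# {'hello': 3, 'hadoop': 2}
```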
