
Sunday, 1 March 2026

Word Count on Ubuntu: Hadoop and Spark

 

PART 1: Word Count in Hadoop (Python – Hadoop Streaming)


 Step 1: Create Input File

nano input.txt

Example content:

hello world
hello hadoop
big data hadoop
hello spark
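For reference, these four lines should produce: hello 3, hadoop 2, and 1 each for world, big, data, and spark. A quick sanity check with Python's collections.Counter (not part of the Hadoop job, just a local check):

```python
from collections import Counter

# The example input file contents
text = """hello world
hello hadoop
big data hadoop
hello spark"""

# Count whitespace-separated words, exactly as the MapReduce job will
counts = Counter(text.split())
print(counts)
# Counter({'hello': 3, 'hadoop': 2, 'world': 1, 'big': 1, 'data': 1, 'spark': 1})
```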

Step 2: Create Mapper

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")

Make executable:

chmod +x mapper.py
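You can exercise the mapper's logic without Hadoop by applying it to a line directly. A minimal sketch (the map_line helper is just for illustration, not part of the job):

```python
def map_line(line):
    """Mimic mapper.py: emit one 'word\t1' record per word in the line."""
    return [f"{word}\t1" for word in line.strip().split()]

print(map_line("hello world"))
# ['hello\t1', 'world\t1']
```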

 Step 3: Create Reducer

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")

Make executable:

chmod +x reducer.py
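The reducer assumes its input is sorted by key: Hadoop's shuffle phase guarantees this, and locally `sort` provides it. A sketch of the same running-total logic on sorted records (reduce_sorted is a hypothetical helper for illustration):

```python
def reduce_sorted(records):
    """Mimic reducer.py: sum counts for runs of consecutive identical keys."""
    totals = []
    current_word, current_count = None, 0
    for rec in records:
        word, count = rec.split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals

print(reduce_sorted(["hadoop\t1", "hadoop\t1", "hello\t1"]))
# [('hadoop', 2), ('hello', 1)]
```

Note that the logic only works on sorted input: if the two "hadoop" records were not adjacent, they would be counted separately.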

🧪 Test Locally

cat input.txt | ./mapper.py | sort | ./reducer.py

Expected output:

big	1
data	1
hadoop	2
hello	3
spark	1
world	1

 Run in Hadoop

Upload to HDFS

hdfs dfs -mkdir /wordcount
hdfs dfs -put input.txt /wordcount

Run Streaming Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
    -files mapper.py,reducer.py \
    -input /wordcount/input.txt \
    -output /wordcount_output \
    -mapper mapper.py \
    -reducer reducer.py

(The generic -files option replaces the deprecated -file and must come before the streaming options.)

View Output

hdfs dfs -cat /wordcount_output/part-00000

 PART 2: Install Apache Spark on Ubuntu


 Step 1: Download Spark

Go to the official Apache Spark website and download the latest release, pre-built for Hadoop.

Or use terminal:

wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

(Adjust version if needed.)


 Step 2: Extract Spark

tar -xvzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

 Step 3: Set Environment Variables

Open:

nano ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply:

source ~/.bashrc

 Step 4: Verify Spark

spark-shell

If the Spark shell starts and reaches the scala> prompt → Installation successful ✅

Exit:

:quit

 PART 3: Word Count in Spark (PySpark)


Using a PySpark Script

Create file:

nano spark_wordcount.py

 spark_wordcount.py

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

sc = spark.sparkContext

# Read input file
text_file = sc.textFile("input.txt")

# Perform word count
word_counts = (text_file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))

# Save output
word_counts.saveAsTextFile("spark_output")

spark.stop()

 Run Spark Word Count

spark-submit spark_wordcount.py

 View Output

cat spark_output/part-00000
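Note that the output format differs from the Hadoop job: saveAsTextFile on an RDD of tuples writes each pair's Python repr, e.g. ('hello', 3), rather than tab-separated fields. If you need the counts back in Python, ast.literal_eval can parse such lines. A sketch assuming that format (the sample lines here are illustrative, not read from a real output file):

```python
import ast

# Example lines in the format Spark's saveAsTextFile writes for tuples
lines = ["('hello', 3)", "('hadoop', 2)"]

# Safely evaluate each tuple literal and collect into a dict
counts = dict(ast.literal_eval(line) for line in lines)
print(counts)
# {'hello': 3, 'hadoop': 2}
```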
