
Sunday, 1 March 2026

Word Count on Ubuntu: Hadoop and Spark

 

PART 1: Word Count in Hadoop (Python – Hadoop Streaming)


 Step 1: Create Input File

nano input.txt

Example content:

hello world
hello hadoop
big data hadoop
hello spark

Step 2: Create Mapper

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")

Make executable:

chmod +x mapper.py

 Step 3: Create Reducer

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")

Make executable:

chmod +x reducer.py

🧪 Test Locally

cat input.txt | ./mapper.py | sort | ./reducer.py
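If Hadoop is not set up yet, the same logic can be sanity-checked in plain Python. This sketch (an illustration, not part of the job itself) simulates the mapper | sort | reducer pipeline in-process using the sample input above:

```python
# Simulate the Hadoop Streaming pipeline: map -> sort -> reduce, all in Python.
lines = ["hello world", "hello hadoop", "big data hadoop", "hello spark"]

# Mapper stage: emit one "word\t1" pair per word.
mapped = [f"{word}\t1" for line in lines for word in line.split()]

# Shuffle/sort stage: Hadoop groups by key; `sort` does the same locally.
mapped.sort()

# Reducer stage: sum consecutive counts for each word.
results = {}
current_word, current_count = None, 0
for pair in mapped:
    word, count = pair.split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            results[current_word] = current_count
        current_word, current_count = word, int(count)
if current_word is not None:
    results[current_word] = current_count

print(results)
```

Because the pairs are sorted before reducing, all occurrences of a word arrive consecutively, which is exactly why the reducer only needs to compare against the previous key.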

 Run in Hadoop

Upload to HDFS

hdfs dfs -mkdir /wordcount
hdfs dfs -put input.txt /wordcount

Run Streaming Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /wordcount/input.txt \
-output /wordcount_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

View Output

hdfs dfs -cat /wordcount_output/part-00000

 PART 2: Install Apache Spark on Ubuntu


 Step 1: Download Spark

Go to the official Apache Spark website and download the latest release (pre-built for Hadoop).

Or use terminal:

wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

(Adjust version if needed.)


 Step 2: Extract Spark

tar -xvzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

 Step 3: Set Environment Variables

Open:

nano ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply:

source ~/.bashrc

 Step 4: Verify Spark

spark-shell

If Spark starts successfully → Installation successful ✅

Exit:

:quit

 PART 3: Word Count in Spark (PySpark)


 Method 1: Using PySpark Script

Create file:

nano spark_wordcount.py

 spark_wordcount.py

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

sc = spark.sparkContext

# Read input file
text_file = sc.textFile("input.txt")

# Perform word count
word_counts = (text_file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))

# Save output
word_counts.saveAsTextFile("spark_output")

spark.stop()

 Run Spark Word Count

spark-submit spark_wordcount.py

 View Output

cat spark_output/part-00000
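The RDD chain above can be mirrored in plain Python to check the expected counts without Spark: flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums per key, which collections.Counter does in one step (a sketch using the same sample input):

```python
from collections import Counter

lines = ["hello world", "hello hadoop", "big data hadoop", "hello spark"]

# flatMap: flatten lines into one word per element
words = [w for line in lines for w in line.split()]

# map + reduceByKey: Counter pairs each word with 1 and sums per key
word_counts = Counter(words)

print(word_counts.most_common())
```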

MovieLens Data on Ubuntu: Hadoop MapReduce

 

Problem Statement

From MovieLens tags.csv, find all tags associated with each movie.


 Step 1: Understand MovieLens Tags Data

Assume tags.csv format:

userId,movieId,tag,timestamp

Example:

1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815

We want output like:

296 funny,classic,crime
1 Pixar,animation

 Step 2: Create Input File

nano tags.csv

Paste sample data (remove header if present):

1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815

🧩 Step 3: Create Mapper Script

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    parts = line.split(",")

    # Skip header if present
    if parts[0] == "userId":
        continue

    try:
        movieId = parts[1]
        tag = parts[2]
        print(f"{movieId}\t{tag}")
    except IndexError:
        continue

Make executable:

chmod +x mapper.py

 Step 4: Create Reducer Script

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_movie = None
tags = []

for line in sys.stdin:
    line = line.strip()
    movieId, tag = line.split("\t")

    if movieId == current_movie:
        if tag not in tags:
            tags.append(tag)
    else:
        if current_movie:
            print(f"{current_movie}\t{','.join(tags)}")
        current_movie = movieId
        tags = [tag]

if current_movie:
    print(f"{current_movie}\t{','.join(tags)}")

Make executable:

chmod +x reducer.py

 Step 5: Test Locally

cat tags.csv | ./mapper.py | sort | ./reducer.py

 Expected Output

1	Pixar,animation
296	classic,crime,funny

(Note: the tag order for each movie follows the line order produced by sort, so the tags come out in sorted order, not input order.)
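The group-by behaviour of the shuffle can be checked in plain Python as well. This sketch (illustrative only) maps each sample row to a movieId/tag pair, sorts the pairs as the shuffle would, and collects unique tags per movie:

```python
# Simulate the tags-per-movie job: map to (movieId, tag), sort, then group.
rows = [
    "1,296,funny,1147880044",
    "2,296,classic,1147868817",
    "3,1,Pixar,964982703",
    "4,1,animation,964982224",
    "5,296,crime,964983815",
]

# Mapper output, then the shuffle's lexicographic sort (like `sort` locally)
pairs = sorted(f"{r.split(',')[1]}\t{r.split(',')[2]}" for r in rows)

# Reducer: collect unique tags per consecutive movieId
grouped = {}
for pair in pairs:
    movie_id, tag = pair.split("\t")
    grouped.setdefault(movie_id, [])
    if tag not in grouped[movie_id]:
        grouped[movie_id].append(tag)

for movie_id, tags in grouped.items():
    print(f"{movie_id}\t{','.join(tags)}")
```

Note that sorting reorders the tags within each movie, so the reducer sees (and emits) them in lexicographic order.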

 Step 6: Run in Hadoop Streaming

 Create HDFS directory

hdfs dfs -mkdir /movietags

 Upload file

hdfs dfs -put tags.csv /movietags

 Run Hadoop Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /movietags/tags.csv \
-output /movietags_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

 Step 7: View Output

hdfs dfs -cat /movietags_output/part-00000

MapReduce on Weather Data: Ubuntu Hadoop Using Python

 

Weather Data Mining using Hadoop Streaming


 Step 1: Create Input File

Open terminal:

nano weather_data.txt

Add sample data:

2023-10-01,25,60,0
2023-10-02,30,70,5
2023-10-03,15,80,10
2023-10-04,10,90,15
2023-10-05,35,50,0

Format:

Date,Temperature,Humidity,Precipitation

 Step 2: Create Mapper Script

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    try:
        date, temp, humidity, precipitation = line.split(",")

        temp = float(temp)
        humidity = float(humidity)
        precipitation = float(precipitation)

        # Weather condition logic
        if precipitation > 0:
            message = "Rainy day"
        elif temp >= 35:
            message = "Very Hot day"
        elif temp >= 30:
            message = "Hot day"
        elif temp <= 10:
            message = "Very Cold day"
        elif temp <= 15:
            message = "Cold day"
        elif humidity > 85:
            message = "Humid day"
        else:
            message = "Pleasant day"

        print(f"{date}\t{message}")

    except ValueError:
        continue

Make executable:

chmod +x mapper.py

 Step 3: Create Reducer Script

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        print(line)

Make executable:

chmod +x reducer.py

👉 Note: The reducer is an identity pass-through because the classification is already done in the mapper.


 Step 4: Test Locally (Without Hadoop)

cat weather_data.txt | ./mapper.py | sort | ./reducer.py

 Expected Output

2023-10-01 Pleasant day
2023-10-02 Rainy day
2023-10-03 Rainy day
2023-10-04 Rainy day
2023-10-05 Very Hot day
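The threshold rules can be checked without Hadoop. This sketch (not part of the job) pulls the mapper's classification into a standalone function and runs it over the sample rows; note that precipitation is tested first, which is why 2023-10-02 is "Rainy day" rather than "Hot day":

```python
def classify(temp, humidity, precipitation):
    """Same rules as the mapper: precipitation wins, then temperature, then humidity."""
    if precipitation > 0:
        return "Rainy day"
    elif temp >= 35:
        return "Very Hot day"
    elif temp >= 30:
        return "Hot day"
    elif temp <= 10:
        return "Very Cold day"
    elif temp <= 15:
        return "Cold day"
    elif humidity > 85:
        return "Humid day"
    return "Pleasant day"

# Sample rows: (date, temperature, humidity, precipitation)
rows = [
    ("2023-10-01", 25, 60, 0),
    ("2023-10-02", 30, 70, 5),
    ("2023-10-03", 15, 80, 10),
    ("2023-10-04", 10, 90, 15),
    ("2023-10-05", 35, 50, 0),
]

labels = {date: classify(t, h, p) for date, t, h, p in rows}
print(labels)
```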

 Step 5: Run in Hadoop Streaming

 Create HDFS directory

hdfs dfs -mkdir /weather

 Upload file

hdfs dfs -put weather_data.txt /weather

 Run Hadoop job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /weather/weather_data.txt \
-output /weather_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

 Step 6: View Output

hdfs dfs -cat /weather_output/part-00000

MapReduce Word Count on Ubuntu Hadoop

 

Step 1: Create Input File

Open Ubuntu terminal:

nano input.txt

Add sample content:

hello world
hello hadoop
hello world
big data hadoop

Save and exit.


Step 2: Create Mapper Script

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()

    for word in words:
        print(f"{word}\t1")

Make it executable:

chmod +x mapper.py

 Step 3: Create Reducer Script

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")

Make it executable:

chmod +x reducer.py

 Step 4: Test Locally (Without Hadoop)

cat input.txt | ./mapper.py | sort | ./reducer.py

Output:

big 1
data 1
hadoop 2
hello 3
world 2

✔ Works correctly.


 Step 5: Run Using Hadoop (Hadoop Streaming)

 Create HDFS directory

hdfs dfs -mkdir /wordcount

 Upload input file

hdfs dfs -put input.txt /wordcount

 Run Hadoop Streaming Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /wordcount/input.txt \
-output /wordcount_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

 Step 6: View Output

hdfs dfs -cat /wordcount_output/part-00000
