
Sunday, 1 March 2026

MovieLens data on Ubuntu: Hadoop MapReduce

 

Problem Statement

From MovieLens tags.csv, find all tags associated with each movie.


 Step 1: Understand MovieLens Tags Data

Assume tags.csv format:

userId,movieId,tag,timestamp

Example:

1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815

We want output like:

296 funny,classic,crime
1 Pixar,animation
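Before touching Hadoop, the grouping can be sketched in plain Python to confirm the expected result. This is a standalone illustration, not part of the job itself:

```python
# Plain-Python sketch of the job: group tags by movieId.
rows = [
    "1,296,funny,1147880044",
    "2,296,classic,1147868817",
    "3,1,Pixar,964982703",
    "4,1,animation,964982224",
    "5,296,crime,964983815",
]

movie_tags = {}
for row in rows:
    user_id, movie_id, tag, timestamp = row.split(",")
    movie_tags.setdefault(movie_id, []).append(tag)

for movie_id, tags in movie_tags.items():
    print(f"{movie_id}\t{','.join(tags)}")
# Prints:
#   296	funny,classic,crime
#   1	Pixar,animation
```

MapReduce produces the same grouping, but distributed: the mapper emits (movieId, tag) pairs and the framework brings pairs with the same key together for the reducer.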

 Step 2: Create Input File

nano tags.csv

Paste sample data (remove header if present):

1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815

 Step 3: Create Mapper Script

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    parts = line.split(",")

    # Skip header if present
    if parts[0] == "userId":
        continue

    try:
        movieId = parts[1]
        tag = parts[2]
        print(f"{movieId}\t{tag}")
    except IndexError:
        # Skip malformed lines that are missing fields
        continue

Make executable:

chmod +x mapper.py

 Step 4: Create Reducer Script

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_movie = None
tags = []

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    movieId, tag = line.split("\t", 1)

    if movieId == current_movie:
        if tag not in tags:
            tags.append(tag)
    else:
        # New key: flush the previous movie's tags before starting over
        if current_movie:
            print(f"{current_movie}\t{','.join(tags)}")
        current_movie = movieId
        tags = [tag]

# Flush the last movie
if current_movie:
    print(f"{current_movie}\t{','.join(tags)}")

Make executable:

chmod +x reducer.py

 Step 5: Test Locally

cat tags.csv | ./mapper.py | sort | ./reducer.py

 Expected Output

1 Pixar,animation
296 classic,crime,funny

Note that the tags come out in sorted order, not input order, because every mapper line passes through sort before reaching the reducer; the exact placement of mixed-case tags such as Pixar can vary with your locale settings.
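The sort between mapper and reducer stands in for Hadoop's shuffle phase: the reducer only compares each key to the previous one, so all lines for a movie must arrive together. A quick sketch of that grouping with itertools.groupby (illustrative only, not part of the job):

```python
from itertools import groupby

# Unsorted (movieId, tag) pairs, as the mapper emits them.
mapper_output = [
    ("296", "funny"), ("296", "classic"), ("1", "Pixar"),
    ("1", "animation"), ("296", "crime"),
]

# Without sorting, groupby would emit key "296" twice; sorting
# brings all pairs for a movie together, as Hadoop's shuffle does.
mapper_output.sort()

grouped = {}
for movie_id, pairs in groupby(mapper_output, key=lambda kv: kv[0]):
    grouped[movie_id] = [tag for _, tag in pairs]
    print(f"{movie_id}\t{','.join(grouped[movie_id])}")
```

If you skip the sort, the reducer's "compare with previous key" logic breaks and a movie's tags are split across several output lines.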

 Step 6: Run in Hadoop Streaming

 Create HDFS directory

hdfs dfs -mkdir /movietags

 Upload file

hdfs dfs -put tags.csv /movietags

 Run Hadoop Job

If /movietags_output already exists from a previous run, delete it first (the job fails if the output directory exists):

hdfs dfs -rm -r /movietags_output

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -input /movietags/tags.csv \
  -output /movietags_output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py

 Step 7: View Output

hdfs dfs -cat /movietags_output/part-00000
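If you want the results back in Python for further analysis, a minimal sketch (assuming you first copied the part file locally, e.g. with hdfs dfs -get /movietags_output/part-00000; the function name is just for illustration):

```python
# Parse reducer output lines ("movieId\ttag1,tag2,...") into a dict.
# Assumes the part file was fetched locally beforehand, e.g. with:
#   hdfs dfs -get /movietags_output/part-00000
def parse_output(lines):
    movie_tags = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        movie_id, tags = line.split("\t", 1)
        movie_tags[movie_id] = tags.split(",")
    return movie_tags

sample = ["1\tPixar,animation", "296\tclassic,crime,funny"]
print(parse_output(sample))
# → {'1': ['Pixar', 'animation'], '296': ['classic', 'crime', 'funny']}
```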
