Problem Statement

From MovieLens tags.csv, find all tags associated with each movie.

Step 1: Understand MovieLens Tags Data

Assume tags.csv format:


userId,movieId,tag,timestamp

Example:


1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815

We want output like:


296    funny,classic,crime
1      Pixar,animation

Step 2: Create Input File


nano tags.csv

Paste sample data (remove header if present):


1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815

🧩 Step 3: Create Mapper Script


nano mapper.py

mapper.py


#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    parts = line.split(",")

    # Skip header if present
    if parts[0] == "userId":
        continue

    try:
        movieId = parts[1]
        tag = parts[2]

        print(f"{movieId}\t{tag}")

    except:
        continue

Make executable:


chmod +x mapper.py

Step 4: Create Reducer Script


nano reducer.py

reducer.py


#!/usr/bin/env python3
import sys

current_movie = None
tags = []

for line in sys.stdin:
    line = line.strip()
    movieId, tag = line.split("\t")

    if movieId == current_movie:
        if tag not in tags:
            tags.append(tag)
    else:
        if current_movie:
            print(f"{current_movie}\t{','.join(tags)}")
        current_movie = movieId
        tags = [tag]

if current_movie:
    print(f"{current_movie}\t{','.join(tags)}")

Make executable:


chmod +x reducer.py

Step 5: Test Locally


cat tags.csv | ./mapper.py | sort | ./reducer.py

Expected Output


1    Pixar,animation
296  funny,classic,crime

Step 6: Run in Hadoop Streaming

Create HDFS directory


hdfs dfs -mkdir /movietags

Upload file


hdfs dfs -put tags.csv /movietags

Run Hadoop Job


hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /movietags/tags.csv \
-output /movietags_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

Step 7: View Output


hdfs dfs -cat /movietags_output/part-00000

Hadoop learning pot

Search This Blog

Sunday, 1 March 2026

movie lens data ubuntu hadoop map reduce

Problem Statement

Step 1: Understand MovieLens Tags Data

Step 2: Create Input File

🧩 Step 3: Create Mapper Script

mapper.py

Step 4: Create Reducer Script

reducer.py

Step 5: Test Locally

Expected Output

Step 6: Run in Hadoop Streaming

Create HDFS directory

Upload file

Run Hadoop Job

Step 7: View Output

No comments:

Post a Comment

Hadoop Analytics

AI & DS HUE EXPERIMENT

Search This Blog