Problem Statement
From MovieLens tags.csv, find all tags associated with each movie.
Step 1: Understand MovieLens Tags Data
Assume tags.csv format:
userId,movieId,tag,timestamp
Example:
1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815
We want output like:
296 funny,classic,crime
1 Pixar,animation
Step 2: Create Input File
nano tags.csv
Paste sample data (remove header if present):
1,296,funny,1147880044
2,296,classic,1147868817
3,1,Pixar,964982703
4,1,animation,964982224
5,296,crime,964983815
🧩 Step 3: Create Mapper Script
nano mapper.py
mapper.py
#!/usr/bin/env python3
import sys
for line in sys.stdin:
line = line.strip()
if not line:
continue
parts = line.split(",")
# Skip header if present
if parts[0] == "userId":
continue
try:
movieId = parts[1]
tag = parts[2]
print(f"{movieId}\t{tag}")
except:
continue
Make executable:
chmod +x mapper.py
Step 4: Create Reducer Script
nano reducer.py
reducer.py
#!/usr/bin/env python3
import sys
current_movie = None
tags = []
for line in sys.stdin:
line = line.strip()
movieId, tag = line.split("\t")
if movieId == current_movie:
if tag not in tags:
tags.append(tag)
else:
if current_movie:
print(f"{current_movie}\t{','.join(tags)}")
current_movie = movieId
tags = [tag]
if current_movie:
print(f"{current_movie}\t{','.join(tags)}")
Make executable:
chmod +x reducer.py
Step 5: Test Locally
cat tags.csv | ./mapper.py | sort | ./reducer.py
Expected Output
1 Pixar,animation
296 funny,classic,crime
Step 6: Run in Hadoop Streaming
Create HDFS directory
hdfs dfs -mkdir /movietags
Upload file
hdfs dfs -put tags.csv /movietags
Run Hadoop Job
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /movietags/tags.csv \
-output /movietags_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
Step 7: View Output
hdfs dfs -cat /movietags_output/part-00000
No comments:
Post a Comment