
Sunday, 23 March 2025

MovieLens sample MapReduce: grouping tags by movie

To analyze MovieLens data using Hadoop MapReduce, we can focus on extracting and processing the tags users attach to movies. MovieLens data usually comes as multiple files (movies, ratings, tags, and so on); here we process the tag file and group its tags by movie.

The typical MovieLens data consists of:

  1. movies.csv: Contains movie metadata (movieId, title, genre).

  2. tags.csv: Contains tags associated with movies (userId, movieId, tag).

  3. ratings.csv: Contains user ratings for movies (userId, movieId, rating, timestamp).

Goal:

Associate each movie with its tags so that you can, for example, list every tag applied to a movie or identify its most common tags. The program below groups tags by movie; a counting variant is sketched after the Reducer in step 3.

Steps for the Hadoop MapReduce Program:

1. Input Data Format (tags.csv)

The tags.csv file contains the following format:

csv
userId,movieId,tag
1,1,"Great movie"
2,1,"Funny"
3,2,"Boring"
4,2,"Great movie"
5,3,"Action"

Here, userId is the ID of the user tagging the movie, movieId is the ID of the movie, and tag is the tag associated with that movie.

2. Write the Mapper

The Mapper reads each line of the tags.csv file, skips the header row, and emits the movieId as the key and the tag (with its surrounding quotes stripped) as the value. This way, for each movie, we will have a list of tags.

java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class TagMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line by commas
        String[] fields = value.toString().split(",");
        // Skip the header row and any malformed lines
        if (fields.length == 3 && !"userId".equals(fields[0].trim())) {
            // Extract movieId and tag, stripping surrounding quotes from the tag
            String movieId = fields[1].trim();
            String tag = fields[2].trim().replaceAll("^\"|\"$", "");
            // Emit movieId as key and tag as value
            context.write(new Text(movieId), new Text(tag));
        }
    }
}
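One caveat about the split: a plain split(",") would break a tag that itself contains a comma. A minimal quote-aware alternative, assuming well-formed double quoting in the file, is sketched below (QuoteAwareSplitDemo is just an illustrative harness, not part of the job):

java
public class QuoteAwareSplitDemo {
    public static void main(String[] args) {
        String line = "7,1,\"slow, but worth it\"";
        // Split only on commas that sit outside double quotes
        // (illustrative regex; assumes well-formed quoting.
        // A dedicated CSV parser library is more robust in practice.)
        String[] fields = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        for (String field : fields) {
            System.out.println(field); // prints 7, 1, "slow, but worth it"
        }
    }
}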

3. Write the Reducer

The Reducer will receive the movieId as the key and a list of tags as the values. It will aggregate these tags into a single output for each movieId.

Here, we'll deduplicate the tags with a HashSet and concatenate all tags associated with a movie into a single string, separated by commas.

java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashSet;

public class TagReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Use a HashSet to avoid duplicate tags
        HashSet<String> tagSet = new HashSet<>();
        // Iterate over all tags for this movieId and add them to the set
        for (Text value : values) {
            tagSet.add(value.toString());
        }
        // Concatenate all tags into a single comma-separated string
        StringBuilder tags = new StringBuilder();
        for (String tag : tagSet) {
            if (tags.length() > 0) {
                tags.append(", ");
            }
            tags.append(tag);
        }
        // Emit movieId as key and the aggregated tags as value
        context.write(key, new Text(tags.toString()));
    }
}
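The goal above also mentioned finding the most common tags per movie. A sketch of that variant, a hypothetical TagCountReducer that counts each tag's occurrences instead of deduplicating them, could look like this:

java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant: emits each movie's tags with occurrence counts,
// e.g. "Great movie=2, Funny=1", so the most common tags can be read off.
public class TagCountReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Count how many times each tag was applied to this movie
        Map<String, Integer> counts = new HashMap<>();
        for (Text value : values) {
            counts.merge(value.toString(), 1, Integer::sum);
        }
        // Format the counts as "tag=count" pairs
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (out.length() > 0) {
                out.append(", ");
            }
            out.append(entry.getKey()).append('=').append(entry.getValue());
        }
        context.write(key, new Text(out.toString()));
    }
}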

4. Job Configuration and Execution

You now need to configure and run the Hadoop MapReduce job to process the data. The configuration specifies the Mapper and Reducer classes, input and output paths, and the output format.

Here's the job configuration:

java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovieTagsJob {

    public static void main(String[] args) throws Exception {
        // Set up the configuration and the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Movie Tags Analysis");
        job.setJarByClass(MovieTagsJob.class);

        // Set the Mapper and Reducer classes
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(TagReducer.class);

        // Set the output key and value types (the map output types match
        // these, so no separate setMapOutputKeyClass/ValueClass calls are needed)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path for tags.csv
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path for results

        // Wait for the job to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

5. Run the Program

After writing the Mapper, Reducer, and Job configuration, compile the program and run it on a Hadoop cluster or in local mode. You can execute the program with the following command (assuming the JAR file is compiled):

bash
hadoop jar MovieTagsJob.jar MovieTagsJob /input/tags.csv /output
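If the JAR has not been built yet, a minimal build sketch is shown below; it compiles against the Hadoop libraries on the classpath and packages the classes (file names assumed from the classes above):

bash
# Compile the three classes against the Hadoop libraries
javac -classpath "$(hadoop classpath)" TagMapper.java TagReducer.java MovieTagsJob.java
# Package the compiled classes into a runnable JAR
jar cf MovieTagsJob.jar *.class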

Here, /input/tags.csv is the HDFS path of the input file containing the tags, and /output is the directory where the results will be stored; the output directory must not already exist, or the job will fail.
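Once the job completes, you can inspect the results directly from HDFS; the reducer output lands in files named part-r-00000 and so on, alongside a _SUCCESS marker:

bash
# List the result files and print the aggregated tags
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000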

6. Output Example

After the job runs, you should see an output file with the movie IDs and the corresponding tags. The output is tab-separated between key and value, and the tag order may vary because the Reducer uses a HashSet. It will look like this:

1	Great movie, Funny
2	Boring, Great movie
3	Action

This output shows that:

  • Movie 1 has tags "Great movie" and "Funny".

  • Movie 2 has tags "Boring" and "Great movie".

  • Movie 3 has the tag "Action".
