Write the Java MapReduce Program for Inverted Index
Now, let's write the complete Java code for the MapReduce job.
Inverted Index MapReduce Job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.HashSet;
import java.util.StringTokenizer;
import java.util.Set;

public class InvertedIndex {

    // Mapper class
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {

        private Text word = new Text();
        private Text docID = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is expected to be: docID<tab>content
            String document = value.toString();
            String[] parts = document.split("\t", 2);
            if (parts.length < 2) {
                return; // skip lines that do not match the expected format
            }
            String documentID = parts[0];
            String content = parts[1];

            // Tokenize the document content
            StringTokenizer tokenizer = new StringTokenizer(content);

            // Emit each word with the document ID
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken().toLowerCase();
                word.set(token);
                docID.set(documentID);
                context.write(word, docID);
            }
        }
    }

    // Reducer class
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Collect the document IDs in a set to avoid duplicates
            Set<String> documentIDs = new HashSet<>();
            for (Text val : values) {
                documentIDs.add(val.toString());
            }
            result.set(String.join(",", documentIDs));
            context.write(key, result);
        }
    }

    // Main driver method
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Inverted Index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setReducerClass(InvertedIndexReducer.class);

        // Set the output key and value types for the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output paths (taken from arguments)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Step 3: Explanation of the Code
- Mapper Class (InvertedIndexMapper):
  - The map() method reads each input line as a document and emits every word as a key with the document ID as the value.
  - The document content is split into words using a StringTokenizer, and each word is converted to lowercase.
  - We use context.write(word, docID) to emit the word as the key and the document ID as the value.
- Reducer Class (InvertedIndexReducer):
  - The reduce() method aggregates the document IDs for each word.
  - A HashSet is used to store the document IDs to avoid duplicates.
  - Finally, the document IDs are joined with commas and written out as the result (a small standalone sketch of this map-and-reduce flow follows this list).
- Main Method:
  - Configures the job, sets the mapper and reducer classes, and specifies the input and output paths.
  - It expects two arguments: the input directory and the output directory.
  - After setting up the job, it runs the job and waits for completion.
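To make the data flow concrete, here is a small standalone sketch (a plain Java program, not part of the MapReduce job above) that mimics the mapper's tokenization and the reducer's de-duplication on the example data from Step 5. The class name TraceInvertedIndex, the hard-coded input lines, and the use of a TreeMap to stand in for the framework's shuffle-and-sort are illustrative only:

import java.util.*;

public class TraceInvertedIndex {
    public static void main(String[] args) {
        // Same format the mapper expects: docID<tab>content
        String[] lines = {"1\tThe quick brown fox", "2\tThe lazy dog", "3\tThe quick dog"};

        // "Map" phase: emit (word, docID) pairs, grouped here by word
        // the way the framework would group them during the shuffle
        Map<String, Set<String>> index = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            StringTokenizer tokenizer = new StringTokenizer(parts[1]);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken().toLowerCase();
                index.computeIfAbsent(word, w -> new HashSet<>()).add(parts[0]);
            }
        }

        // "Reduce" phase: join the de-duplicated document IDs for each word
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            System.out.println(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
    }
}

Running this prints one line per word with its comma-separated document IDs, which is the same shape of output the real job writes to part-r-00000.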
Step 4: Running the Program
To run the program on Hadoop, follow these steps:
- Compile the Java Code:
  - First, compile the Java code using javac and create a JAR file.
  - Ensure that you include the Hadoop libraries in your classpath.

javac -classpath $(hadoop classpath) -d /path/to/output InvertedIndex.java
jar -cvf invertedindex.jar -C /path/to/output .
- Run the Program on Hadoop:
  - Once the JAR file is created, you can run the program with Hadoop's hadoop jar command (a short end-to-end command sketch follows the argument list below):

hadoop jar invertedindex.jar InvertedIndex input_directory output_directory
Where:
- input_directory is the directory containing the input text files (each line is one document in the format docID<tab>content).
- output_directory is where the output (inverted index) will be stored.
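If your cluster reads input from HDFS, you would typically copy the input file there first and inspect the result afterwards. A minimal sketch, assuming the example file is named input.txt and the directory names used above (adjust the paths to your setup; note that Hadoop refuses to run if output_directory already exists):

# copy the local input file into HDFS (directory names are just examples)
hdfs dfs -mkdir -p input_directory
hdfs dfs -put input.txt input_directory/

# run the job, then view the inverted index written by the single reducer
hadoop jar invertedindex.jar InvertedIndex input_directory output_directory
hdfs dfs -cat output_directory/part-r-00000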
Step 5: Example Input and Output
Input (in input.txt, with a tab between the document ID and its content):

1	The quick brown fox
2	The lazy dog
3	The quick dog
Output (in output directory, part-r-00000):
brown 1
dog 2,3
fox 1
lazy 2
quick 1,3
the 1,2,3
This output shows each word from the documents along with the list of document IDs in which it appears. Because a HashSet does not preserve insertion order, the document IDs within each comma-separated list may come out in a different order on your cluster.