
Sunday, 2 March 2025

BIGLAB-HBASE-1

 

Create an HBase Table for Orders

Let’s create a table named orders to store the order data. The table will have the following structure:

  • Row Key: order_id (unique identifier for the order)
  • Column Family 1: order_info
    • customer_id: Customer who made the order
    • product: The product ordered
    • quantity: Quantity of the product ordered
    • price: Price per unit of the product
    • order_date: Date when the order was placed
    • status: Status of the order (e.g., Shipped, Pending)

Create the Table:


create 'orders', 'order_info'

This creates a table named orders with a single column family order_info.
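
For reference, the same table can also be created from the Java client API. The following is a minimal sketch, assuming HBase 2.x and an hbase-site.xml on the classpath that points at your cluster; the class name CreateOrdersTable is just illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateOrdersTable {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("orders");
            if (!admin.tableExists(name)) {
                // Table 'orders' with a single column family 'order_info'
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("order_info"))
                        .build());
            }
        }
    }
}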


Step 3: Insert Data into the Table

Now, let's insert some sample data into the orders table.

Sample Orders Data:

  • Order 1: order_id = 1, customer_id = C123, product = Laptop, quantity = 1, price = 1200, order_date = 2023-03-01, status = Shipped
  • Order 2: order_id = 2, customer_id = C124, product = Phone, quantity = 2, price = 600, order_date = 2023-03-02, status = Pending
  • Order 3: order_id = 3, customer_id = C125, product = Tablet, quantity = 3, price = 400, order_date = 2023-03-02, status = Shipped

Insert Data:

You can insert the above data into HBase using the following commands in the HBase shell:

put 'orders', '1', 'order_info:customer_id', 'C123'
put 'orders', '1', 'order_info:product', 'Laptop'
put 'orders', '1', 'order_info:quantity', '1'
put 'orders', '1', 'order_info:price', '1200'
put 'orders', '1', 'order_info:order_date', '2023-03-01'
put 'orders', '1', 'order_info:status', 'Shipped'

put 'orders', '2', 'order_info:customer_id', 'C124'
put 'orders', '2', 'order_info:product', 'Phone'
put 'orders', '2', 'order_info:quantity', '2'
put 'orders', '2', 'order_info:price', '600'
put 'orders', '2', 'order_info:order_date', '2023-03-02'
put 'orders', '2', 'order_info:status', 'Pending'

put 'orders', '3', 'order_info:customer_id', 'C125'
put 'orders', '3', 'order_info:product', 'Tablet'
put 'orders', '3', 'order_info:quantity', '3'
put 'orders', '3', 'order_info:price', '400'
put 'orders', '3', 'order_info:order_date', '2023-03-02'
put 'orders', '3', 'order_info:status', 'Shipped'

Step 4: Perform Query Analysis on the Data

Once the data is inserted into HBase, we can start querying and performing some analysis. Here are some example queries you might run.

1. Query a Single Order by order_id

To retrieve all the information for the order with order_id = 1, you can use the following command:


get 'orders', '1'

This will return the data for order 1:


COLUMN                      CELL
 order_info:customer_id     timestamp=..., value=C123
 order_info:product         timestamp=..., value=Laptop
 order_info:quantity        timestamp=..., value=1
 order_info:price           timestamp=..., value=1200
 order_info:order_date      timestamp=..., value=2023-03-01
 order_info:status          timestamp=..., value=Shipped

2. Query All Orders for a Specific Product

To find all orders that contain the product "Laptop", you need to scan the table and filter on the product column. HBase does not support SQL-style queries directly, but the scan command accepts filters such as SingleColumnValueFilter:


scan 'orders', {FILTER => "SingleColumnValueFilter('order_info', 'product', =, 'binary:Laptop')"}

This command scans the orders table and filters rows where the product column contains the value "Laptop".

3. Get the Total Quantity of Products Ordered

To get the total quantity of products ordered, you can scan the table and sum the values of the quantity column:

scan 'orders'

Then, you would manually sum the quantity values in the output (in this case, 1 + 2 + 3 = 6).
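
If you prefer to compute the total programmatically instead of adding the numbers by hand, a client-side scan can do the summing. Below is a minimal sketch assuming the HBase 2.x Java client; the class name SumQuantities is illustrative and not part of the lab:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SumQuantities {
    public static void main(String[] args) throws Exception {
        byte[] cf = Bytes.toBytes("order_info");
        byte[] qty = Bytes.toBytes("quantity");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("orders"))) {
            Scan scan = new Scan();
            scan.addColumn(cf, qty);          // fetch only the quantity column
            long total = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] v = r.getValue(cf, qty);
                    if (v != null) {
                        // the shell stored the values as strings, so parse them
                        total += Long.parseLong(Bytes.toString(v));
                    }
                }
            }
            System.out.println("Total quantity ordered: " + total); // 6 for the sample data
        }
    }
}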

4. Query Orders Based on order_date Range

To query for orders placed after a certain date, you can either check the order_date field manually while scanning or apply an HBase filter during the scan. For example:


scan 'orders', {FILTER => "SingleColumnValueFilter('order_info', 'order_date', >, 'binary:2023-03-01')"}

This query would return all orders placed after 2023-03-01. The binary comparator works here because ISO-formatted dates (YYYY-MM-DD) sort correctly as strings.

5. Get Count of Orders by status (e.g., Pending, Shipped)

HBase doesn't support direct aggregation like SQL databases, but you can scan the table and count how many orders have a certain status. You would filter rows based on the status column:


scan 'orders', {FILTER => "SingleColumnValueFilter('order_info', 'status', =, 'binary:Shipped')"}

This would return all orders with the status "Shipped"; counting the rows in the output gives the number of shipped orders.
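
To turn that scan into an actual count without counting rows by eye, a small Java client sketch like the one below works (again assuming the HBase 2.x API; note that SingleColumnValueFilter passes rows that lack the column unless setFilterIfMissing(true) is set):

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CountShippedOrders {
    public static void main(String[] args) throws Exception {
        byte[] cf = Bytes.toBytes("order_info");
        byte[] status = Bytes.toBytes("status");
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                cf, status, CompareOperator.EQUAL, Bytes.toBytes("Shipped"));
        filter.setFilterIfMissing(true); // drop rows that have no status cell at all
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("orders"));
             ResultScanner scanner = table.getScanner(new Scan().setFilter(filter))) {
            long count = 0;
            for (Result r : scanner) {
                count++;
            }
            System.out.println("Shipped orders: " + count);
        }
    }
}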



BIGLAB-HBASE

Prepare sample data:

Create a tab-separated file (data.tsv) in a local directory and copy it into an HDFS directory. Each line contains a row key followed by values for the columns shown below:

 

row1     Family1:col1      Family1:col2      Family2:col1

row2     Family1:col1      Family1:col2      Family2:col1

row3     Family1:col1      Family1:col2      Family2:col1
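
For example, a data.tsv matching this layout might look like the following (hypothetical values; the columns must be separated by real tab characters):

row1    v11    v12    v21
row2    v13    v14    v22
row3    v15    v16    v23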

 

hadoop fs -mkdir data

hadoop fs -put data.tsv data

Create the HBase table:

hbase shell>

create 'my_table', 'Family1', 'Family2'

Bulk-load the file with ImportTsv (run from the regular shell, not the HBase shell):

$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns="HBASE_ROW_KEY,Family1:col1,Family1:col2,Family2:col1" \
  my_table /user/training/data/data.tsv

 ----------------------------------------------------------------------------------------------------------------

Let's assume we have a sales dataset with the following columns:

  • transaction_id (Row Key)
  • product_id (Product ID)
  • product_name (Product Name)
  • region (Region where the sale happened)
  • sales_amount (Amount of Sale)
  • sale_date (Date of Transaction)

Steps for HBase Big Data Analytics Without Spark:


Step 1: Create and Load Sample Data into HBase

1.1 Create Sales Table in HBase

In HBase shell, we first need to create a table to store the sales data.

hbase shell

create 'sales', 'details'

This creates a table called sales with a column family details.

1.2 Load Sample Sales Data into HBase

Now, let's load some sample sales data into this table using the put command in the HBase shell.


put 'sales', '1', 'details:product_id', '1001'

put 'sales', '1', 'details:product_name', 'Laptop'

put 'sales', '1', 'details:region', 'North'

put 'sales', '1', 'details:sales_amount', '1200'

put 'sales', '1', 'details:sale_date', '2025-01-15'

 

put 'sales', '2', 'details:product_id', '1002'

put 'sales', '2', 'details:product_name', 'Smartphone'

put 'sales', '2', 'details:region', 'South'

put 'sales', '2', 'details:sales_amount', '800'

put 'sales', '2', 'details:sale_date', '2025-01-16'

 

put 'sales', '3', 'details:product_id', '1003'

put 'sales', '3', 'details:product_name', 'Tablet'

put 'sales', '3', 'details:region', 'East'

put 'sales', '3', 'details:sales_amount', '500'

put 'sales', '3', 'details:sale_date', '2025-01-17'

 

put 'sales', '4', 'details:product_id', '1001'

put 'sales', '4', 'details:product_name', 'Laptop'

put 'sales', '4', 'details:region', 'West'

put 'sales', '4', 'details:sales_amount', '1100'

put 'sales', '4', 'details:sale_date', '2025-01-17'

This inserts four records, each representing a sale with product, region, and sales amount information.


Step 2: Querying Data with HBase Shell

You can retrieve or query data directly from HBase using the HBase shell.

2.1 Retrieve a Single Record

To retrieve data for a specific row, use the get command:


get 'sales', '1'

This will return the sale data for the record with row key 1.

2.2 Scan the Entire Table

To scan all the rows in the sales table, use the scan command:

scan 'sales'

This will display all the records from the sales table.

2.3 Scan with Filters

You can also apply filters to your queries. For example, to get all records where product_name is Laptop, use:

scan 'sales', {FILTER => "SingleColumnValueFilter('details', 'product_name', =, 'binary:Laptop')"}

This scans for all sales where the product_name is Laptop.
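
Since the point of this lab is analytics without Spark, here is a rough sketch of a client-side aggregation (total sales_amount per region) done with a plain scan via the HBase 2.x Java API. The class name SalesByRegion and the details are illustrative assumptions, not something the lab itself specifies:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SalesByRegion {
    public static void main(String[] args) throws Exception {
        byte[] cf = Bytes.toBytes("details");
        byte[] regionCol = Bytes.toBytes("region");
        byte[] amountCol = Bytes.toBytes("sales_amount");
        Map<String, Long> totals = new HashMap<>();
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("sales"));
             ResultScanner scanner = table.getScanner(
                     new Scan().addColumn(cf, regionCol).addColumn(cf, amountCol))) {
            for (Result r : scanner) {
                byte[] region = r.getValue(cf, regionCol);
                byte[] amount = r.getValue(cf, amountCol);
                if (region == null || amount == null) {
                    continue; // skip incomplete rows
                }
                // accumulate the sales amount for each region
                totals.merge(Bytes.toString(region), Long.parseLong(Bytes.toString(amount)), Long::sum);
            }
        }
        totals.forEach((region, total) -> System.out.println(region + "\t" + total));
    }
}

For the four sample rows this would print North 1200, South 800, East 500, and West 1100.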



-----------------------------------------------------------------

BIGLAB-MAPPER-INVERINDEX

 Write the Java MapReduce Program for Inverted Index

Now, let's write the complete Java code for the MapReduce job.

Inverted Index MapReduce Job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class InvertedIndex {

    // Mapper class: emits (word, documentID) pairs for every token in a document
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text word = new Text();
        private Text docID = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is expected to be "docID<tab>content"
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed lines with no tab-separated content
            }
            String documentID = parts[0];
            String content = parts[1];

            // Tokenize the document content and emit each word with the document ID
            StringTokenizer tokenizer = new StringTokenizer(content);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken().toLowerCase());
                docID.set(documentID);
                context.write(word, docID);
            }
        }
    }

    // Reducer class: collects the distinct document IDs for each word
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Set<String> documentIDs = new HashSet<>();
            for (Text val : values) {
                documentIDs.add(val.toString());
            }
            result.set(String.join(",", documentIDs));
            context.write(key, result);
        }
    }

    // Main driver method
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Inverted Index");

        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setReducerClass(InvertedIndexReducer.class);

        // Set the output key and value types for the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output paths (taken from arguments)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Wait for the job to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Step 3: Explanation of the Code

  1. Mapper Class (InvertedIndexMapper):
    • The map() method is where the document is read, and words are emitted as keys with the document ID as the value.
    • The document is split into words using a StringTokenizer, and each word is converted to lowercase.
    • We use context.write(word, documentID) to emit the word as the key and the document ID as the value.
  2. Reducer Class (InvertedIndexReducer):
    • The reduce() method aggregates the document IDs for each word.
    • A HashSet is used to store document IDs to avoid duplicates.
    • Finally, the document IDs are joined with commas and written out as the result.
  3. Main Method:
    • Configures the job, sets the mapper and reducer classes, and specifies the input and output paths.
    • It expects two arguments: the input directory and the output directory.
    • After setting up the job, it runs the job and waits for completion.

Step 4: Running the Program

To run the program on Hadoop, follow these steps:

  1. Compile the Java Code:
    • First, compile the Java code using javac and create a JAR file.
    • Ensure that you include the Hadoop libraries in your classpath.

javac -classpath $(hadoop classpath) -d /path/to/output InvertedIndex.java

jar -cvf invertedindex.jar -C /path/to/output .

  2. Run the Program on Hadoop:
    • Once the JAR file is created, run the job with the hadoop jar command:

hadoop jar invertedindex.jar InvertedIndex input_directory output_directory

Where:

  • input_directory is the directory containing the input text files (where each line is a document, and the format is docID<tab>content).
  • output_directory is where the output (inverted index) will be stored.

Step 5: Example Input and Output

Input (in input.txt):


1    The quick brown fox

2    The lazy dog

3    The quick dog

Output (in output directory, part-r-00000):


brown   1

dog     2,3

fox     1

lazy    2

quick   1,3

the     1,2,3

This output shows each word from the documents along with the list of document IDs where the word appears.

 
