Hadoop learning pot: HIVE BDA LAB

1. Create Database in Hive

The CREATE DATABASE command in Hive is used to create a new database. By default, Hive creates databases under the /user/hive/warehouse directory on HDFS, unless you specify a different location.

Example: Creating a Simple Database

sql
-- Create a database called 'company_db'
CREATE DATABASE company_db;

This creates a database named company_db.

Example: Creating a Database with a Specific Location

You can specify a custom location on HDFS for the database to store its data.

sql
-- Create a database with a custom location
CREATE DATABASE company_db
LOCATION '/user/hadoop/company_db_location';

This will create a database company_db and store the data at the specified HDFS path /user/hadoop/company_db_location.

2. Alter Database in Hive

You can alter a database in Hive using the ALTER DATABASE command. Some common alterations include changing the database's location or setting specific properties for the database.

Example: Altering the Location of a Database

You can change the location of a database after it has been created.

sql
-- Alter the database location
ALTER DATABASE company_db SET LOCATION '/user/hadoop/new_location/';

This changes the location of the company_db database to /user/hadoop/new_location/ in HDFS.

Example: Altering the Properties of a Database

You can also modify database properties using the SET option.

sql
-- Alter database properties
ALTER DATABASE company_db SET DBPROPERTIES ('owner'='John Doe', 'created_on'='2025-03-01');

This command sets custom properties for the company_db database, such as the owner and creation date.

3. Drop Database in Hive

You can drop (delete) a database using the DROP DATABASE command. This operation will remove the database and, depending on whether you use the CASCADE option, it will also delete any tables, views, and data associated with it.

Example: Dropping a Database

To drop a database without removing its tables and views:

sql
-- Drop a database without cascading
DROP DATABASE company_db;

If the database has any tables or views, this command will fail.

Example: Dropping a Database with CASCADE

To drop a database and all its associated tables and views, you must use the CASCADE option:

sql
-- Drop a database and its tables/views
DROP DATABASE company_db CASCADE;

This command deletes the company_db database along with any tables, views, and data inside it.

4. Show All Databases in Hive

To view all databases in Hive, you can use the SHOW DATABASES command:

sql
-- List all databases in Hive
SHOW DATABASES;

This will return a list of all databases in your Hive instance.

Example Scenario: Full Script with Operations

Let's now combine all these operations into a script to demonstrate the process.

sql
-- Step 1: Create a database named 'company_db'
CREATE DATABASE company_db;

-- Step 2: Switch to the new database
USE company_db;

-- Step 3: Create a table in 'company_db'
CREATE TABLE employees (
  emp_id INT,
  emp_name STRING,
  emp_age INT,
  emp_city STRING
);

-- Step 4: Alter the database properties of 'company_db'
ALTER DATABASE company_db SET DBPROPERTIES ('owner'='Alice', 'created_on'='2025-03-23');

-- Step 5: Alter the database location of 'company_db'
ALTER DATABASE company_db SET LOCATION '/user/hadoop/company_db_location';

-- Step 6: Show all databases to confirm 'company_db' exists
SHOW DATABASES;

-- Step 7: Drop the 'company_db' database along with its tables/views
DROP DATABASE company_db CASCADE;

Explanation of the Script:

Create Database: We created a database company_db.
Use Database: We switched to company_db using the USE command.
Create Table: We created a table employees in the company_db database.
Alter Database Properties: We set custom properties (owner, created_on) for the database.
Alter Database Location: We changed the location of the company_db database.
Show Databases: We listed all databases to check if company_db exists.
Drop Database: We dropped the company_db database and all its tables and views using the CASCADE option.

how to work with views, functions, and indexes in Hive, along with suitable examples for each operation.

1. Hive Views

A view in Hive is a virtual table based on the result of a query. Views do not store data physically, but they provide an abstraction layer over the data in underlying tables. You can create a view, query it like a table, and drop it when needed.

Create a View

You can create a view using the CREATE VIEW command. A view is created based on the result of a SELECT query.

Example: Creating a View to Filter Employees Over 30 Years Old

sql
-- Create a view named 'over_30_employees' which filters employees whose age is greater than 30
CREATE VIEW over_30_employees AS
SELECT emp_id, emp_name, emp_age, emp_city
FROM employees
WHERE emp_age > 30;

In this example:

A view over_30_employees is created that contains the details of employees older than 30 years.
The view stores the query, not the data itself.

Query the View

Once the view is created, you can query it like a regular table.

sql
-- Query the view to get data
SELECT * FROM over_30_employees;

Drop the View

To drop the view (delete the virtual table):

sql
-- Drop the view if it exists
DROP VIEW IF EXISTS over_30_employees;

2. Hive Functions

Hive allows you to use built-in functions to perform operations on your data, such as aggregation, string operations, date functions, etc. You can also create custom UDFs (User Defined Functions) to extend the functionality of Hive.

Built-in Functions

Here are some examples of commonly used built-in functions:

COUNT: Counts the number of rows.
```
sql
SELECT COUNT(*) FROM employees;
```

CONCAT: Concatenates two or more strings.

sql
SELECT CONCAT(emp_name, ' lives in ', emp_city) FROM employees;

YEAR: Extracts the year from a date.

sql
SELECT YEAR(current_date()) FROM employees;

AVG: Computes the average of a numeric column.

sql
SELECT AVG(emp_age) FROM employees;

Creating Custom UDFs (User Defined Functions)

If you want to create custom functionality, you can write your own functions in Java and register them in Hive.

Steps to create a UDF:

Write a Java UDF.
Compile it into a .jar file.
Register the UDF in Hive.

Here is a simple example of a custom UDF that takes a string and returns its length.

Java UDF Example (MyLengthUDF.java):

java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyLengthUDF extends UDF {
    public int evaluate(Text str) {
        if (str == null) {
            return 0;
        }
        return str.toString().length();
    }
}

After compiling and creating a .jar file, you can add the UDF to Hive.

Register the UDF in Hive

sql
-- Add the JAR containing your custom UDF
ADD JAR /path/to/my-udf.jar;

-- Register the function
CREATE FUNCTION mylength AS 'com.example.hive.udf.MyLengthUDF';

Use the Custom Function

Once registered, you can use the custom function just like built-in functions.

sql
-- Use the custom function in a query
SELECT emp_name, mylength(emp_name) FROM employees;

Drop the Function

To drop a function:

sql
-- Drop the custom function
DROP FUNCTION IF EXISTS mylength;

3. Hive Indexes

Indexes in Hive are used to optimize query performance by allowing faster retrieval of data from large tables. Indexes are not as commonly used in Hive as in traditional relational databases, but they can help in certain scenarios.

Create an Index

You can create an index on a table to speed up lookups on specific columns. Here’s an example of creating an index on the emp_city column in the employees table.

sql
-- Create an index on the 'emp_city' column of the 'employees' table
CREATE INDEX emp_city_index
ON TABLE employees (emp_city)
AS 'COMPACT'
WITH DEFERRED REBUILD;

In this example:

The index emp_city_index is created on the emp_city column.
The COMPACT indexing method is used, which is a basic form of indexing in Hive.
WITH DEFERRED REBUILD means that the index is not built immediately, and you will need to rebuild it later.

Rebuild the Index

If the index is deferred, you can rebuild it with the REBUILD command.

sql
-- Rebuild the index after it has been created
ALTER INDEX emp_city_index ON employees REBUILD;

Drop an Index

To drop (delete) an index:

sql
-- Drop the index
DROP INDEX IF EXISTS emp_city_index ON employees;

Full Example Scenario with Views, Functions, and Indexes

Let’s now walk through a complete scenario where we use views, functions, and indexes.

Step 1: Create a Table

sql
-- Create a table 'employees' with some sample data
CREATE TABLE employees (
  emp_id INT,
  emp_name STRING,
  emp_age INT,
  emp_city STRING
);

-- Insert sample data into the table
INSERT INTO employees VALUES
(1, 'John Doe', 29, 'New York'),
(2, 'Jane Smith', 35, 'California'),
(3, 'Charlie Brown', 30, 'Texas'),
(4, 'Alice Johnson', 27, 'California');

Step 2: Create a View

sql
-- Create a view to filter employees whose age is greater than 30
CREATE VIEW over_30_employees AS
SELECT emp_id, emp_name, emp_age, emp_city
FROM employees
WHERE emp_age > 30;

Step 3: Query the View

sql
-- Query the view
SELECT * FROM over_30_employees;

Step 4: Create a Custom Function (UDF)

Let’s assume you have a UDF called mylength to calculate the length of the employee's name.

sql
-- Register the UDF
ADD JAR /path/to/my-udf.jar;
CREATE FUNCTION mylength AS 'com.example.hive.udf.MyLengthUDF';

-- Use the UDF in a query
SELECT emp_name, mylength(emp_name) FROM employees;

Step 5: Create an Index on the `emp_city` Column

sql
-- Create an index on the 'emp_city' column
CREATE INDEX emp_city_index
ON TABLE employees (emp_city)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Rebuild the index
ALTER INDEX emp_city_index ON employees REBUILD;

Step 6: Drop the Index, View, and Function

sql
-- Drop the index
DROP INDEX IF EXISTS emp_city_index ON employees;

-- Drop the view
DROP VIEW IF EXISTS over_30_employees;

-- Drop the function
DROP FUNCTION IF EXISTS mylength;

Conclusion

In this tutorial, we’ve walked through how to use views, functions (UDFs), and indexes in Hive to enhance your data management and querying capabilities. Here’s a quick summary:

Views are used to simplify complex queries and act as virtual tables.
Functions include both built-in functions and user-defined functions (UDFs) that you can write in Java and register in Hive.
Indexes help speed up queries on large datasets by creating an index on a specific column.

Search This Blog

Sunday, 23 March 2025

HIVE BDA LAB