
Tuesday, 3 March 2026

AI & DS HUE EXPERIMENT

 

✅ STEP 1 — Create Sample Dataset (On Linux)

Create emp1.csv:

vi emp1.csv

Paste:

employee_id,name,department,salary
1,John Doe,Engineering,70000
2,Jane Smith,Marketing,60000
3,Jim Brown,Engineering,80000
4,Jake White,Sales,50000
5,Emily Davis,Marketing,75000

Save and exit.


Start Hue and open it in a browser:

http://localhost:8088

Log in with username training and password training.

✅ STEP 2 — Upload File to HDFS

Option A (Using Command Line)

hadoop fs -mkdir -p /user/hue
hadoop fs -put emp1.csv /user/hue/
hadoop fs -ls /user/hue

You should see emp1.csv listed.


Option B (Using Hue UI)

  1. Open Hue

  2. Go to File Browser

  3. Click Upload

  4. Select emp1.csv

  5. Upload to:

/user/hue/

✅ STEP 3 — Create Hive Table in Hue

Go to:

Hue → Query Editors → Hive


CREATE TABLE emp1 (
employee_id INT,
name STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Then load data:

LOAD DATA INPATH '/user/hue/emp1.csv'
INTO TABLE emp1;

✅ STEP 4 — Verify Data

Run:

SELECT * FROM emp1;

You should see 5 rows.


✅ STEP 5 — Analysis Queries

1️⃣ Count Employees by Department

SELECT department, COUNT(*) AS employee_count
FROM emp1
GROUP BY department;



2️⃣ Average Salary by Department

SELECT department, AVG(salary) AS average_salary
FROM emp1
GROUP BY department;

3️⃣ Salary Greater Than 60000

SELECT name, salary
FROM emp1
WHERE salary > 60000;
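The three analysis queries above can be sanity-checked in plain Python against the five sample rows. This is just a sketch of the same logic, not Hive itself:

```python
from collections import defaultdict

# The five rows from emp1.csv
rows = [
    (1, "John Doe", "Engineering", 70000.0),
    (2, "Jane Smith", "Marketing", 60000.0),
    (3, "Jim Brown", "Engineering", 80000.0),
    (4, "Jake White", "Sales", 50000.0),
    (5, "Emily Davis", "Marketing", 75000.0),
]

# COUNT(*) ... GROUP BY department
counts = defaultdict(int)
for _, _, dept, _ in rows:
    counts[dept] += 1

# AVG(salary) ... GROUP BY department
salaries = defaultdict(list)
for _, _, dept, sal in rows:
    salaries[dept].append(sal)
averages = {d: sum(s) / len(s) for d, s in salaries.items()}

# WHERE salary > 60000
high_earners = [(name, sal) for _, name, _, sal in rows if sal > 60000]

print(counts)       # Engineering: 2, Marketing: 2, Sales: 1
print(averages)     # Engineering: 75000.0, Marketing: 67500.0, Sales: 50000.0
print(high_earners) # John Doe, Jim Brown, Emily Davis
```

If Hue returns different numbers, check that emp1.csv was loaded exactly once.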

✅ STEP 6 — Generate Reports

After running query in Hue:

  1. Results appear below

  2. Click Download

  3. Export as:

    • CSV

    • Excel

This becomes your report.


✅ STEP 7 — Create Dashboard in Hue

  1. Go to Dashboards

  2. Click New Dashboard

  3. Add a widget

  4. Select your saved Hive query

  5. Choose:

    • Bar Chart → Employee count by department

    • Pie Chart → Salary distribution

Save dashboard.

Now you have a visual report.


✅ OPTIONAL — Using Impala (Faster Queries)

In Hue:

Query Editors → Impala

Run the same queries; Impala typically returns results faster than Hive.

AI & DS MONGODB EXPERIMENT IN CLOUDERA

sudo rm -rf /usr/local/mongodb

wget http://fastdl.mongodb.org/linux/mongodb-linux-i686-2.6.12.tgz

tar -xvzf mongodb-linux-i686-2.6.12.tgz

sudo mv mongodb-linux-i686-2.6.12 /usr/local/mongodb


Verify the downloaded binary:

file /usr/local/mongodb/bin/mongo

Start MongoDB Server

First create data directory (only first time):

sudo mkdir -p /data/db
sudo chmod -R 777 /data



/usr/local/mongodb/bin/mongod --dbpath /data/db


You will see logs like:

waiting for connections on port 27017


Open Mongo Shell (New Terminal)

Open a new terminal window and run:

/usr/local/mongodb/bin/mongo

You will see:

>
> show dbs
> use trainingdb

db.students.insert([
  {name:"Ravi", age:22, course:"Hadoop", marks:75},
  {name:"Anita", age:21, course:"MongoDB", marks:85},
  {name:"Suresh", age:23, course:"Hadoop", marks:65},
  {name:"Priya", age:22, course:"Spark", marks:90},
  {name:"Kiran", age:24, course:"MongoDB", marks:70},
  {name:"Meena", age:21, course:"Hadoop", marks:88},
  {name:"Arun", age:23, course:"Spark", marks:60}
])



COUNT Function

Count total documents:

db.students.count()

Count with condition:

db.students.count({course:"Hadoop"})

 SORT Function

Sort by marks ascending:

db.students.find().sort({marks:1})

Sort by marks descending:

db.students.find().sort({marks:-1})

Sort by age:

db.students.find().sort({age:1})

 LIMIT Function

First 3 students (in insertion order):

db.students.find().limit(3)

Top 3 highest marks:

db.students.find().sort({marks:-1}).limit(3)

 SKIP Function

Skip first 2 records:

db.students.find().skip(2)

Pagination example (skip 2, show next 3):

db.students.find().skip(2).limit(3)
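The skip/limit chain above is ordinary pagination. A minimal Python sketch of the same semantics on a plain list (the `paginate` helper is hypothetical, just for illustration):

```python
# The seven student names, in insertion order.
students = ["Ravi", "Anita", "Suresh", "Priya", "Kiran", "Meena", "Arun"]

def paginate(docs, skip=0, limit=None):
    """Mimic find().skip(n).limit(m): drop `skip` docs, keep at most `limit`."""
    page = docs[skip:]
    return page[:limit] if limit is not None else page

print(paginate(students, skip=2))            # drops Ravi and Anita
print(paginate(students, skip=2, limit=3))   # Suresh, Priya, Kiran
```

Page N of size M is therefore `skip((N - 1) * M).limit(M)`.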

 AGGREGATE Function (Important)

Aggregation framework example.


🔹 Total number of students per course

db.students.aggregate([
{ $group: { _id: "$course", totalStudents: { $sum: 1 } } }
])

🔹 Average marks per course

db.students.aggregate([
{ $group: { _id: "$course", avgMarks: { $avg: "$marks" } } }
])

🔹 Maximum marks per course

db.students.aggregate([
{ $group: { _id: "$course", maxMarks: { $max: "$marks" } } }
])

🔹 Minimum marks per course

db.students.aggregate([
{ $group: { _id: "$course", minMarks: { $min: "$marks" } } }
])

🔹 Sort after aggregation

db.students.aggregate([
{ $group: { _id: "$course", avgMarks: { $avg: "$marks" } } },
{ $sort: { avgMarks: -1 } }
])
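What `$group`/`$avg`/`$sort` compute for the seven inserted documents can be checked with a small Python sketch of the pipeline logic:

```python
from collections import defaultdict

# The seven student documents inserted earlier (only relevant fields).
students = [
    {"name": "Ravi",   "course": "Hadoop",  "marks": 75},
    {"name": "Anita",  "course": "MongoDB", "marks": 85},
    {"name": "Suresh", "course": "Hadoop",  "marks": 65},
    {"name": "Priya",  "course": "Spark",   "marks": 90},
    {"name": "Kiran",  "course": "MongoDB", "marks": 70},
    {"name": "Meena",  "course": "Hadoop",  "marks": 88},
    {"name": "Arun",   "course": "Spark",   "marks": 60},
]

# $group: { _id: "$course", avgMarks: { $avg: "$marks" } }
marks = defaultdict(list)
for s in students:
    marks[s["course"]].append(s["marks"])
avg_marks = {course: sum(m) / len(m) for course, m in marks.items()}

# $sort: { avgMarks: -1 }
ranked = sorted(avg_marks.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # MongoDB (77.5), Hadoop (76.0), Spark (75.0)
```

So the final aggregation should list MongoDB first, then Hadoop, then Spark.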

STOP MONGO DB

If you started MongoDB like this:

/usr/local/mongodb/bin/mongod --dbpath /data/db

Then simply go to that terminal and press:

CTRL + C


AI & DS PIG EXPERIMENT

First, create these two files in your terminal using the cat command (type the lines, then press Ctrl+D to save):


cat > employees.csv
1,John Doe,101,50000
2,Jane Smith,102,60000
3,Jim Brown,101,55000
4,Jake White,103,70000

cat > departments.csv
101,HR
102,Engineering
103,Marketing

(No spaces after the commas — PigStorage keeps them as part of the field, and leading spaces make the int and float casts return null.)



Upload the files to HDFS:

>hadoop fs -put employees.csv emp.csv

>hadoop fs -put departments.csv dept.csv






> vi pig1.pig


employees = LOAD 'emp.csv' USING PigStorage(',')
    AS (employee_id:int, name:chararray, department_id:int, salary:float);
dump employees;


> pig pig1.pig


> vi pig2.pig


departments = LOAD 'dept.csv' USING PigStorage(',')
    AS (department_id:int, department_name:chararray);
dump departments;


>pig pig2.pig



>vi pig3.pig

employees = LOAD 'emp.csv' USING PigStorage(',')
    AS (employee_id:int, name:chararray, department_id:int, salary:float);
high_salary_employees = FILTER employees BY salary > 55000;
dump high_salary_employees;


>pig pig3.pig

Try these statements. Append them to pig3.pig (the JOIN also needs the departments relation from pig2.pig loaded first):

-- Project only the name and salary of high salary employees
projected_employees = FOREACH high_salary_employees GENERATE name, salary;

-- Group employees by department_id
grouped_employees = GROUP employees BY department_id;

-- Join employees with departments to get department names
joined_data = JOIN employees BY department_id, departments BY department_id;

-- Sort employees by salary in descending order
sorted_employees = ORDER employees BY salary DESC;

-- Store the results of the sorted employees
STORE sorted_employees INTO 'sorted_employees_output' USING PigStorage(',');
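The FILTER, JOIN, and ORDER operations can be mimicked in plain Python on the four sample employees to check what Pig should print. This is only a sketch of the relational logic; Pig executes it as MapReduce jobs:

```python
# Sample data from employees.csv and departments.csv.
employees = [
    (1, "John Doe", 101, 50000.0),
    (2, "Jane Smith", 102, 60000.0),
    (3, "Jim Brown", 101, 55000.0),
    (4, "Jake White", 103, 70000.0),
]
departments = {101: "HR", 102: "Engineering", 103: "Marketing"}

# FILTER employees BY salary > 55000
high_salary = [e for e in employees if e[3] > 55000]

# JOIN employees BY department_id, departments BY department_id
joined = [(eid, name, dept_id, sal, departments[dept_id])
          for eid, name, dept_id, sal in employees if dept_id in departments]

# ORDER employees BY salary DESC
ordered = sorted(employees, key=lambda e: e[3], reverse=True)

print([e[1] for e in high_salary])  # Jane Smith, Jake White
print([e[1] for e in ordered])      # Jake White, Jane Smith, Jim Brown, John Doe
```

Note the strict `>` in the filter: Jim Brown at exactly 55000 is excluded.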


AI & DS HIVE EXPERIMENT

Open a terminal:


>hive


>CREATE DATABASE IF NOT EXISTS company_db;


USE company_db;

 

        


open new terminal


> vi  hivescript1.ql


use company_db;

CREATE TABLE IF NOT EXISTS employees (
    employee_id INT,
    name STRING,
    department_id INT,
    salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

 

>  hive -S -f hivescript1.ql


> vi  hivescript2.ql


use company_db;

CREATE TABLE IF NOT EXISTS departments (
    department_id INT,
    department_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


>  hive -S -f hivescript2.ql


cat > emp.txt

1,John,101,50000
2,Anita,101,45000
3,Ravi,102,65000
4,Meena,103,55000
5,Kiran,104,40000


cat > department.txt

101,hr
102,mark
103,fin


Go back to the Hive shell:


hive> load data local inpath 'emp.txt' into table employees;

hive> load data local inpath 'department.txt' into table departments;


-- Add a new column to the employees table
ALTER TABLE employees ADD COLUMNS (hire_date DATE);

-- Create a view to show high salary employees
CREATE VIEW IF NOT EXISTS high_salary_employees AS
SELECT name, salary
FROM employees
WHERE salary > 60000;

-- Create an index on the employees table for the department_id column
CREATE INDEX idx_department_id
ON TABLE employees (department_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

 

-- Drop the index (before dropping the table it is built on)
DROP INDEX IF EXISTS idx_department_id ON TABLE employees;

-- Drop the view
DROP VIEW IF EXISTS high_salary_employees;

-- Drop the employees table
DROP TABLE IF EXISTS employees;

-- Drop the departments table
DROP TABLE IF EXISTS departments;

-- Drop the database
DROP DATABASE IF EXISTS company_db CASCADE;


Monday, 2 March 2026

Load sample data into Hive

 

Step 1: Create Sample Delimited Text File

We’ll create a simple employee dataset.

📄 Create a text file in Linux

vi employee.txt

Add sample data (Comma Delimited)

1,John,IT,50000
2,Anita,HR,45000
3,Ravi,Finance,60000
4,Meena,IT,55000
5,Kiran,Sales,40000

Save and exit.


✅ Step 2: Move File to HDFS

Create HDFS directory

hadoop fs -mkdir -p /user/hive/employee_data

Upload file to HDFS

hadoop fs -put employee.txt /user/hive/employee_data/

Verify

hadoop fs -ls /user/hive/employee_data

✅ Step 3: Start Hive

hive

✅ Step 4: Create Hive Database (Optional)

CREATE DATABASE company;
USE company;

✅ Step 5: Create Hive Table (Row Formatted Delimited)

CREATE TABLE employee (
id INT,
name STRING,
department STRING,
salary INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

✅ Step 6: Load Data from HDFS into Hive Table

LOAD DATA INPATH '/user/hive/employee_data/employee.txt'
INTO TABLE employee;

✅ Step 7: Verify Data

SELECT * FROM employee;
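Conceptually, the table definition tells Hive how to deserialize each line of employee.txt: split on the delimiter and cast fields to the declared column types. A minimal Python sketch of that split-and-cast behaviour (the `parse_row` helper is illustrative, not part of Hive):

```python
# Contents of employee.txt as loaded into the table.
sample = """1,John,IT,50000
2,Anita,HR,45000
3,Ravi,Finance,60000
4,Meena,IT,55000
5,Kiran,Sales,40000"""

def parse_row(line):
    """Split on ',' and cast to (INT, STRING, STRING, INT), like the table schema."""
    emp_id, name, department, salary = line.split(",")
    return int(emp_id), name, department, int(salary)

rows = [parse_row(line) for line in sample.splitlines()]
print(len(rows))  # 5 rows, matching SELECT * FROM employee
```

If a line has the wrong number of fields or a non-numeric value where an INT is declared, Hive shows NULL for that column rather than failing the query.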

hive lab manual

 

Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

 

-- Create a new database
CREATE DATABASE IF NOT EXISTS company_db;

-- Switch to the newly created database
USE company_db;

-- Create an employees table
CREATE TABLE IF NOT EXISTS employees (
    employee_id INT,
    name STRING,
    department_id INT,
    salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Create a departments table
CREATE TABLE IF NOT EXISTS departments (
    department_id INT,
    department_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

 

-- Add a new column to the employees table
ALTER TABLE employees ADD COLUMNS (hire_date DATE);

-- Create a view to show high salary employees
CREATE VIEW IF NOT EXISTS high_salary_employees AS
SELECT name, salary
FROM employees
WHERE salary > 60000;

-- Create an index on the employees table for the department_id column
CREATE INDEX idx_department_id
ON TABLE employees (department_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

 

-- Drop the index (before dropping the table it is built on)
DROP INDEX IF EXISTS idx_department_id ON TABLE employees;

-- Drop the view
DROP VIEW IF EXISTS high_salary_employees;

-- Drop the employees table
DROP TABLE IF EXISTS employees;

-- Drop the departments table
DROP TABLE IF EXISTS departments;

-- Drop the database
DROP DATABASE IF EXISTS company_db CASCADE;

 

Notes

    • Make sure to replace the paths and class names with the actual paths and class names for your UDFs.

    • The CASCADE option in the DROP DATABASE command will drop all tables and views in the database.

    • You can run these commands in a Hive shell or through a Hive client that supports HiveQL.

 

 

 
