
Monday, 2 March 2026

Hive Installation in Hadoop

 

STEP 1 — Download Hive

Go to home directory:

cd ~

Download Hive (example: 3.1.3 – stable for Hadoop 3.x):

wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

Extract:

tar -xvzf apache-hive-3.1.3-bin.tar.gz
sudo mv apache-hive-3.1.3-bin /usr/local/hive
sudo chown -R $USER:$USER /usr/local/hive

STEP 2 — Set Hive Environment Variables

Open:

nano ~/.bashrc

Add at bottom:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply:

source ~/.bashrc

Check:

hive --version

 STEP 3 — Configure Hive

Go to Hive config directory:

cd /usr/local/hive/conf

Copy template:

cp hive-default.xml.template hive-site.xml

 STEP 4 — Configure Hive Metastore (Derby – Simple Mode)

Since this is a lab setup, we use the embedded Derby database (no MySQL needed).

Edit:

nano hive-site.xml

Inside <configuration> add:

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/ubuntu/metastore_db;create=true</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>

⚠ Replace ubuntu if your username is different.

Save & Exit.
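The template-derived hive-site.xml is several thousand lines long, and a stray character while editing can make it unparseable. As an optional sanity check, a short Python snippet can confirm the file is still well-formed XML. This is a minimal sketch; the file name check_hive_site.py is just a suggestion, and the path assumes Hive was installed to /usr/local/hive as in Step 1.

# check_hive_site.py - verify the edited hive-site.xml is still well-formed XML
# (path assumes Hive was installed to /usr/local/hive as in Step 1)
import xml.etree.ElementTree as ET

path = "/usr/local/hive/conf/hive-site.xml"
try:
    root = ET.parse(path).getroot()
    props = [p.findtext("name") for p in root.findall("property")]
    print(f"Parsed OK: {len(props)} properties defined")
except ET.ParseError as e:
    print(f"XML error in {path}: {e}")

Run it with: python3 check_hive_site.py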


 STEP 5 — Create Hive Warehouse Directory in HDFS

Start Hadoop if not running:

start-dfs.sh
start-yarn.sh

Now create warehouse folder:

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive

 STEP 6 — Initialize Hive Metastore

Go to Hive home:

cd /usr/local/hive

Initialize schema:

schematool -dbType derby -initSchema

You should see:

Initialization completed successfully

 STEP 7 — Start Hive

Simply run:

hive

You should see:

hive>

If yes → Hive installed successfully 🎉


 STEP 8 — Create Database

Inside Hive:

CREATE DATABASE mydb;

Check:

SHOW DATABASES;

You should see:

default
mydb

 STEP 9 — Use Database

USE mydb;

 STEP 10 — Create Table

CREATE TABLE student (
id INT,
name STRING,
marks INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Check tables:

SHOW TABLES;

 STEP 11 — Insert Data

Create sample file in Ubuntu:

echo "1,John,85" > student.txt
echo "2,Alice,90" >> student.txt
echo "3,Bob,78" >> student.txt
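If you want a few more rows to play with, a small helper script can generate data in the same id,name,marks format. The script name generate_students.py, the names, and the mark range are made up for illustration; note it overwrites student.txt.

# generate_students.py - optional helper, overwrites student.txt with sample rows
# (names and mark range below are made up for illustration)
import csv
import random

names = ["John", "Alice", "Bob", "Carol", "David", "Eve"]

with open("student.txt", "w", newline="") as f:
    writer = csv.writer(f)
    for i, name in enumerate(names, start=1):
        writer.writerow([i, name, random.randint(60, 100)])

Run it with: python3 generate_students.py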

Upload to HDFS:

hdfs dfs -mkdir /input
hdfs dfs -put student.txt /input/

Back in Hive:

LOAD DATA INPATH '/input/student.txt' INTO TABLE student;

 STEP 12 — Query Table

SELECT * FROM student;

Output:

1 John 85
2 Alice 90
3 Bob 78

 STEP 13 — Simple Query Example

SELECT name, marks FROM student WHERE marks > 80;
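You can also run the same query from Python instead of the hive> prompt. The sketch below is one way to do it and is not part of the CLI setup above: it assumes you have started HiveServer2 (hive --service hiveserver2) and installed the pyhive package (pip install 'pyhive[hive]'); the script name query_student.py and the username are placeholders.

# query_student.py - minimal PyHive sketch (assumes HiveServer2 on localhost:10000)
from pyhive import hive

# Connect to the mydb database created earlier; adjust username to your Linux user
conn = hive.Connection(host="localhost", port=10000, username="ubuntu", database="mydb")
cursor = conn.cursor()

cursor.execute("SELECT name, marks FROM student WHERE marks > 80")
for name, marks in cursor.fetchall():
    print(name, marks)

conn.close()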

Sunday, 1 March 2026

Hadoop Installation in Ubuntu

STEP 1 — Update the System

sudo apt update
sudo apt upgrade -y

STEP 2 — Install Java

sudo apt install openjdk-11-jdk -y

Check:

java -version

STEP 3 — Set JAVA_HOME

Open:

nano ~/.bashrc

Add at bottom:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

Apply:

source ~/.bashrc



STEP 4 — Install and Configure SSH

sudo apt install ssh -y

Generate a key and authorize it for passwordless login:

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Test (it should log in without asking for a password):

ssh localhost


(Optional) You can also bring the whole system fully up to date. Note that do-release-upgrade moves you to a newer Ubuntu release and is not required for Hadoop:

sudo apt dist-upgrade -y
sudo do-release-upgrade

STEP 5 — Download and Extract Hadoop

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
sudo chown -R $USER:$USER /usr/local/hadoop


STEP 6 — Set Hadoop Environment Variables

Open:

nano ~/.bashrc

Add at bottom:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply:

source ~/.bashrc

Check:

hadoop version

STEP 7 — Set JAVA_HOME in hadoop-env.sh

nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Find the line:

# export JAVA_HOME=

Replace it with:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

STEP 8 — Configure Hadoop XML Files

Go to the Hadoop config directory:

cd /usr/local/hadoop/etc/hadoop

🔹 core-site.xml

nano core-site.xml

Inside <configuration> add:

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

🔹 hdfs-site.xml

nano hdfs-site.xml

IMPORTANT: Use full path (NO $USER)

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/ubuntu/hdfs/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/ubuntu/hdfs/datanode</value>
</property>

⚠️ Replace ubuntu if your username is different.


🔹 mapred-site.xml

In Hadoop 3.x, mapred-site.xml already exists (no .template copy is needed):

nano mapred-site.xml

Add:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

🔹 yarn-site.xml

nano yarn-site.xml

Add:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

STEP 9 — Create Hadoop Directories (Correct Location)

mkdir -p /home/ubuntu/hdfs/namenode
mkdir -p /home/ubuntu/hdfs/datanode

Do NOT create these under /root.
Use the same literal path you put in hdfs-site.xml (replace ubuntu with your username; do not rely on $USER there).


STEP 10 — Format NameNode

hdfs namenode -format

You must see:

Storage directory ... has been successfully formatted

STEP 11 — Start Hadoop

start-dfs.sh
start-yarn.sh

STEP 12 — Verify Services

jps

You should see:

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps

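Besides jps, you can confirm the NameNode web UI is up (Hadoop 3.x serves it on port 9870 by default). Here is a small Python sketch using only the standard library; the script name check_namenode_ui.py is just a suggestion.

# check_namenode_ui.py - ping the NameNode web UI (default port 9870 in Hadoop 3.x)
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:9870", timeout=5) as resp:
        print("NameNode UI reachable, HTTP status:", resp.status)
except OSError as e:
    print("NameNode UI not reachable:", e)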
Word Count on Ubuntu with Hadoop and Spark

 

PART 1: Word Count in Hadoop (Python – Hadoop Streaming)


 Step 1: Create Input File

nano input.txt

Example content:

hello world
hello hadoop
big data hadoop
hello spark

Step 2: Create Mapper

nano mapper.py

 mapper.py

#!/usr/bin/env python3
import sys

# Emit "word<TAB>1" for every word on every line of standard input
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")

Make executable:

chmod +x mapper.py

 Step 3: Create Reducer

nano reducer.py

 reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Input arrives sorted by word, so identical words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Flush the final word
if current_word:
    print(f"{current_word}\t{current_count}")

Make executable:

chmod +x reducer.py

🧪 Test Locally

cat input.txt | ./mapper.py | sort | ./reducer.py
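The sort in the middle of the pipeline stands in for the MapReduce shuffle, which delivers mapper output to the reducer sorted by key; that is what makes the reducer's "adjacent equal words" logic work. With the sample input.txt above you should see:

big	1
data	1
hadoop	2
hello	3
spark	1
world	1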

 Run in Hadoop

Upload to HDFS

hdfs dfs -mkdir /wordcount
hdfs dfs -put input.txt /wordcount

Run Streaming Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input /wordcount/input.txt \
-output /wordcount_output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

View Output

hdfs dfs -cat /wordcount_output/part-00000
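If you prefer to post-process the result in Python rather than just cat-ing it, here is a small sketch. It simply shells out to the hdfs client, and assumes the /wordcount_output path used above; the script name read_wordcount_output.py is a placeholder.

# read_wordcount_output.py - read the streaming job's result back out of HDFS
import subprocess

result = subprocess.run(
    ["hdfs", "dfs", "-cat", "/wordcount_output/part-00000"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    word, count = line.split("\t")
    print(f"{word}: {count}")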

 PART 2: Install Apache Spark on Ubuntu


 Step 1: Download Spark

Go to the official Apache Spark website and download the latest Spark release (pre-built for Hadoop).

Or use terminal:

wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

(Adjust version if needed.)


 Step 2: Extract Spark

tar -xvzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

 Step 3: Set Environment Variables

Open:

nano ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply:

source ~/.bashrc

 Step 4: Verify Spark

spark-shell

If Spark starts successfully → Installation successful ✅

Exit:

:quit
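Since Part 3 uses PySpark, it is worth checking the Python shell as well. Start it with the pyspark command; inside that shell, sc is already defined, so a quick test looks like this:

# run these lines inside the pyspark shell
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())   # should print 10
exit()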

 PART 3: Word Count in Spark (PySpark)


 Method 1: Using PySpark Script

Create file:

nano spark_wordcount.py

 spark_wordcount.py

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

sc = spark.sparkContext

# Read input file
text_file = sc.textFile("input.txt")

# Perform word count
word_counts = (text_file
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Save output
word_counts.saveAsTextFile("spark_output")

spark.stop()

 Run Spark Word Count

spark-submit spark_wordcount.py

 View Output

cat spark_output/part-00000
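The script above uses the RDD API. An equivalent word count can be written with the DataFrame API, which is the more common style in recent Spark versions. This is a minimal sketch reading the same input.txt; the file name spark_wordcount_df.py is just a suggestion.

# spark_wordcount_df.py - the same word count using the DataFrame API (sketch)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("WordCountDF").getOrCreate()

# Each line of the file becomes one row in a single column named "value"
lines = spark.read.text("input.txt")

# Split lines into words, drop empties, and count occurrences of each word
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .where(col("word") != "")
          .groupBy("word")
          .count())

counts.show()
spark.stop()

Run it the same way: spark-submit spark_wordcount_df.py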
