Tuesday, 18 March 2025

HADOOP MANUAL INSTALLATION IN UBUNTU

 

Prerequisites

  • Ubuntu 20.04 or later.
  • JDK 8 or newer (Hadoop requires Java).
  • Sudo privileges.
  • An SSH server with passwordless SSH to localhost (the Hadoop start scripts log in over SSH).

Step 1: Install Java Development Kit (JDK)

Hadoop requires Java, so you'll need to install it first.

  1. Update your system package index:

    bash
    sudo apt update
  2. Install OpenJDK 8 (or another supported version):

    bash
    sudo apt install openjdk-8-jdk
  3. Verify the installation:

    bash
    java -version
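
Rather than guessing the JDK path used for JAVA_HOME later in this guide, you can derive it from the installed java binary (a quick sketch; assumes java is on the PATH):

```shell
# Resolve the java binary through its symlinks, then strip /bin/java
# to get the JDK root (e.g. /usr/lib/jvm/java-8-openjdk-amd64).
JAVA_PATH=$(readlink -f "$(which java)")
echo "${JAVA_PATH%/bin/java}"
```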

Step 2: Install Hadoop (Supported version: 3.3.x)

Hadoop is the first tool you need to install.

  1. Download Hadoop: Go to the Hadoop releases page and get the latest stable 3.3.x release. The commands below use Hadoop 3.3.1 as an example; substitute the version you download.

    bash
    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
  2. Extract the downloaded tar.gz file:

    bash
    tar -xvzf hadoop-3.3.1.tar.gz
  3. Move Hadoop to the /usr/local/ directory:

    bash
    sudo mv hadoop-3.3.1 /usr/local/hadoop
  4. Set Hadoop Environment Variables: Open your .bashrc file to add environment variables for Hadoop:

    bash
    nano ~/.bashrc
  5. Add the following lines at the end of the file:

    bash
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME

    Save and close the file. Then, apply the changes:

    bash
    source ~/.bashrc
  6. Configure Hadoop: Hadoop needs to be configured with some basic settings.

    Edit the following files in /usr/local/hadoop/etc/hadoop/:

    • hadoop-env.sh: Set Java home.

      bash
      nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

      Add the line:

      bash
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    • core-site.xml: Configure the core site to set the default filesystem.

      bash
      nano /usr/local/hadoop/etc/hadoop/core-site.xml

      Add the following content:

      xml
      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:9000</value>
        </property>
      </configuration>
    • hdfs-site.xml: Set the replication factor and directories for Hadoop.

      bash
      nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

      Add the following content:

      xml
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/usr/local/hadoop/hdfs/name</value>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>/usr/local/hadoop/hdfs/data</value>
        </property>
      </configuration>
    • mapred-site.xml: Tell MapReduce to run on YARN. (In Hadoop 3.x this file ships directly in the configuration directory; there is no .template file to copy.)

      bash
      nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

      Add:

      xml
      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>
    • yarn-site.xml: Set the YARN Resource Manager addresses and enable the MapReduce shuffle service (required for MapReduce jobs to run on YARN).

      bash
      nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

      Add:

      xml
      <configuration>
        <property>
          <name>yarn.resourcemanager.address</name>
          <value>localhost:8032</value>
        </property>
        <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>localhost:8030</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
      </configuration>
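
Before formatting HDFS, make sure the name and data directories referenced in hdfs-site.xml exist and that your user owns the Hadoop tree (it is root-owned after the sudo mv earlier); a sketch using this guide's paths:

```shell
# Create the HDFS metadata and block directories used in hdfs-site.xml.
sudo mkdir -p /usr/local/hadoop/hdfs/name /usr/local/hadoop/hdfs/data
# Hand the tree to the current user so the daemons can write logs and data.
sudo chown -R "$USER":"$USER" /usr/local/hadoop
```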

Step 3: Format the HDFS Filesystem

Before starting Hadoop, format the HDFS filesystem:

bash
hdfs namenode -format

Step 4: Start Hadoop Services

Start HDFS and YARN services:

bash
start-dfs.sh
start-yarn.sh
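
A quick way to confirm the daemons started is jps, which lists running JVMs; a healthy pseudo-distributed node shows NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

```shell
# Filter the JVM process list down to the Hadoop daemons.
jps | grep -E 'NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager'
```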

Step 5: Install Hive (Supported Version: 3.1.2)

  1. Download Hive: Go to the Hive releases page and download the latest stable version (e.g., 3.1.2).

    bash
    wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
  2. Extract and move Hive to the appropriate directory:

    bash
    tar -xvzf apache-hive-3.1.2-bin.tar.gz
    sudo mv apache-hive-3.1.2-bin /usr/local/hive
  3. Set Hive Environment Variables: Open your .bashrc file to set the Hive environment variables:

    bash
    nano ~/.bashrc

    Add the following:

    bash
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin
    export HIVE_CONF_DIR=$HIVE_HOME/conf

    Apply the changes:

    bash
    source ~/.bashrc
  4. Configure Hive: Configure Hive by editing the hive-site.xml file.

    bash
    nano /usr/local/hive/conf/hive-site.xml

    Add the following configuration:

    xml
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=/usr/local/hive/metastore_db;create=true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
      </property>
    </configuration>
  5. Initialize the metastore and start Hive: Hive 3.x requires the Derby metastore schema to be initialized once before first use; then you can start the Hive shell:

    bash
    schematool -dbType derby -initSchema
    hive
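
Once the Hive shell works, a small non-interactive smoke test is possible via hive -f (the table and file names here are illustrative, not part of the official setup):

```shell
# Write a tiny HiveQL script, then run it in batch mode.
cat > /tmp/hive_smoke.hql <<'EOF'
CREATE TABLE IF NOT EXISTS smoke_test (id INT, name STRING);
SHOW TABLES;
DROP TABLE IF EXISTS smoke_test;
EOF
hive -f /tmp/hive_smoke.hql
```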

Step 6: Install Apache Pig (Supported Version: 0.17.0)

  1. Download Apache Pig: Go to the Pig releases page and download the latest stable version (e.g., 0.17.0).

    bash
    wget https://downloads.apache.org/pig/pig-0.17.0/apache-pig-0.17.0.tar.gz
  2. Extract and move Pig:

    bash
    tar -xvzf apache-pig-0.17.0.tar.gz
    sudo mv apache-pig-0.17.0 /usr/local/pig
  3. Set Pig Environment Variables: Open your .bashrc file to set the Pig environment variables:

    bash
    nano ~/.bashrc

    Add:

    bash
    export PIG_HOME=/usr/local/pig
    export PATH=$PATH:$PIG_HOME/bin

    Apply the changes:

    bash
    source ~/.bashrc
  4. Start Apache Pig: You can now start Apache Pig using the following command:

    bash
    pig
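
Running pig with no arguments opens the interactive Grunt shell against HDFS. For a quick sanity check that avoids the cluster entirely, Pig can also execute a script in local mode (the file names below are illustrative):

```shell
# Prepare a small tab-separated input file and a Pig Latin script.
printf 'alice\t20\nbob\t30\n' > /tmp/people.tsv
cat > /tmp/people.pig <<'EOF'
people = LOAD '/tmp/people.tsv' AS (name:chararray, age:int);
adults = FILTER people BY age >= 21;
DUMP adults;
EOF
# -x local runs against the local filesystem instead of HDFS.
pig -x local /tmp/people.pig
```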

Conclusion

You’ve successfully installed Hadoop, Hive, and Pig on Ubuntu. To summarize:

  • Install Java (JDK 8 or newer).
  • Install Hadoop (3.3.x).
  • Configure Hadoop's environment variables and HDFS/YARN settings.
  • Install Hive (3.1.2) and configure the hive-site.xml file.
  • Install Apache Pig (0.17.0) and configure the environment.
