
Sunday, 23 March 2025

Hadoop and PySpark Word Count Example, from Installation to Output

A step-by-step guide to setting up Hadoop and PySpark and implementing a Word Count program in PySpark. I'll walk you through the setup process and then show you how to write and run a simple Word Count program.


Step 1: Install Hadoop

  1. Download Hadoop:

    • You can download the latest stable version of Hadoop from the official Apache website: Apache Hadoop Downloads.

    • Choose a version (e.g., Hadoop 3.x.x) and download the binary release suitable for your system.

  2. Extract Hadoop:

    • After downloading Hadoop, extract the files:

    bash
    tar -xvzf hadoop-3.x.x.tar.gz
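
    • Optionally, move the extracted directory to a stable install location so the paths in the next step stay simple (the destination below is just an example):

    bash
    mv hadoop-3.x.x ~/hadoop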
  3. Set Up Hadoop Environment Variables:

    • Edit the .bashrc or .bash_profile file to add Hadoop environment variables:

    bash
    nano ~/.bashrc

    Add the following lines to the end of the file:

    bash
    export HADOOP_HOME=/path/to/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    Replace /path/to/hadoop with the actual Hadoop installation path.
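
    After saving the file, reload your shell configuration and confirm the hadoop command is on your PATH (a quick check, assuming the paths above are correct):

    bash
    source ~/.bashrc
    hadoop version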

  4. Configure Hadoop:

    • Edit the Hadoop configuration files, which are located in $HADOOP_HOME/etc/hadoop/:

      • core-site.xml: Specify HDFS URI.

        xml
        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>
      • hdfs-site.xml: Define the replication factor and data directories.

        xml
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
          <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/tmp/hadoop/dfs/name</value>
          </property>
          <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/tmp/hadoop/dfs/data</value>
          </property>
        </configuration>
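
      • Before starting HDFS for the first time, format the NameNode (a one-time step; skip it if your NameNode directory is already initialized). Depending on your environment, you may also need to set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.

        bash
        hdfs namenode -format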
  5. Start Hadoop:

    bash
    start-dfs.sh
  6. Verify Hadoop is running:

    bash
    jps

    On a single-node setup, the output typically lists processes such as NameNode, DataNode, and SecondaryNameNode.

Step 2: Install Apache Spark with PySpark

  1. Download Apache Spark:

    • Go to the Apache Spark official website and download the Spark binary release (e.g., spark-3.x.x-bin-hadoop3.x.tgz): Spark Downloads.

  2. Extract Apache Spark:

    bash
    tar -xvzf spark-3.x.x-bin-hadoop3.x.tgz
  3. Set Up Spark Environment Variables:

    • Similar to Hadoop, edit .bashrc or .bash_profile and add Spark environment variables:

    bash
    nano ~/.bashrc

    Add the following lines:

    bash
    export SPARK_HOME=/path/to/spark
    export PATH=$PATH:$SPARK_HOME/bin

    Replace /path/to/spark with the actual path where Spark is installed.

  4. Verify Spark Installation:

    bash
    spark-shell

    If everything is set up correctly, it will open the Spark shell in Scala. To use PySpark, you can run:

    bash
    pyspark

Step 3: Set Up PySpark

  1. Install PySpark (if not already installed): You can install PySpark using pip:

    bash
    pip install pyspark
  2. Verify PySpark Installation: Run the following command to verify that PySpark is correctly installed:

    bash
    python -c "import pyspark; print(pyspark.__version__)"
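
    As an optional sanity check, you can run a tiny local job before moving on (a minimal sketch; the app name is arbitrary):

    python
    from pyspark.sql import SparkSession

    # Build a local SparkSession and run a trivial computation
    spark = SparkSession.builder.master("local[*]").appName("SmokeTest").getOrCreate()
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
    print(rdd.count())  # should print 4
    spark.stop()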

Step 4: Implement Word Count Program in PySpark

Now that Hadoop and Spark are set up, let's implement a Word Count program in PySpark.

  1. Prepare Input Data:

    First, create a simple input file, input.txt, with some text data. For example:

    txt
    Hello Hadoop and Spark
    Hello Hadoop Spark
    Spark
    Welcome to the world of Big Data
  2. Upload the Input Data to HDFS:

    Upload the input.txt file to HDFS:

    bash
    hadoop fs -put input.txt /user/hadoop/input/
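
    If the target directory does not exist in HDFS yet, create it first and then run the put command above:

    bash
    hadoop fs -mkdir -p /user/hadoop/input/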
  3. Write the PySpark Word Count Program:

    Create a new Python file wordcount.py:

    python
    from pyspark import SparkContext, SparkConf

    # Step 1: Initialize a SparkContext
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Step 2: Read the input text file from HDFS
    input_file = sc.textFile("hdfs://localhost:9000/user/hadoop/input/input.txt")

    # Step 3: Perform transformations to count words
    # 1. Split the lines into words
    words = input_file.flatMap(lambda line: line.split())

    # 2. Map each word to a tuple of (word, 1)
    word_counts = words.map(lambda word: (word, 1))

    # 3. Reduce by key to sum the counts for each word
    word_counts_reduced = word_counts.reduceByKey(lambda a, b: a + b)

    # Step 4: Save the output to HDFS
    word_counts_reduced.saveAsTextFile("hdfs://localhost:9000/user/hadoop/output/")

    # Step 5: Stop the SparkContext
    sc.stop()

Explanation of the PySpark Program:

  1. Initialize SparkContext: We set up a SparkContext to allow interaction with the Spark cluster.

  2. Read Input Data: We read the text file from HDFS into an RDD using sc.textFile().

  3. Transformations:

    • flatMap: Splits each line into words.

    • map: Transforms each word into a tuple (word, 1).

    • reduceByKey: Aggregates the counts for each word.

  4. Save Output: We save the resulting word count to HDFS using saveAsTextFile().

  5. Stop SparkContext: Finally, we stop the SparkContext.
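
To see what each transformation produces, here is a small local-mode sketch of the same pipeline on an in-memory list (illustrative only; it does not touch HDFS):

python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCountLocal")

# Parallelize a couple of sample lines instead of reading from HDFS
lines = sc.parallelize(["Hello Hadoop and Spark", "Hello Hadoop Spark"])

words = lines.flatMap(lambda line: line.split())    # ['Hello', 'Hadoop', 'and', 'Spark', ...]
pairs = words.map(lambda word: (word, 1))           # [('Hello', 1), ('Hadoop', 1), ...]
counts = pairs.reduceByKey(lambda a, b: a + b)      # [('Hello', 2), ('Hadoop', 2), ...]

print(counts.collect())
sc.stop()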


Step 5: Run the PySpark Job

To run the wordcount.py script with Spark, use the spark-submit command:

bash
spark-submit --master yarn --deploy-mode cluster wordcount.py

Where:

  • --master yarn: Indicates that the job will run on a YARN cluster.

  • --deploy-mode cluster: Runs the job in cluster mode (driver runs on cluster node).
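
Note: running on YARN assumes the YARN daemons (ResourceManager and NodeManager) are up. On this single-node setup you can start them with:

bash
start-yarn.sh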

Alternatively, to run locally:

bash
spark-submit --master local wordcount.py
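
You can also pass resource settings to spark-submit explicitly (the values below are only illustrative):

bash
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --executor-memory 1g \
  --executor-cores 1 \
  wordcount.py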

Step 6: Check the Output

Once the program runs successfully, you can check the output in HDFS:

bash
hadoop fs -ls /user/hadoop/output/

You should see output files like part-00000 in the /user/hadoop/output/ directory. To view the results:

bash
hadoop fs -cat /user/hadoop/output/part-00000

This should display the word counts, such as:

txt
('Hello', 2)
('Hadoop', 2)
('Spark', 3)
('and', 1)
('Big', 1)
('Data', 1)
('Welcome', 1)
('to', 1)
('world', 1)
('the', 1)
('of', 1)
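
If the counts are spread across several part files, you can merge them into a single local file for inspection (the local filename is just an example):

bash
hadoop fs -getmerge /user/hadoop/output/ wordcount_output.txt
cat wordcount_output.txt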

Step 7: Clean Up

Once done, you can remove the input and output files from HDFS to clean up:

bash
hadoop fs -rm -r /user/hadoop/input/
hadoop fs -rm -r /user/hadoop/output/
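
If you are finished with the cluster, you can also stop the HDFS daemons:

bash
stop-dfs.sh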

Summary

To summarize, the steps to set up Hadoop and PySpark and implement a Word Count program are:

  1. Set Up Hadoop: Install and configure Hadoop, and set up HDFS.

  2. Install Spark: Download, install, and configure Apache Spark.

  3. Install PySpark: Install the PySpark package using pip.

  4. Write the PySpark Word Count Program: Implement the Word Count program using PySpark.

  5. Run the Program: Use spark-submit to run the program on Hadoop/YARN or locally.

  6. Check the Output: View the output files on HDFS.
