
Sunday, 23 March 2025

Hadoop and PySpark Word Count Example, from Installation to Output

A step-by-step guide to setting up Hadoop and PySpark and implementing a Word Count program in PySpark. I'll walk you through the setup process and then show you how to write and run a simple Word Count program.


Step 1: Install Hadoop

  1. Download Hadoop:

    • You can download the latest stable version of Hadoop from the official Apache website: Apache Hadoop Downloads.

    • Choose a version (e.g., Hadoop 3.x.x) and download the binary release suitable for your system.

  2. Extract Hadoop:

    • After downloading Hadoop, extract the files:

    bash
    tar -xvzf hadoop-3.x.x.tar.gz
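
    • Optionally, move the extracted directory to a stable install location so the paths in the next step stay simple (the destination below is just an example):

    bash
    mv hadoop-3.x.x ~/hadoop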
  3. Set Up Hadoop Environment Variables:

    • Edit the .bashrc or .bash_profile file to add Hadoop environment variables:

    bash
    nano ~/.bashrc

    Add the following lines to the end of the file:

    bash
    export HADOOP_HOME=/path/to/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    Replace /path/to/hadoop with the actual Hadoop installation path.
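
    After saving the file, reload your shell configuration and confirm the hadoop command is on your PATH (a quick check, assuming the paths above are correct):

    bash
    source ~/.bashrc
    hadoop version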

  4. Configure Hadoop:

    • Edit the Hadoop configuration files, which are located in $HADOOP_HOME/etc/hadoop/:

      • core-site.xml: Specify HDFS URI.

        xml
        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>
      • hdfs-site.xml: Define the replication factor and data directories.

        xml
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
          <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/tmp/hadoop/dfs/name</value>
          </property>
          <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/tmp/hadoop/dfs/data</value>
          </property>
        </configuration>
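
      • Before starting HDFS for the first time, format the NameNode (a one-time step; skip it if your NameNode directory is already initialized). Depending on your environment, you may also need to set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.

        bash
        hdfs namenode -format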
  5. Start Hadoop:

    bash
    start-dfs.sh
  6. Verify Hadoop is running:

    bash
    jps

    On a single-node setup, the output typically lists processes such as NameNode, DataNode, and SecondaryNameNode.

Step 2: Install Apache Spark with PySpark

  1. Download Apache Spark:

    • Go to the Apache Spark official website and download the Spark binary release (e.g., spark-3.x.x-bin-hadoop3.x.tgz): Spark Downloads.

  2. Extract Apache Spark:

    bash
    tar -xvzf spark-3.x.x-bin-hadoop3.x.tgz
  3. Set Up Spark Environment Variables:

    • Similar to Hadoop, edit .bashrc or .bash_profile and add Spark environment variables:

    bash
    nano ~/.bashrc

    Add the following lines:

    bash
    export SPARK_HOME=/path/to/spark
    export PATH=$PATH:$SPARK_HOME/bin

    Replace /path/to/spark with the actual path where Spark is installed.

  4. Verify Spark Installation:

    bash
    spark-shell

    If everything is set up correctly, it will open the Spark shell in Scala. To use PySpark, you can run:

    bash
    pyspark

Step 3: Set Up PySpark

  1. Install PySpark (if not already installed): You can install PySpark using pip:

    bash
    pip install pyspark
  2. Verify PySpark Installation: Run the following command to verify that PySpark is correctly installed:

    bash
    python -c "import pyspark; print(pyspark.__version__)"
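
    As an optional sanity check, you can run a tiny local job before moving on (a minimal sketch; the app name is arbitrary):

    python
    from pyspark.sql import SparkSession

    # Build a local SparkSession and run a trivial computation
    spark = SparkSession.builder.master("local[*]").appName("SmokeTest").getOrCreate()
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
    print(rdd.count())  # should print 4
    spark.stop()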

Step 4: Implement Word Count Program in PySpark

Now that Hadoop and Spark are set up, let's implement a Word Count program in PySpark.

  1. Prepare Input Data:

    First, create a simple input file, input.txt, with some text data. For example:

    txt
    Hello Hadoop and Spark
    Hello Hadoop Spark
    Spark
    Welcome to the world of Big Data
  2. Upload the Input Data to HDFS:

    Upload the input.txt file to HDFS:

    bash
    hadoop fs -put input.txt /user/hadoop/input/
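
    If the target directory does not exist in HDFS yet, create it first and then run the put command above:

    bash
    hadoop fs -mkdir -p /user/hadoop/input/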
  3. Write the PySpark Word Count Program:

    Create a new Python file wordcount.py:

    python
    from pyspark import SparkContext, SparkConf

    # Step 1: Initialize a SparkContext
    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    # Step 2: Read the input text file from HDFS
    input_file = sc.textFile("hdfs://localhost:9000/user/hadoop/input/input.txt")

    # Step 3: Perform transformations to count words
    # 1. Split the lines into words
    words = input_file.flatMap(lambda line: line.split())

    # 2. Map each word to a tuple of (word, 1)
    word_counts = words.map(lambda word: (word, 1))

    # 3. Reduce by key to sum the counts for each word
    word_counts_reduced = word_counts.reduceByKey(lambda a, b: a + b)

    # Step 4: Save the output to HDFS
    word_counts_reduced.saveAsTextFile("hdfs://localhost:9000/user/hadoop/output/")

    # Step 5: Stop the SparkContext
    sc.stop()

Explanation of the PySpark Program:

  1. Initialize SparkContext: We set up a SparkContext to allow interaction with the Spark cluster.

  2. Read Input Data: We read the text file from HDFS into an RDD using sc.textFile().

  3. Transformations:

    • flatMap: Splits each line into words.

    • map: Transforms each word into a tuple (word, 1).

    • reduceByKey: Aggregates the counts for each word.

  4. Save Output: We save the resulting word count to HDFS using saveAsTextFile().

  5. Stop SparkContext: Finally, we stop the SparkContext.
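
To see what each transformation produces, here is a small local-mode sketch of the same pipeline on an in-memory list (illustrative only; it does not touch HDFS):

python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCountLocal")

# Parallelize a couple of sample lines instead of reading from HDFS
lines = sc.parallelize(["Hello Hadoop and Spark", "Hello Hadoop Spark"])

words = lines.flatMap(lambda line: line.split())    # ['Hello', 'Hadoop', 'and', 'Spark', ...]
pairs = words.map(lambda word: (word, 1))           # [('Hello', 1), ('Hadoop', 1), ...]
counts = pairs.reduceByKey(lambda a, b: a + b)      # [('Hello', 2), ('Hadoop', 2), ...]

print(counts.collect())
sc.stop()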


Step 5: Run the PySpark Job

To run the wordcount.py script with Spark, use the spark-submit command:

bash
spark-submit --master yarn --deploy-mode cluster wordcount.py

Where:

  • --master yarn: Indicates that the job will run on a YARN cluster.

  • --deploy-mode cluster: Runs the job in cluster mode (driver runs on cluster node).
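
Note: running on YARN assumes the YARN daemons (ResourceManager and NodeManager) are up. On this single-node setup you can start them with:

bash
start-yarn.sh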

Alternatively, to run locally:

bash
spark-submit --master local wordcount.py
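
You can also pass resource settings to spark-submit explicitly (the values below are only illustrative):

bash
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --executor-memory 1g \
  --executor-cores 1 \
  wordcount.py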

Step 6: Check the Output

Once the program runs successfully, you can check the output in HDFS:

bash
hadoop fs -ls /user/hadoop/output/

You should see output files like part-00000 in the /user/hadoop/output/ directory. To view the results:

bash
hadoop fs -cat /user/hadoop/output/part-00000

This should display the word counts, such as:

txt
('Hello', 2)
('Hadoop', 2)
('Spark', 3)
('and', 1)
('Big', 1)
('Data', 1)
('Welcome', 1)
('to', 1)
('world', 1)
('the', 1)
('of', 1)
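
If the counts are spread across several part files, you can merge them into a single local file for inspection (the local filename is just an example):

bash
hadoop fs -getmerge /user/hadoop/output/ wordcount_output.txt
cat wordcount_output.txt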

Step 7: Clean Up

Once done, you can remove the input and output files from HDFS to clean up:

bash
hadoop fs -rm -r /user/hadoop/input/
hadoop fs -rm -r /user/hadoop/output/
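
If you are finished with the cluster, you can also stop the HDFS daemons:

bash
stop-dfs.sh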

Summary

To summarize, the steps to set up Hadoop and PySpark and implement a Word Count program are:

  1. Set Up Hadoop: Install and configure Hadoop, and set up HDFS.

  2. Install Spark: Download, install, and configure Apache Spark.

  3. Install PySpark: Install the PySpark package using pip.

  4. Write the PySpark Word Count Program: Implement the Word Count program using PySpark.

  5. Run the Program: Use spark-submit to run the program on Hadoop/YARN or locally.

  6. Check the Output: View the output files on HDFS.
