Hadoop learning pot: pig bda lab

Step 1: Create Sample Data in Pig

First, create some sample data to work with. Pig can read input from HDFS (Hadoop Distributed File System) or from the local file system.

This example uses the local file system, so create a plain text file there.

Sample Data (let’s use a file called `people_data.txt`):

plaintext
Alice,30,New York
Bob,25,California
Charlie,35,Texas
David,30,California
Eva,20,New York
Frank,40,California
Grace,30,Texas
Hannah,45,New York
Ivy,25,California
Jack,20,Texas

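Each line is a comma-separated record with no header row: name, age, and city. (The city column loosely mixes a city, New York, with two states; here it serves only as a grouping and join key.)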

Step 1.1: Upload the Data to HDFS (Optional)

If you’re using HDFS, you can upload the data using the following command:

bash
hdfs dfs -put people_data.txt /user/hadoop/

For simplicity, the rest of this walkthrough assumes that people_data.txt is in the local file system, in the directory from which Pig is launched, and that Pig runs in local mode (`pig -x local`).

Step 2: Start Pig Script

Next, create a new Pig script file named people_operations.pig and open it in an editor:

bash
vi people_operations.pig

Step 3: Load the Sample Data into Pig

In the script, start by loading the sample data into a Pig relation.

pig
-- Load the sample data from the local file system
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
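
Before building on the relation, it can help to confirm that the schema was applied as intended. The following is a minimal sketch, assuming people_data.txt is in the working directory; the relation name `preview` is introduced here only for illustration.

```pig
-- Reload the relation so this snippet can run on its own
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Print the schema Pig has attached to the relation
DESCRIBE people;

-- Inspect a few rows without dumping the whole relation
preview = LIMIT people 3;
DUMP preview;
```

If a value in the age column cannot be converted to int, Pig loads that field as null rather than failing the job, so a quick preview like this is a cheap way to spot malformed input.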

Step 4: Perform Operations Step by Step

4.1: Sort

We will now sort the data based on the age field in descending order.

pig
sorted_people = ORDER people BY age DESC;
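
Several people in the sample share the same age, so sorting on age alone leaves the order of those ties unspecified. As a sketch of one way to handle this, the variant below adds name as a secondary sort key and keeps only the first three rows; `oldest_three` is a name chosen for this example.

```pig
-- Reload the relation so this snippet can run on its own
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Sort by age (oldest first) and break ties alphabetically by name
sorted_people = ORDER people BY age DESC, name ASC;

-- Keep the three oldest people
oldest_three = LIMIT sorted_people 3;
DUMP oldest_three;

-- With the sample data above this prints:
-- (Hannah,45,New York)
-- (Frank,40,California)
-- (Charlie,35,Texas)
```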

4.2: Group

Next, let's group the data by city so we can perform operations like counting or applying other aggregations based on cities.

pig
grouped_by_city = GROUP people BY city;

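This creates a new relation with one tuple per city. Each tuple has two fields: `group`, which holds the city, and `people`, a bag containing every tuple from the people relation for that city.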
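
A grouped relation is usually followed by an aggregation over each bag. The sketch below counts the people in each city with the built-in COUNT function; the relation name `city_counts` and the alias `num_people` are introduced only for this example.

```pig
-- Reload and group so this snippet can run on its own
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
grouped_by_city = GROUP people BY city;

-- 'group' holds the city; 'people' is the bag of tuples for that city
city_counts = FOREACH grouped_by_city GENERATE group AS city, COUNT(people) AS num_people;
DUMP city_counts;

-- With the sample data above this prints (row order may vary):
-- (California,4)
-- (New York,3)
-- (Texas,3)
```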

4.3: Join

Next, load a second relation with additional data, a "city population" relation, and join it with the first. Assume a file city_population.txt holds the population for each city.

Example of city_population.txt:

plaintext
New York,8419600
California,39538223
Texas,29145505

Load this data:

pig
city_population = LOAD 'city_population.txt' USING PigStorage(',') AS (city:chararray, population:int);

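Join the original people relation with city_population to attach the population to each person's record. The join uses people rather than grouped_by_city, because each tuple of grouped_by_city holds a bag of people per city, and the later projection and filter need one flat row per person:

pig
joined_data = JOIN people BY city, city_population BY city;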
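
An inner join drops any person whose city has no entry in the population file. As a hedged sketch of the alternative, the snippet below joins the ungrouped people relation with LEFT OUTER so that every person is kept, and then prints the resulting schema; `people_with_pop` is a name chosen for this example.

```pig
-- Reload both relations so this snippet can run on its own
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
city_population = LOAD 'city_population.txt' USING PigStorage(',') AS (city:chararray, population:int);

-- LEFT OUTER keeps every person; the population fields are null when no city matches
people_with_pop = JOIN people BY city LEFT OUTER, city_population BY city;

-- Joined fields are prefixed with their source relation, for example people::city
DESCRIBE people_with_pop;
```

With the sample files every city has a match, so the inner and left outer joins return the same ten rows; the difference only shows up once a city is missing from city_population.txt.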

4.4: Project

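Now project the columns we want to see: name, age, city, and population. After a join, each field is prefixed with the name of the relation it came from, and the `::` operator selects it. The AS clauses give the output fields short names so that later steps can refer to them directly.

pig
projected_data = FOREACH joined_data GENERATE people::name AS name, people::age AS age, people::city AS city, city_population::population AS population;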

4.5: Filter

Finally, filter the results to keep only people older than 30. The comparison is strict, so the three people aged exactly 30 are excluded.

pig
filtered_data = FILTER projected_data BY age > 30;
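
FILTER conditions can be combined with AND, OR, and NOT. The following sketch applies a compound condition directly to the loaded people relation; `older_non_texans` is a name introduced only for this example.

```pig
-- Reload the relation so this snippet can run on its own
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Keep people older than 30 who are not in Texas
older_non_texans = FILTER people BY (age > 30) AND (city != 'Texas');
DUMP older_non_texans;

-- With the sample data above this prints (row order may vary):
-- (Frank,40,California)
-- (Hannah,45,New York)
```

Filtering before a join, as this snippet does, also reduces the number of rows the join has to process.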

Step 5: Dump the Results

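Now dump the results to see the output. DUMP triggers execution of only the statements that filtered_data depends on, so sorted_people and grouped_by_city are defined in the script but are not computed unless they are dumped or stored as well.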

pig
DUMP filtered_data;
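
DUMP only prints to the console. To keep the results, a relation can be written out with STORE instead. This is a minimal sketch; the output directory name `people_over_30` is hypothetical, and `over_30` is a name chosen for this example.

```pig
-- Reload and filter so this snippet can run on its own
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
over_30 = FILTER people BY age > 30;

-- Write comma-separated part files into the output directory.
-- 'people_over_30' is a hypothetical name; the job fails if the directory already exists.
STORE over_30 INTO 'people_over_30' USING PigStorage(',');
```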

Complete Pig Script (`people_operations.pig`)

Here is the complete Pig script:

pig
-- Load the sample data from the local file system
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Sort the data by 'age' in descending order
sorted_people = ORDER people BY age DESC;

-- Group the data by 'city'
grouped_by_city = GROUP people BY city;

-- Load the city population data
city_population = LOAD 'city_population.txt' USING PigStorage(',') AS (city:chararray, population:int);

-- Join the ungrouped people data with the city population data on 'city'
joined_data = JOIN people BY city, city_population BY city;

-- Project name, age, city, and population; '::' selects a field by its source relation
projected_data = FOREACH joined_data GENERATE people::name AS name, people::age AS age, people::city AS city, city_population::population AS population;

-- Filter to include only people with age greater than 30
filtered_data = FILTER projected_data BY age > 30;

-- Dump the results
DUMP filtered_data;

Step 6: Run the Pig Script

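Run the script in local mode so that Pig reads people_data.txt and city_population.txt from the local file system. Without `-x local`, Pig defaults to MapReduce mode and resolves the paths against HDFS.

bash
pig -x local people_operations.pig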

Step 7: Output

After running the script, you should see one tuple per person older than 30, with their name, age, city, and the population joined from city_population.txt. No ORDER is applied to the final relation, so the rows may appear in a different order from the one shown here.

Example Output (after filtering age > 30):

plaintext
(Frank,40,California,39538223)
(Hannah,45,New York,8419600)
(Charlie,35,Texas,29145505)

Explanation of Operations:

LOAD: We load data from a text file into a Pig relation (people and city_population).
ORDER: We sort the people by age in descending order.
GROUP: We group the data by city so that per-city aggregations, such as counts, can be computed.
JOIN: We join the people data with city population data using the city field.
FOREACH: We project the data to extract specific columns.
FILTER: We filter the data to include only people who are older than 30.
DUMP: We display the final result.

Sunday, 23 March 2025