Sunday, 23 March 2025

pig bda lab

 

Step 1: Create Sample Data in Pig

We need to first create some sample data to work with. Typically, sample data can be stored in HDFS (Hadoop Distributed File System) or a local file system.

For this example, let's assume we're using the local file system and create a file with sample data.

Sample Data (let’s use a file called people_data.txt):

plaintext
Alice,30,New York
Bob,25,California
Charlie,35,Texas
David,30,California
Eva,20,New York
Frank,40,California
Grace,30,Texas
Hannah,45,New York
Ivy,25,California
Jack,20,Texas

This data represents a list of people with their name, age, and city.

Step 1.1: Upload the Data to HDFS (Optional)

If you’re using HDFS, you can upload the data using the following command:

bash
hdfs dfs -put people_data.txt /user/hadoop/

For simplicity, we’ll assume that the data is available in the local file system at people_data.txt.

Step 2: Start Pig Script

Now, we will start creating a Pig script. Let's open a new Pig script file and call it people_operations.pig.

bash
vi people_operations.pig

Step 3: Load the Sample Data into Pig

In the script, start by loading the sample data into a Pig relation.

pig
-- Load the sample data from the local file system
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
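Before moving on, it can help to confirm that Pig parsed the file as intended. The following is a minimal sketch that repeats the LOAD statement and then inspects the relation; the alias first_rows is a name chosen here for illustration.

```pig
-- Load the sample data (same statement as in the step above)
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Print the schema Pig has attached to the relation
DESCRIBE people;

-- Show a few rows instead of dumping the whole relation
first_rows = LIMIT people 3;
DUMP first_rows;
```

If a field shows up empty in the dumped rows, the delimiter or the declared type probably does not match the file.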

Step 4: Perform Operations Step by Step

4.1: Sort

We will now sort the data based on the age field in descending order.

pig
sorted_people = ORDER people BY age DESC;
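Several people in the sample data share the same age, so sorting on age alone leaves their relative order undefined. As a sketch of one way to handle this, the script below adds name as a second sort key and keeps only the top three rows; the aliases sorted_by_age_name and oldest_three are illustrative names, not part of the original script.

```pig
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Sort by age (oldest first); break ties alphabetically by name
sorted_by_age_name = ORDER people BY age DESC, name ASC;

-- Keep only the three oldest people
oldest_three = LIMIT sorted_by_age_name 3;
DUMP oldest_three;
```

With the sample data above, this should return the rows for Hannah, Frank, and Charlie.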

4.2: Group

Next, let's group the data by city so we can perform operations like counting or applying other aggregations based on cities.

pig
grouped_by_city = GROUP people BY city;

This creates a new relation with one tuple per city. Each tuple has two fields: group, which holds the city value, and people, a bag containing all the rows for that city.
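To illustrate the kind of aggregation that grouping enables, here is a short sketch that counts the people in each city and averages their ages using Pig's built-in COUNT and AVG functions. The alias city_stats and the output field names are assumptions made for this example.

```pig
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

grouped_by_city = GROUP people BY city;

-- 'group' holds the city; 'people' is the bag of rows for that city
city_stats = FOREACH grouped_by_city GENERATE
    group           AS city,
    COUNT(people)   AS num_people,
    AVG(people.age) AS avg_age;

DUMP city_stats;
```

For the sample file, the counts should come out as 3 for New York, 4 for California, and 3 for Texas.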

4.3: Join

Now, let’s create a second relation that has additional data, say a "city population" relation, and join it with the first relation. Let's assume we have a file city_population.txt with city population data.

Example of city_population.txt:

plaintext
New York,8419600 California,39538223 Texas,29145505

Load this data:

pig
city_population = LOAD 'city_population.txt' USING PigStorage(',') AS (city:chararray, population:int);

Perform a join to get the population of each city along with the people’s data:

pig
joined_data = JOIN people BY city, city_population BY city;

4.4: Project

Now, let's project the columns we want to see. For example, we want to display the name, age, city, and population after the join. Because both inputs have a city field, the :: operator is used to say which relation each field comes from.

pig
projected_data = FOREACH joined_data GENERATE people::name AS name, people::age AS age, city_population::city AS city, city_population::population AS population;
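As a variation on the join and projection, the sketch below uses a left outer join, which keeps every person even when the population file has no row for their city (the population field is then null). The aliases people_with_pop and result are hypothetical names introduced for this example.

```pig
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);
city_population = LOAD 'city_population.txt' USING PigStorage(',')
         AS (city:chararray, population:int);

-- LEFT OUTER keeps all rows from 'people', matched or not
people_with_pop = JOIN people BY city LEFT OUTER, city_population BY city;

-- '::' selects a field from one side of the join
result = FOREACH people_with_pop GENERATE
    people::name                AS name,
    people::city                AS city,
    city_population::population AS population;

DUMP result;
```

With the sample files every city has a population row, so the result contains the same people as the inner join; the difference only shows up when a city is missing from city_population.txt.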

4.5: Filter

Finally, let’s filter the results to only include people who are above 30 years old.

pig
filtered_data = FILTER projected_data BY age > 30;
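FILTER conditions can be combined with AND, OR, and NOT. The following sketch, which works directly on the loaded people relation, keeps only people over 30 who live in Texas; the alias older_texans is an illustrative name.

```pig
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Both conditions must hold for a row to be kept
older_texans = FILTER people BY age > 30 AND city == 'Texas';

DUMP older_texans;
```

Filtering on the input relation before a join, as done here, also reduces the amount of data the join has to process.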

Step 5: Dump the Results

Now, let’s dump the results to see the output.

pig
DUMP filtered_data;
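DUMP is convenient for checking results on screen, but the output is not saved. If the results need to be kept, STORE can be used instead. The sketch below is a small self-contained example; the output directory name people_over_30 is an assumption, and Pig will report an error if that directory already exists.

```pig
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

over_30 = FILTER people BY age > 30;

-- Write comma-separated part files into the 'people_over_30' directory
STORE over_30 INTO 'people_over_30' USING PigStorage(',');
```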

Complete Pig Script (people_operations.pig)

Here is the complete Pig script:

pig
-- Load the sample data from the local file system
people = LOAD 'people_data.txt' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Sort the data by 'age' in descending order
sorted_people = ORDER people BY age DESC;

-- Group the data by 'city'
grouped_by_city = GROUP people BY city;

-- Load the city population data
city_population = LOAD 'city_population.txt' USING PigStorage(',') AS (city:chararray, population:int);

-- Join the people data with the city population data on 'city'
joined_data = JOIN people BY city, city_population BY city;

-- Project the columns to include name, age, city, and population
projected_data = FOREACH joined_data GENERATE people::name AS name, people::age AS age, city_population::city AS city, city_population::population AS population;

-- Filter to include only people with age greater than 30
filtered_data = FILTER projected_data BY age > 30;

-- Dump the results
DUMP filtered_data;

Step 6: Run the Pig Script

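Now, you can run this Pig script. Because the input files are on the local file system, run Pig in local mode with -x local (without it, Pig defaults to MapReduce mode and resolves the paths against HDFS):

bash
pig -x local people_operations.pig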

Step 7: Output

After running the script, you should see the filtered and projected results. The output will show people older than 30, along with their name, age, city, and population.

Example Output (after filtering age > 30; the row order may differ):

plaintext
(Frank,40,California,39538223)
(Hannah,45,New York,8419600)
(Charlie,35,Texas,29145505)

Explanation of Operations:

  1. LOAD: We load data from a text file into a Pig relation (people and city_population).

  2. ORDER: We sort the people by age in descending order.

  3. GROUP: We group the data by city so we can perform aggregations such as counts per city.

  4. JOIN: We join the people data with city population data using the city field.

  5. FOREACH: We project the data to extract specific columns.

  6. FILTER: We filter the data to include only people who are older than 30.

  7. DUMP: We display the final result.
