Step 1: Create Sample Data in Pig
We need to first create some sample data to work with. Typically, sample data can be stored in HDFS (Hadoop Distributed File System) or a local file system.
For this example, let's assume we're using the local file system and create a file with sample data.
Sample Data (let’s use a file called people_data.txt):
This data represents a list of people with their name, age, and city.
Step 1.1: Upload the Data to HDFS (Optional)
If you’re using HDFS, you can upload the data using the following command:
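A minimal sketch of the upload, using Pig's own fs command, which passes its arguments through to the Hadoop file system shell and can be typed at the Grunt prompt. The target directory /user/hadoop/pig_demo is a hypothetical path chosen for illustration.

```pig
-- Hypothetical HDFS target directory; adjust it to your cluster.
fs -mkdir -p /user/hadoop/pig_demo
-- Copy the local file into HDFS.
fs -put people_data.txt /user/hadoop/pig_demo/people_data.txt
-- List the directory to confirm the upload.
fs -ls /user/hadoop/pig_demo
```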
For simplicity, we’ll assume that the data is available in the local file system at people_data.txt.
Step 2: Start Pig Script
Now, we will start creating a Pig script. Let's open a new Pig script file and call it people_operations.pig.
Step 3: Load the Sample Data into Pig
In the script, start by loading the sample data into a Pig relation.
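A minimal sketch of the load step. It assumes the file is comma-separated with one person per line in the order name, age, city; the delimiter and the field types are assumptions, so adjust them to match your file.

```pig
-- Assumption: each line of people_data.txt looks like  name,age,city
-- PigStorage(',') splits every line on commas; the AS clause names and types the fields.
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Optional check: print the schema Pig has assigned to the relation.
DESCRIBE people;
```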
Step 4: Perform Operations Step by Step
4.1: Sort
We will now sort the data based on the age field in descending order.
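The sort could be written as below. The relation name sorted_people is a hypothetical choice, and the LOAD line repeats the assumed comma-separated layout so the snippet runs on its own.

```pig
-- Setup (assumed layout: name,age,city, comma-separated).
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Sort by age, oldest first.
sorted_people = ORDER people BY age DESC;
```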
4.2: Group
Next, let's group the data by city so we can perform operations like counting or applying other aggregations based on cities.
This will create a new relation where the people data is grouped by the city field.
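A sketch of the grouping step, followed by a count per city to show what the grouped relation is good for. The names grouped_by_city, city_counts, and num_people are hypothetical.

```pig
-- Setup (assumed layout: name,age,city, comma-separated).
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Each output tuple has two fields: group (the city) and a bag named
-- people that holds every input tuple for that city.
grouped_by_city = GROUP people BY city;

-- Count the people in each city's bag.
city_counts = FOREACH grouped_by_city
              GENERATE group AS city, COUNT(people) AS num_people;
```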
4.3: Join
Now, let’s create a second relation that has additional data, say a "city population" relation, and join it with the first relation. Let's assume we have a file city_population.txt with city population data.
Example of city_population.txt:
Load this data:
Perform a join to get the population of each city along with the people’s data:
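The load and join might look like the sketch below. It assumes city_population.txt is comma-separated with the fields city, population; that layout and the relation name joined are assumptions.

```pig
-- Setup (assumed layout: name,age,city, comma-separated).
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Assumption: each line of city_population.txt looks like  city,population
city_population = LOAD 'city_population.txt' USING PigStorage(',')
                  AS (city:chararray, population:long);

-- Inner join on city. People whose city has no population row are dropped.
joined = JOIN people BY city, city_population BY city;
```

After the join, both inputs contribute a city field, so fields are referred to with their relation prefix, for example people::city and city_population::population.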
4.4: Project
Now, let's project the columns we want to see. For example, we want to display the name, age, city, and population after the join.
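A sketch of the projection, under the same assumed file layouts. The :: prefix selects which input a field came from, and the AS clauses give the output fields short names. The relation names joined and projected are hypothetical.

```pig
-- Setup (assumed comma-separated layouts).
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);
city_population = LOAD 'city_population.txt' USING PigStorage(',')
                  AS (city:chararray, population:long);
joined = JOIN people BY city, city_population BY city;

-- Keep four columns and rename them so later steps can drop the prefixes.
projected = FOREACH joined GENERATE
                people::name                AS name,
                people::age                 AS age,
                people::city                AS city,
                city_population::population AS population;
```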
4.5: Filter
Finally, let’s filter the results to only include people who are above 30 years old.
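A sketch of the filter, applied to the projected relation. The setup lines repeat the assumed layouts and hypothetical relation names so the snippet runs on its own.

```pig
-- Setup (assumed comma-separated layouts).
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);
city_population = LOAD 'city_population.txt' USING PigStorage(',')
                  AS (city:chararray, population:long);
joined = JOIN people BY city, city_population BY city;
projected = FOREACH joined GENERATE
                people::name AS name, people::age AS age,
                people::city AS city, city_population::population AS population;

-- Keep only rows with age strictly greater than 30 (a 30-year-old is excluded).
filtered = FILTER projected BY age > 30;
```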
Step 5: Dump the Results
Now, let’s dump the results to see the output.
Complete Pig Script (people_operations.pig)
Here is the complete Pig script:
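One way the full script could be assembled from the steps above. The comma delimiter, the field types, and every relation name other than people and city_population are assumptions made for this sketch.

```pig
-- people_operations.pig

-- Step 3: load the people data (assumed layout: name,age,city).
people = LOAD 'people_data.txt' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);

-- Step 4.1: sort by age, oldest first.
sorted_people = ORDER people BY age DESC;

-- Step 4.2: group by city.
grouped_by_city = GROUP people BY city;
-- Note: sorted_people and grouped_by_city are defined here but are not
-- dumped or stored below, so they do not appear in the output.

-- Step 4.3: load the city data (assumed layout: city,population) and join.
city_population = LOAD 'city_population.txt' USING PigStorage(',')
                  AS (city:chararray, population:long);
joined = JOIN people BY city, city_population BY city;

-- Step 4.4: project name, age, city, and population.
projected = FOREACH joined GENERATE
                people::name AS name, people::age AS age,
                people::city AS city, city_population::population AS population;

-- Step 4.5: keep people older than 30.
filtered = FILTER projected BY age > 30;

-- Step 5: print the result to the console.
DUMP filtered;
```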
Step 6: Run the Pig Script
Now, you can run this Pig script using the following command:
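From the operating-system shell, the usual form is pig -x local people_operations.pig, where -x local selects local mode. Leaving out -x local runs the script in Pig's default cluster mode. As a sketch of the same thing from inside Pig's interactive Grunt shell, the exec command runs a script in batch mode:

```pig
-- Typed at the grunt> prompt (start the shell with: pig -x local).
exec people_operations.pig
```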
Step 7: Output
After running the script, you should see the filtered and projected results. The output will show people older than 30, along with their name, age, city, and population.
Example Output (after filtering age > 30):
Explanation of Operations:
- LOAD: We load data from text files into Pig relations (people and city_population).
- ORDER: We sort the people by age in descending order.
- GROUP: We group the data by city so we can perform operations like counting or other aggregations per city.
- JOIN: We join the people data with city population data using the city field.
- FOREACH: We project the data to extract specific columns.
- FILTER: We filter the data to include only people who are older than 30.
- DUMP: We display the final result.