This is the sixth and final part of a six-part series of Pandas tutorials.
|1.Creating Pandas data structures|
|3.Indexing and Selecting data|
|4.Merge and Concat|
|6.Grouping and Summarizing|
Grouping and summarizing are some of the most frequently used operations in data analysis, especially while doing exploratory data analysis (EDA), where comparing summary statistics across groups of data is common. Grouping together with summarization is used for answering categorical questions.
For example, in the superstore sales data we are working with, you may want to look at the most profitable shipping mode or the best-selling product category. This kind of information is captured using ‘groupby’ and summarization functions like ‘mean()’, ‘value_counts()’, etc.
Grouping analysis can be thought of as having three parts:
- Separating the data into groups (e.g. groups of customer segments, product categories, etc.)
- Applying a function to each group (e.g. mean or total sales of each customer segment)
- Transforming the results into a data structure showing the summary statistics (Optional, only if we want to further act upon data.)
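The three parts above can be sketched on a small example. This is only an illustration: the rows below are made-up toy data, and the column names ‘Segment’ and ‘Sales’ are assumptions, not the actual superstore file.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office", "Corporate"],
    "Sales": [100.0, 250.0, 50.0, 80.0, 150.0],
})

# 1. Separate: collect the rows that belong to each segment.
groups = {name: part for name, part in df.groupby("Segment")}

# 2. Apply: compute the total sales inside each group.
totals = {name: part["Sales"].sum() for name, part in groups.items()}

# 3. Transform: put the per-group results into one data structure.
summary = pd.Series(totals, name="Sales")
print(summary)

# The same three parts written as a single groupby expression.
shortcut = df.groupby("Segment")["Sales"].sum()
print(shortcut)
```

In practice the one-line form is what you will usually write; the longer version is only there to show what happens behind it.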
We will work through this tutorial by answering a few analytical questions. In the superstore data, which shipping mode has the highest profit?
We answer the question by dividing it into parts. First, we want to know how many shipping modes there are. ‘unique()’ returns the distinct values in a particular column. Here, we find that there are 4 types of shipping modes. Try to find the distinct categories of products.
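A minimal sketch of this first check, assuming the column is called ‘Ship Mode’. The rows are hypothetical toy data standing in for the superstore file, so only the technique carries over, not the values.

```python
import pandas as pd

# Hypothetical toy rows; "Ship Mode" and "Profit" are assumed column names.
df = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Second Class", "First Class",
                  "Same Day", "Standard Class", "Second Class"],
    "Profit": [40.0, 25.0, 30.0, 10.0, 20.0, -5.0],
})

# unique() returns the distinct values in order of first appearance.
modes = df["Ship Mode"].unique()
print(modes)

# nunique() returns only how many distinct values there are.
n_modes = df["Ship Mode"].nunique()
print(n_modes)
```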
Step 1: Let us divide the data into groups using ‘ship mode’. ‘groupby’ does that for us: it splits the data into one group per category. It creates a groupby object; printing that object shows only an object reference, and the summary values appear once we apply an aggregate/summary function to it.
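Step 1 might look like the sketch below. The data is again a hypothetical toy frame with an assumed ‘Ship Mode’ column. It also shows that the groupby object already knows its groups before any aggregation is applied.

```python
import pandas as pd

# Hypothetical toy rows; "Ship Mode" and "Profit" are assumed column names.
df = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Second Class", "First Class",
                  "Same Day", "Standard Class", "Second Class"],
    "Profit": [40.0, 25.0, 30.0, 10.0, 20.0, -5.0],
})

grouped = df.groupby("Ship Mode")

# Printing the object gives only a reference, not the grouped rows.
print(grouped)

# The group labels and their rows can still be inspected directly.
n_groups = grouped.ngroups
labels = sorted(grouped.groups.keys())
first_class_rows = grouped.get_group("First Class")
print(n_groups, labels)
print(first_class_rows)
```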
Step 2, 3: Calculate the total profit in each category using sum() and make a dataframe out of the result.
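One way to write steps 2 and 3 is sketched here on hypothetical toy data (the column names ‘Ship Mode’ and ‘Profit’ are assumptions). sum() on a single grouped column returns a Series, and to_frame() is one of several ways to turn that into a dataframe.

```python
import pandas as pd

# Hypothetical toy rows; "Ship Mode" and "Profit" are assumed column names.
df = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Second Class", "First Class",
                  "Same Day", "Standard Class", "Second Class"],
    "Profit": [40.0, 25.0, 30.0, 10.0, 20.0, -5.0],
})

# Step 2: total profit per shipping mode (a Series indexed by ship mode).
profit_by_mode = df.groupby("Ship Mode")["Profit"].sum()

# Step 3: convert the Series into a one-column dataframe.
profit_df = profit_by_mode.to_frame()
print(profit_df)
```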
Step 4: Since the question is about the shipping mode with the highest profit, we sort the values of the dataframe in descending order using sort_values(). Remember! These are the kinds of questions we constantly work with while doing EDA in data science projects.
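The sorting step could be sketched as follows, once more on hypothetical toy data with assumed column names. idxmax() is shown as an alternative when only the top group is needed; it is not necessarily what the original notebook used.

```python
import pandas as pd

# Hypothetical toy rows; "Ship Mode" and "Profit" are assumed column names.
df = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Second Class", "First Class",
                  "Same Day", "Standard Class", "Second Class"],
    "Profit": [40.0, 25.0, 30.0, 10.0, 20.0, -5.0],
})

profit_df = df.groupby("Ship Mode")["Profit"].sum().to_frame()

# Step 4: sort so that the most profitable shipping mode comes first.
ranked = profit_df.sort_values(by="Profit", ascending=False)
print(ranked)

# The first index label is the shipping mode with the highest total profit.
top_mode = ranked.index[0]

# Alternative: ask directly for the label of the largest value.
top_mode_direct = profit_df["Profit"].idxmax()
print(top_mode, top_mode_direct)
```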
Aggregating functions are the ones that reduce the dimension of the returned objects. Some common aggregating functions are tabulated below:
|mean()||Compute mean of groups|
|sum()||Compute sum of group values|
|size()||Compute group sizes|
|count()||Compute count of group|
|std()||Standard deviation of groups|
|var()||Compute variance of groups|
|sem()||Standard error of the mean of groups|
|describe()||Generates descriptive statistics|
|min()||Compute min of group values|
|max()||Compute max of group values|
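A short sketch of a few of these functions on hypothetical toy data (assumed column names, with one missing profit value added on purpose). It shows the difference between size(), which counts all rows in a group, and count(), which counts only non-missing values, and it uses agg() to apply several functions in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with one missing value; column names are assumptions.
df = pd.DataFrame({
    "Ship Mode": ["First Class", "First Class", "Same Day", "Same Day", "Same Day"],
    "Profit": [10.0, np.nan, 5.0, 15.0, 10.0],
})

profit = df.groupby("Ship Mode")["Profit"]

# size() counts every row in the group, missing values included.
sizes = profit.size()

# count() counts only the non-missing values in the group.
counts = profit.count()

# agg() applies several aggregating functions at once, one column per function.
summary = profit.agg(["sum", "min", "max"])

print(sizes)
print(counts)
print(summary)
```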
Answer the following questions on shipping mode using the aggregate functions given above.
- Average profit by ‘ship mode’
- Get the descriptive statistics of groups. (Hint: Refer to inspecting dataframe tutorial.)
- Count the values in each group.
Try to interpret the results. Interpreting results develops intuition, which is a much-needed skill in these kinds of projects.
Grouping is a very important topic, and we have covered only the basics here. Please refer to this material for a more extensive read on the topic.
Well! This is all the Pandas you need to kick-start your journey through Python for data science.
Congratulations! You are almost there.
The next series in line covers Matplotlib, a data visualization library.