This is the second of six parts in this series of Pandas tutorials.
|1.Creating Pandas data structures|
|3.Indexing and Selecting data|
|4.Merge and Append|
|6.Grouping and Summarizing|
Pre-processing data is one of the most time-consuming tasks in any data science assignment. Data comes in many different forms, and it may contain missing values as well as duplicate rows. It is desirable to act on such data elements while performing exploratory data analysis (EDA) and before applying machine learning (ML) algorithms. So every time we fill a value or delete a row, it is advisable to revisit the properties of the data and verify that the changes are reflected as intended.
In this part, we will look at five or six fundamental built-in attributes and methods that give an overview of any given dataframe and its properties. Since a ‘Series’ is essentially a single column of a ‘dataframe’, we will work mainly on dataframes and discuss series only when required.
Inspecting a dataframe is the most frequently repeated activity while pre-processing data. The most important tools are listed below. (Guess the output of each one before trying it.)
- shape: Hint: Relates to dimensions
- info(): Hint: Talks about columns and their data types
- describe(): Hint: Descriptive statistics (Aggregate functions)
- head() or tail(): Hint: Displays the dataframe
As a starting point, let’s inspect one of the dataframes we created. Once we are familiar with what each function does, we will move to a more realistic data set to understand the significance of these functions.
Inspecting dataframes: Part-1 – Shape, info(), columns:
Most of these are self-explanatory, but let us discuss them anyway. ‘shape’ gives the dimensions (rows, columns) of a given dataframe. A series is a 1-D object, so its shape is a one-element tuple holding the number of rows. Similarly, the ‘columns’ attribute holds all the column names/labels. ‘df_excel_new.values’ is another such attribute. Guess the output and then try it.
‘info()’ displays a collection of information: the column names, their data types, the number of non-null values in each column, and the memory usage of the dataframe. With one look at the output of info(), we can answer questions like ‘How large is the dataframe?’, ‘Which columns have missing values?’ and ‘Which column has the most missing values?’
This kind of information helps us decide whether to drop a column or fill in its values while performing exploratory data analysis (EDA). For example, suppose we have a column ‘X’ where 4 out of 5 values are ‘NaN’. In such a case, dropping the column is usually reasonable, since 80% of its data is missing.
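As a minimal sketch of these attributes, the snippet below builds a small made-up dataframe. The name `df` and its contents are assumptions for illustration and stand in for ‘df_excel_new’ from the previous part; only the ‘Age’ column is known to exist in the original.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe created in the previous part.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Carlos"],
        "Age": [28, 34, 41],
        "City": ["Pune", "Leeds", "Lima"],
    }
)

print(df.shape)          # (rows, columns) of the dataframe
print(df["Age"].shape)   # a Series has a one-element shape tuple
print(list(df.columns))  # column labels
print(df.values)         # the underlying data as a 2-D NumPy array
```

Note that none of these take parentheses: they are attributes of the object, not methods.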
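The reasoning above can be sketched in code. The dataframe below is made up to match the example (column ‘X’ with 4 of 5 values missing), and the 0.75 cut-off is an arbitrary assumption, not a rule from pandas or from this tutorial.

```python
import numpy as np
import pandas as pd

# Made-up data: column "X" has 4 missing values out of 5.
df = pd.DataFrame(
    {
        "Age": [28, 34, 41, 25, 37],
        "X": [np.nan, np.nan, 7.0, np.nan, np.nan],
    }
)

df.info()  # prints column names, dtypes, non-null counts and memory usage

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = df.isna().mean()
print(missing_share)

# Drop every column whose missing share exceeds an assumed cut-off.
cutoff = 0.75  # arbitrary threshold chosen for this illustration
cols_to_drop = missing_share[missing_share > cutoff].index
df_clean = df.drop(columns=cols_to_drop)
print(list(df_clean.columns))
```

Whether to drop or fill a column still depends on what the column means, so a threshold like this is a starting point at best.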
Inspecting dataframes: describe()
By default, ‘describe()’ picks up all the columns with a numeric datatype (int, float, ...) and summarizes the distribution of each column. In our dataframe, only the ‘Age’ column has an integer datatype.
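A small sketch of this default behaviour, using a made-up dataframe (the names and values are assumptions) with one text column and one integer column:

```python
import pandas as pd

# Made-up data: one text column, one integer column.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Carlos", "Dina"],
        "Age": [28, 34, 41, 25],
    }
)

# By default only numeric columns are summarized, so "Name" is skipped.
summary = df.describe()
print(summary)

# include="all" asks pandas to summarize the non-numeric columns as well.
print(df.describe(include="all"))
```

The rows of the default summary are count, mean, std, min, the 25%, 50% and 75% percentiles, and max.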
Inspecting dataframes: head() or tail():
‘head()’ displays the requested number of rows from the top (5 by default). ‘tail()’ does the same from the bottom. ‘head()’ is typically the first function called after a dataframe is created. It gives us an idea of what our dataframe looks like.
Now let us pick a dataset where the significance of these functions becomes clear.
This is the Global Superstore data set, which is commonly used for learning visualization. It can be downloaded from http://www.tableau.com/sites/default/files/training/global_superstore.zip. First, I want to see how big the data is. A simple ‘df_superstore.shape’ shows that it has more than 50 thousand rows, so there is no point in displaying the entire data set.
One look at the columns tells us that this data records all the orders made by customers across different outlets of the same store in many countries.
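For instance, on a made-up ten-row dataframe (the column name ‘Row’ is an assumption for illustration):

```python
import pandas as pd

# Made-up dataframe with rows numbered 1 to 10.
df = pd.DataFrame({"Row": range(1, 11)})

print(df.head())   # first 5 rows, since 5 is the default
print(df.tail(3))  # last 3 rows
```

Both methods return a new, smaller dataframe, so the result can be inspected further or simply printed.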
Try ‘head()’ and ‘info()’ and learn.
- Which column has the highest number of missing values, and can it be dropped?
- Guess what could serve as the index in place of ‘row id’. Remember: it is good practice to have a unique index for each row.
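One way to approach both questions is sketched below on a tiny made-up dataframe. The name `orders`, the ‘Order ID’ column and all the values are assumptions for illustration; run the same calls on ‘df_superstore’ to get the real answers.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the superstore data.
orders = pd.DataFrame(
    {
        "Row ID": [1, 2, 3, 4],
        "Order ID": ["A-100", "A-100", "B-200", "C-300"],
        "Postal Code": [np.nan, 10024.0, np.nan, np.nan],
    }
)

# Number of missing values per column, largest first.
missing_counts = orders.isna().sum().sort_values(ascending=False)
print(missing_counts)

# A column can serve as a unique index only if its values never repeat.
for col in ["Row ID", "Order ID"]:
    print(col, orders[col].is_unique)
```

A column that passes the ‘is_unique’ check can be promoted with ‘set_index()’.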
Let us see what ‘describe()’ function tells us.
‘describe()’ simply picked up all the integer and float columns and summarized their statistics. Now, what can we infer?
- Statistics for Row ID and Postal code are meaningless, since both are identifiers rather than quantities.
- The typical order earns a profit of about USD 9, in the sense that half of the orders fall at or below it. (Check the 50th percentile of profit.)
- Not many discounts are running in the stores.
Note: 50th percentile is the median of the data.
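A quick check of this on a made-up series (the profit values below are invented for illustration and are not taken from the superstore data):

```python
import pandas as pd

# Invented profit values with one large outlier.
profit = pd.Series([-20.0, 3.0, 9.0, 15.0, 250.0])

print(profit.median())         # middle value of the sorted data
print(profit.quantile(0.5))    # 50th percentile, the same number
print(profit.describe()["50%"])  # the "50%" row reported by describe()
print(profit.mean())           # pulled upward by the outlier
```

The median is less sensitive to extreme values than the mean, which is why the two can differ widely on skewed columns such as profit.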
Likewise, try and infer insights of your own.
In summary, inspecting a dataframe is an important step and the most repeated activity during pre-processing. A set of simple functions gives a better idea of the data at hand than the raw data itself.
Next! Third tutorial: Indexing and Selecting Pandas dataframes (Part 3)