This is the third in the set of 6 parts of Pandas tutorials.
|1.Creating Pandas data structures|
|3.Indexing and Selecting data|
|4.Merge and Append|
|6.Grouping and Summarizing|
In this part, we will learn
- Regular python way of indexing a dataframe
- Position based indexing (iloc) – (‘Pandas’ recommends this way of indexing)
- Label based indexing (loc) – (‘Pandas’ recommends this way of indexing)
Indexing and Selecting data:
Selecting rows: We look at 2 ways of selecting rows from a dataframe. It is similar to NumPy arrays or other data structures in python. While using pythonic way of indexing, the basic syntax is df[start_index: end_index].
In the first cell, we selected rows using default indices. For example, when we request for ‘2:5’ rows, python interprets that as a request to pull out rows from 2 to (5 -1= 4)th row.
In the second cell, we selected rows using row labels. The basic syntax is df[start_label: end_label]. While using row labels, pandas interpret that as a request to retrieve data including both start_label and stop_label. For instance, in the above example: since the given range is ‘Applicant 3’: ‘Applicant 5’, it selected rows with row labels: ‘Applicant 3’, ‘Applicant 4’ and ‘Applicant 5’.
Selecting columns: Selecting columns is simply by calling the column name. There are two ways of selecting single column from a dataframe. Syntax is df[‘column_name’] or df.column_name.
While retrieving single column, it should be noted that pandas makes it a series object since it has only one column. It is a matter of choice which syntax to use while selecting single column. Throughout the tutorial, I will use the syntax df[‘ column name ‘] to avoid ambiguity.
To select 2 or more columns, column names should be given as list of lists, [[‘ ‘ , ’ ‘ ]]. Syntax is df[[‘column name 1 ’ , ’column name 2 ‘]]. Check the type of the object retrieved with multiple columns.
Position and label based indexing: (using df.loc and df.iloc)
We have seen some ways of selecting rows and columns from dataframes. Let’s now see some other ways of indexing dataframes that pandas recommends, since they are more explicit (and less ambiguous).There are two main ways of indexing here:
- Position based indexing using df.iloc: ‘iloc’ extends to integer location based indexing
- Label based indexing using df.loc: ‘loc’ is simply label based indexing.
Using both the methods, we will do the following indexing operations on a dataframe:
- Selecting single elements/cells
- Selecting single and multiple rows
- Selecting single and multiple columns
- Selecting multiple rows and columns
The syntaxes are similar to what we discussed earlier, we simply add ‘iloc’ and ‘loc’ functions before indexing. These functions make it easier for pandas to understand the style of indexing.
Position based indexing using ‘iloc’: Even when we have labels to our rows and columns, inherently, objects are saved using the traditional style of indexing that python uses. By calling the function ‘iloc’, we are explicitly asking pandas to access the dataframe using the default indexing style of python/pandas from ‘0-n-1’.
Position based indexing ‘iloc’: Selecting single element/cell
In this example we are simply selecting the second row and third element by accessing ‘[1,2]’. Guess which element do we access using iloc.[2,4].
Position based indexing ‘iloc’: Selecting single and multiple rows
The general syntax for selecting a row or a set of rows using ‘iloc’ is ‘df.iloc[i:j , :]’ or simply ‘df.iloc[i:j]’. ‘i : j’ refer to range of rows and ‘ : ’ without range refers to all columns. This indexing style is exactly same as we discussed in NumPy arrays. The only difference is the use of ‘iloc’.
Now let us look at the examples, first cell refers to selecting 5th row (single row) of a dataframe. Second cell refers to selecting 1st and 2nd rows.
Remember! This syntax selects only till j-1 rows.
Position based indexing ‘iloc’: Selecting single and multiple columns
Selecting columns is similar to selecting rows. Only the ranges are reversed in the syntax. The general syntax for selecting a row or a set of rows using ‘iloc’ is ‘df.iloc[: , i : j]’. ‘i : j’ refer to range of columns and ‘ : ’ without range refers to all rows. Try to access Name and Visa-type columns only.
Position based indexing ‘iloc’: Selecting multiple rows and columns
Here, we are simply selecting the subset of a dataframe. The syntax is ‘df.iloc[i:j,x:y]’. In the example, we are selecting (1st and 2nd) rows and (3rd and 4th) columns. Try selecting (3rd and 5th ) rows and (1st and 5th) columns.
Label based indexing, ‘loc’: It is a more user friendly syntax than iloc that we discussed just now. It is simply because we can look at the label in a dataframe but not the integer index. Whenever we want a column or a row, it is easy to access it simply by calling it with a label.
The general syntax is df.loc[‘row label range’ , ‘column label range’ ]
Label based indexing, ‘loc’: Selecting single element/cell
Once we know both column label and row label, the syntax is actually straight forward. In the example, we want to know ‘what the ‘visa-type’ of ‘applicant 3’ is’? The general syntax is df.loc[‘row label’ , ‘column label’]. Try working on query in second cell.
Label based indexing, ‘loc’: Selecting single and multiple rows
Selecting single row using ‘loc’ is similar to using ‘iloc’. Simply, ‘Index’ in iloc is replaced by ‘label’ in loc.
There is a small difference while selecting multiple rows. While using iloc[i : j], rows are selected till j-1 rows but loc[‘row label i’ : ‘row label j’] works slightly different. Here, rows are selected including jth row. We can also select by giving a list of lists.
It is similar in all other cases while:
- Selecting single and multiple columns: ‘df.loc[: , column label x : column label y ]’.
- Selecting rows and columns: ‘df.loc[‘row label i’ : ‘row label j’ , column label x : column label y ]’
Logic based indexing:
This is an extremely powerful syntax that we use while doing exploratory data analysis (EDA). Here, it simply selects all rows where the specified logic is ‘true’.
The general syntax is: df[df[‘column/row name’] <condition> <value>]
Let us see one more example, which can give more data science perspective. In the ‘superstore data’ we saw earlier, let us look at all the orders that gave profit more than 5000 usd.
When we work on this query, we understand that,
- There are only 3 orders where profit is higher than 5000 usd.
- All the products are in the ‘sub category’ of ‘copiers’. This should give us an idea that, these orders are all bulk orders from companies where they need stationary.
- One glance at ‘Sales’ and ‘profit’ columns tell us that, these bulk orders are giving 40 -50% profit in all three cases.
If I have to give some concluding remarks on this analysis, I would say:
- It’s better for sales people to target this customer segment since less volume of customers provides more profits.
- ‘copiers’ sub-category have more bulk orders.
I want to repeat that these are the kind of queries you repeatedly work on when you have a data science assignment.
In summary, we have seen different types of selecting subset dataframes. Now, we go in other direction, we will try to merge and append data in the next section.
Next! Forth tutorial: Merge and Append (Part 4)