This is the first part of a six-part series of Pandas tutorials.
1. Creating Pandas data structures
2. Inspecting data frames
3. Indexing and Selecting Data
4. Merge and Append
6. Grouping and Summarizing
Here we start from the basics and learn hands-on:
- What are Pandas data structures?
- How do we create one?
What are Pandas data structures?
There are two main data structures in Pandas: Series and DataFrames. These are the basic objects we work with in the Pandas library.
Series: A Series is a one-dimensional (1-D) labelled array built on top of a NumPy array (vector). The main difference is that a NumPy array is addressed only by integer position, while a Series also carries an index. An index is nothing but a set of access labels. Simply put, we can name the rows as we like.
Dataframe: A DataFrame is a two-dimensional (2-D) labelled structure in which each column can have a different data type. We saw in NumPy that an array holds a single data type for all its elements. Like a Series, it takes access labels, and here we can label both rows and columns. Typically, one can picture it as a table.
Now that we know what they are, let us open our Jupyter Notebooks and start by importing NumPy and Pandas packages.
import pandas as pd
import numpy as np
‘pd’ and ‘np’ are the standard aliases. You can use any other alias, or import the packages without one.
We will briefly look at the Series object first and then get into dataframes. Eventually, you will work largely with dataframe objects, since manipulating dataframes efficiently and quickly is the most important skill if you choose Python for your data science projects.
The basic syntax for creating a series object is: “series_1 = pd.Series(data, index)”
When calling ‘pd.Series()’, the ‘S’ is always uppercase. The ‘index’ argument is optional; when it is unspecified, Pandas generates a default integer index starting at 0. A series can be created from input data such as:
- Dictionaries, lists, tuples (built-in Python data structures)
- Array (A NumPy object)
- Scalar value
Creating basic series object:
Let us create some series and look at the important properties.
Notice that, while creating our first series object, we passed the data as a ‘list’. The ‘data’ argument therefore accepts another data structure or array object. Try passing a tuple. Since we did not specify an index, Pandas generated a default one. In this example we also inspect the type of the variable ‘s’: it is an object of class ‘Series’, while each element is an integer (‘int’).
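As a minimal sketch of the steps described here (the sample values are assumed for illustration and are not taken from the original notebook), creating a series from a list and inspecting it could look like this:

```python
import pandas as pd

# Sample values are an assumption; any list works.
s = pd.Series([10, 20, 30, 40])

print(s)              # the values next to the default index 0, 1, 2, 3
print(type(s))        # the Series class
print(s.dtype)        # an integer dtype for the elements
print(list(s.index))  # the automatically generated index labels
```

Replacing the list with a tuple such as (10, 20, 30, 40) should give an equivalent series.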
Creating series with index: Let us see some examples by including ‘index’ as an argument.
In the first cell of the example, we give every row its own index label. Notice that, like the ‘data’ argument, the index is also passed as a ‘list’. In the second cell, we passed an array object as data and the built-in ‘range()’ as the index. This shows how much flexibility Pandas gives while creating an object.
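The two cells described above might be sketched as follows. The labels and values are hypothetical, and the third series is an extra illustration of the scalar input mentioned earlier:

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])            # index given as a list
s2 = pd.Series(np.array([10, 20, 30, 40]), index=range(1, 5))  # array data, range() index
s3 = pd.Series(5, index=['x', 'y', 'z'])                    # a scalar is repeated for every label

print(s1['b'])     # access by label
print(s2.loc[1])   # the label 1 points at the first element here
print(s3)
```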
Creating date series:
Next, we will see an example that might have some interesting applications.
There is a function ‘date_range()’ that creates a sequence of dates (a ‘DatetimeIndex’). We simply gave start and end dates as arguments, and the function generated every date in between. Our input dates are in the form ‘MM-DD-YYYY’, but the output is in the standard form ‘YYYY-MM-DD’. Replace end = ’11-16-2017’ with periods = 3, freq = ‘M’ and try to understand the outcome (Pandas 2.2 and later spell this month-end frequency ‘ME’ and deprecate ‘M’).
In the next part, I used the same dates as the index for another series; that is, we passed one sequence as the index of another. Try this! Create two different series and pass one as the index of the other.
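A small sketch of the date example follows. The end date comes from the text; the start date is an assumption, since the original cell is not shown:

```python
import pandas as pd

# start is assumed; end is the date mentioned in the text.
dates = pd.date_range(start='11-14-2017', end='11-16-2017')
print(dates)   # three consecutive days, displayed as YYYY-MM-DD

# The same dates, described by a start and a number of periods
# (the default frequency is daily).
same_dates = pd.date_range(start='11-14-2017', periods=3)
print(dates.equals(same_dates))
```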
Note: This example is a simple way to show how custom indexing can help in data analysis. Suppose you have a student attendance sheet with no dates. Simply adding a date series as the index helps in understanding how regularly any student attends class.
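The attendance idea can be sketched like this. The attendance values and the dates are made up for illustration:

```python
import pandas as pd

# Assumed dates and hypothetical attendance records.
dates = pd.date_range(start='11-14-2017', end='11-16-2017')
attendance = pd.Series(['Present', 'Absent', 'Present'], index=dates)
print(attendance)

# Look up one day by its date label.
print(attendance.loc[pd.Timestamp('2017-11-15')])

# One series can also supply the index of another (data passed as a list).
labels = pd.Series(['a', 'b', 'c'])
by_label = pd.Series([1, 2, 3], index=labels)
print(by_label)
```

Note that the data is passed as a plain list here. If the data were itself a series, Pandas would try to align its existing labels with the new index instead of simply relabelling the rows.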
Creating series from dictionaries:
When creating a series from a dictionary, Pandas treats each ‘key’ of the dictionary as a ‘label’ in the index. Try passing an explicit index argument as well and see what happens.
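A minimal sketch with a hypothetical dictionary of marks shows both cases: keys becoming labels, and an explicit index selecting (and reordering) keys:

```python
import pandas as pd

# Hypothetical data for illustration.
marks = {'maths': 90, 'physics': 85, 'chemistry': 78}

s = pd.Series(marks)
print(s)   # the dictionary keys become the index labels

# With an explicit index, only matching keys are picked up, in the
# order given; a label with no matching key ('biology') gets NaN.
s_idx = pd.Series(marks, index=['physics', 'maths', 'biology'])
print(s_idx)
```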
The syntax for creating a dataframe object is: “df = pd.DataFrame(data, index, columns, dtype)”
When creating a ‘DataFrame’, the ‘D’ and ‘F’ are uppercase. We can give one or more arguments, the basic one being ‘data’. Here, ‘index’ is the same as in series, while ‘columns’ takes a list of values to use as column labels. When unspecified, ‘columns’ defaults to the integer labels 0 to n-1 for n columns.
A dataframe can be created taking input data:
- From other data structures in Python and NumPy:
  - Dictionaries, lists, tuples (Python data structures)
  - List of lists, tuple of tuples (to create 2-D dataframes)
  - 1-D or 2-D NumPy arrays
- From external sources like:
  - Excel spreadsheets
  - .csv extension files
  - .txt extension files
  - Direct web sources
  - Many others
Since Pandas is largely used as a data pre-processing tool, it can read data from several file types directly into dataframes.
Creating dataframes from dictionaries:
We see that a DataFrame is essentially a collection of series objects sharing one index, where each column can take a different data type. What is the data type of each column in this dataframe?
It looks just like a table with columns and rows. It is important to understand that commonly each row is a different observation and each column is an attribute.
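As a sketch of the dictionary case (the column names and values are hypothetical), each key becomes a column and each list supplies that column's values:

```python
import pandas as pd

# Hypothetical data: each key is a column, each list holds its values.
data = {
    'Name': ['Asha', 'Ravi', 'Meera'],
    'Age': [23, 31, 27],
    'Height': [1.62, 1.75, 1.68],
}
df = pd.DataFrame(data)

print(df)          # three rows (observations), three columns (attributes)
print(df.dtypes)   # one data type per column: text, integer and float here
```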
Creating dataframes from lists:
Can you spot any difference in syntax between lists and dictionaries? While creating a dataframe from lists, we need to give the data ‘row-wise’, unlike dictionaries, which take column-wise information. Here, column names have to be given as a separate argument.
Try creating dataframes from tuples and series. Hint: it works much like creating dataframes from lists.
Now that we have seen how to create a simple dataframe from other data structures, we will go ahead and see how to read data from other sources such as Excel files, text files or web pages directly into dataframes.
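The row-wise form might look like the sketch below, again with made-up values. Each inner list (or tuple) is one row, and the column labels are supplied separately:

```python
import pandas as pd

# Hypothetical rows: each inner list is one observation.
rows = [['Asha', 23], ['Ravi', 31], ['Meera', 27]]

df = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df)

# Without 'columns', the default labels 0 and 1 are used.
df_default = pd.DataFrame(rows)
print(list(df_default.columns))

# A list of tuples behaves the same way as a list of lists.
df_t = pd.DataFrame([('Asha', 23), ('Ravi', 31)], columns=['Name', 'Age'])
print(df_t)
```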
Creating dataframes from ‘.csv’, ‘.txt’ files:
Typically, while working on a data science project, we deal with data having hundreds of thousands (lakhs) of rows and hundreds of columns. So it is far more likely that we read data from files than build a dictionary by hand and convert it into a dataframe.
The most commonly used syntax is:
df = pd.read_csv('Path/filename.file_extension', delimiter=delimiter, index_col=index_col, names=names)
In the given syntax, ‘pd.read_csv()’ is the most commonly used function, not only for ‘csv extension’ files but also for reading text files and data from web sources. We can use the ‘index_col’ argument to say which column to use as the row label. ‘names’ is used to give column labels manually.
The delimiter tells Pandas where to split the data in each row into columns. For example, suppose a row reads ‘5|6’. If we pass the argument delimiter = ‘|’, Pandas understands that 5 and 6 should be separated into 2 different columns while making the dataframe.
I created 3 small files, ‘Read_file.csv’, ‘Read_file.xlsx’ and ‘Read_file.txt’, which I will provide access to. Try loading these files while following the notes to get some practice. Remember! Practice is the key. Let us see some examples.
In the first cell, we look at the most basic syntax for reading a file: giving ‘path + file name’ in quotes (‘ ’). To reiterate, ‘read_csv()’ is used to load data from multiple file types (csv, txt, ...).
In the second example, we read data from a text file using the same syntax and added two new arguments. You can open the text file provided and take a look. I used the symbol ‘ | ’ to separate data in the text file. By default, Pandas only understands ‘ , ’ as the delimiter; for every other delimiter, we have to provide the argument. Instead of ‘delimiter’, we can also write sep = ‘|’ (‘delimiter’ is an alias for ‘sep’).
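To see the effect of the delimiter without needing a file on disk, the sketch below reads the ‘5|6’ example from an in-memory string. Using ‘io.StringIO’ and adding a second row are assumptions made so the example is self-contained:

```python
import io
import pandas as pd

# In-memory stand-in for a small text file; the second row is made up.
text = "5|6\n7|8\n"

# With the delimiter given, each row is split into two columns.
df = pd.read_csv(io.StringIO(text), delimiter='|', header=None)
print(df)

# Without it, the default ',' is assumed and each row stays in one column.
df_one = pd.read_csv(io.StringIO(text), header=None)
print(df_one)
```

Here ‘header=None’ tells Pandas that the text has no header row, so the columns get default integer labels.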
We added another argument, ‘index_col’. As the name suggests, it takes the number (or name) of the column to use as the row label/index of the dataframe. Try making the column ‘Name’ the index of the dataframe. Lastly, every empty cell in the loaded data file is filled with ‘NaN’ (not a number) when the dataframe is made.
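The following sketch imitates the text-file example. The contents are hypothetical and are read from an in-memory string instead of ‘Read_file.txt’, so the column names and values are assumptions:

```python
import io
import pandas as pd

# Hypothetical stand-in for a '|'-separated text file.
# The second row has an empty 'Age' cell.
text = "Name|Age|City\nAsha|23|Pune\nRavi||Delhi\n"

# sep='|' sets the delimiter; index_col=0 uses the first column as row labels.
# index_col also accepts a column name instead of a position.
df = pd.read_csv(io.StringIO(text), sep='|', index_col=0)

print(df)                     # the empty cell shows up as NaN
print(df.loc['Asha', 'Age'])  # rows can now be looked up by label
```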
Creating dataframes from excel files:
While reading an Excel file, ‘read_csv()’ is replaced by ‘read_excel()’. The column names in the first output are not informative, so we changed the column names/labels using the argument ‘names’. Sometimes, we may not know which column makes the best index until the data has been explored a bit. The function ‘set_index()’ is used to apply an index to the dataframe at a later time. ‘inplace = True’ modifies the existing dataframe directly instead of returning a new one. Check what happens if ‘inplace = False’.
Similarly, we can read data from a webpage using the syntax: df = pd.read_csv(‘link address’).
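The renaming and ‘set_index()’ steps can be tried without an Excel file. The sketch below is an assumption-laden stand-in: it uses ‘read_csv()’ on an in-memory string with made-up column names, because ‘read_excel()’ needs an Excel engine such as openpyxl to be installed. ‘read_excel()’ accepts the same ‘names’ argument.

```python
import io
import pandas as pd

# Hypothetical data with uninformative header names 'c1' and 'c2'.
text = "c1,c2\nAsha,23\nRavi,31\n"

# header=0 marks the first line as the header to be replaced by 'names'.
df = pd.read_csv(io.StringIO(text), names=['Name', 'Age'], header=0)

# Default (inplace=False): a new dataframe is returned, df is unchanged.
new_df = df.set_index('Name')
columns_before = list(df.columns)

# inplace=True: df itself is modified and nothing is returned.
df.set_index('Name', inplace=True)

print(columns_before)    # 'Name' was still an ordinary column
print(list(df.columns))  # now 'Name' has moved into the index
```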
In summary, we have seen how to create Pandas ‘series’ and ‘dataframes’ from different sources. We now have the data available in a form that Pandas understands. Now let’s get started with the different concepts involved in data pre-processing.
Next! Second tutorial: Inspecting Pandas dataframes (Part 2)