This is the forth in the set of 6 parts of Pandas tutorials.

1.Creating Pandas data structures
2.Inspecting dataframes
3.Indexing and Selecting data
4.Merge and Concat
5.Arithmetic Operations
6.Grouping and Summarizing

Any business problem needs data from multiple departments to analyse in order to get some perspective. Each department uses a different database according to their functionality. Every time we do some analysis, it is inevitable that we combine dataframes from different databases.

For example: ‘X’ engineering college wants to understand the relation between a student’s ‘grades’ and their ‘salary after 5 years of passing out’. Student grades are available in the respective department databases while their current salaries are available in the alumni database. In order to do this analysis, we have to combine both the databases. 

In this tutorial, we will learn how to ‘merge’ and ‘concatenate’ multiple dataframes.

Merging dataframes: ‘Merge’

The basic syntax is: df_merged = pd.merge(df_1, df_2, how=‘inner/outer’, on=‘Column name’)

Merge: ‘inner’ Let us look at an example:

In the example with visa-applicants data (df_excel_new), we want to add ‘Period of stay’ for each applicant. This data is available in a different dataframe, ‘df’ . We merged both the dataframes using the argument ‘how = inner’. This works same as ‘inner join’ in MySql. The word ‘inner’ makes sure that only common data in both the dataframes is combined in the resultant dataframe.

The other argument on = ‘Applicant No’ is the common column on which the dataframe has to merge.

Merge: ‘Outer’

With argument, how = ‘outer’, we combine dataframes including all rows from both dataframes. Please observe that, if an ‘Applicant no’ is not present in one of the dataframes, the values are populated as ‘NaN’. Check ‘Applicant 6’ to understand the result.

Similarly, we can perform ‘how = left’ and ‘how = right’ merges. Try and interpret the results.

Concatenating dataframes: ‘concat’

The basic syntax is: pd.concat([df1, df2], axis = 0/1)

‘pd.concat’ takes list of dataframes as first argument and axis to define row or column wise operation as the other argument. We can take more than 2 dataframes to combine.

Note: Repeating row labels is not a good practice.

Guess and try the following:

Hint 1:  ‘Axis = 1’ performs column wise operation.

Hint 2: We can concatenate subset dataframes.

Try to find the resemblance of query 1 with ‘merge’.

In summary, we learnt how to ‘merge’ and ‘concat’ two dataframes. ‘concat’ can be used  in place of ‘merge’ but the reverse is not true.  There is another syntax ‘df1.appned(df2)’ which we did not discuss in this  tutorial which works similar to ‘concat’.

Next we shall discuss ‘Arithmetic Operations’ on dataframes. This is similar to the arithmetic operations on NumPy arrays.

 Next! Fifth tutorial: Arithmetic Operations (Part 5)

Leave a Reply

Your email address will not be published. Required fields are marked *