Python has become an essential part of the learning process for the data science community. I plan to cover this topic as a series of articles, going carefully through the essential topics with enough examples to understand each concept. Broadly, the series will cover:
- Why choose Python for data science?
- What are the essential libraries?
- Concepts on numerical computing library – NumPy
- Concepts on data processing library – Pandas
- Concepts on visualization library – Matplotlib
In the present article, I would like to give some insight into how the need to develop these libraries evolved and why Python is the preferred choice in the data science community.
How did the need to develop Python libraries evolve?
Before machine learning engineer and data scientist became job titles in industries like retail, banking and manufacturing, machine learning algorithms were mostly a mathematical research tool. Academicians, statisticians and research scholars used these algorithms to support their results. While machine learning was only a laboratory tool, statistical programming languages like R and numerical computing environments like MATLAB served the purpose.
These languages are easy to understand but mathematically intensive. When machine learning algorithms started being used for real-time business solutions, however, data scientists and software developers had no common interface to work with.
The business solution cycle looked like this:
- Programmer: Got the business data out using query languages
- Data Scientist: Took the data, did the necessary analysis, found the model
- Programmer: Deployed the model in real time, made the predictions
Since machine learning models keep learning from new data, the cycle then repeated:
- Programmer: Got the data out
- Data Scientist: Added the data to the algorithms, improved the model
- Programmer: Re-deployed the model in real time
This became a monotonous cycle. There was no one-stop solution, no single programming language that could do it all. It was then that a few data scientists and programmers developed powerful statistical and data visualization libraries on top of Python and its then-existing libraries.
They chose Python because it is a high-level language that:
- Is easy for a data scientist to understand.
- Handles the different data structures that a developer needs.
These contributions from the data science and programming communities, together with the advantage of being open source, made Python a powerful language for machine learning.
Why Python for data science over other languages?
This is a much-debated question across industries. As mentioned earlier, Python stands out as a one-stop solution. It can be used across different domains such as web development, desktop applications and machine learning, as well as for general-purpose programming.
- Availability of data science libraries: Much of the success of Python can be attributed to the availability of data science libraries, all of them open source. A few significant libraries include:
- NumPy: A numerical computing library that performs mathematically intensive operations on multi-dimensional arrays and matrices.
- Pandas: It is built on top of NumPy. It is a data analysis library for tabular data that supports database-style operations such as filtering, grouping and joining, along with advanced numerical computations on that data.
- Matplotlib: It is a 2D plotting library. It is a data visualization toolkit that produces histograms, bar charts, scatter plots and many more. It also appeals to the scientific community, where frequency-domain analysis such as spectral analysis is common.
These are only a few of the best-known libraries. I plan to discuss them in future articles of the aforementioned series.
- Scalability and speed: Python is a high-level language, so being the fastest is not what it promises. But because libraries like NumPy hand the heavy computation to compiled code, it is fast enough for most data science work and is often competitive with high-end computing tools like MATLAB or R. More than speed, scalability is its strong suit. Anyone can develop an end-to-end application using Python.
- Extensive community support: For any open source language, community support is the key to its success. It helps new aspirants to quickly resolve the problems they face. It also helps develop more sophisticated libraries with ease.
- Easy to learn for non-computer science graduates: This is primarily because of the flexibility of the language. The syntax is closer to plain English than that of most other programming languages. Apart from that, the community has become instrumental in creating extensive course materials that are accessible and easy to understand.
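To give a first taste of the libraries named in the list above, here is a minimal sketch using NumPy and Pandas. The data and the column names (`store`, `sales`) are made up purely for illustration, and the example assumes both libraries are installed (they ship with the Anaconda distribution).

```python
import numpy as np
import pandas as pd

# NumPy: math on a 2-D array without writing explicit loops
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
column_means = matrix.mean(axis=0)  # mean of each column
scaled = matrix * 10                # multiplies every element by 10

# Pandas: database-style operations on tabular data (hypothetical sales data)
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [100, 150, 200, 50],
})
big_sales = sales[sales["sales"] > 90]                    # filter rows
total_per_store = sales.groupby("store")["sales"].sum()   # aggregate per store

print(column_means)
print(total_per_store)
```

The filter and group-by lines play roughly the role of a `WHERE` and a `GROUP BY` clause in a query language, while the NumPy lines show array arithmetic applied to whole arrays at once.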
With this I conclude the article. There is only so much one can take in at a time, particularly when new to a topic. So hang in there; I will come back with detailed notes on Python libraries for data science and machine learning in the upcoming articles.
Articles next in the series:
In the immediate future, we will see comprehensive beginner tutorials on:
- NumPy in parts
- Pandas in parts
- Matplotlib in parts
I would like to take this series further, but these are the immediate plans.
Next! NumPy (Numerical Python Library) for beginners
I have provided a link to install the Anaconda distribution of Python in the next section. Install it if you do not already have it in place.
The Anaconda distribution is one of the most widely used for data science. It is an open source distribution.
The Anaconda installation steps are described in detail on its web page. Just make sure that you install Python 3.x; typically, the latest available Python 3 version is a good choice.
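Once the installation finishes, a quick check like the following sketch can confirm that the interpreter you are running is Python 3. It uses only the standard library; the helper name `is_python3` is just an illustrative choice.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3

# Show the full version string and the result of the check
print("Running Python", sys.version.split()[0])
print("Python 3 interpreter:", is_python3())
```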
Link to the Anaconda installation page: