5th Story — Pandas, Pandas, Pandas… Libraries are your friends.

You are already a Pythonista (or like me at this point, you think you are and will eventually discover you still have a LOT to learn, but don’t worry too much about it, it happens to most of us, it is called the Dunning–Kruger effect), and are now ready to get to the next level.

You could simply go right ahead into building ML models or jump into deep learning and get confused between Convolutional Neural Networks and Generative Adversarial Networks, as well as stumble into a Sigmoid and fall right through a Leaky ReLU. If none of this made any sense, breathe deeply, and keep reading.

Libraries will be your best friends for a long time. Being able to program anything from scratch is great, but the truth is that we rarely have the time for it. The good news is that some amazing and generous people have had the time and patience to write thousands upon thousands of lines of code and offer them for free, so you can spend your time building even better models. These are called libraries (or packages in R).

You are probably already familiar with some of them, for example datetime, math or itertools, which ship with Python itself. All of them are extremely useful, but today I will focus on the most important libraries you should learn specifically for data science.
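In case those three names do not ring a bell, here is a minimal sketch of what each one does. The dates and the list of names are made up purely for illustration.

```python
import itertools
import math
from datetime import date

# datetime: count the days between two dates (illustrative dates)
start = date(2021, 1, 1)
end = date(2021, 3, 8)
days_elapsed = (end - start).days

# math: square root of a number
root = math.sqrt(16)

# itertools: every pair you can form from a small list
pairs = list(itertools.combinations(["numpy", "pandas", "scipy"], 2))

print(days_elapsed)
print(root)
print(pairs)
```

None of these need installing, since they come with Python. The libraries in the rest of this story are installed separately.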

As I have mentioned before, being the fast, impatient, Formula One speed learner, I always wanted to learn what was next, and that left me with a lot of holes in my knowledge, which I then had to go back and fill. The biggest one of these was the NumPy library.

NumPy stands for Numerical Python, and it is an incredible library for dealing with all things mathematics. It comes packed with functions for all your math and many of your statistics needs, but its most important feature, and the one you should spend enough time understanding, is the array. Arrays are one of the most basic and important structures in programming, and they are also fundamental to understanding Machine Learning. Without arrays, we have no Support Vector Machines, no Natural Language Processing and no Computer Vision. Arrays will be with you every step of the way, and NumPy is the library for working with them. I will let you discover this amazing library on your own, but please, do take enough time to learn it well, and make sure you understand arrays deeply. NumPy is also the foundation of my favorite package for data manipulation (and one that is vital to learn), Pandas.
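To give you a taste of arrays before you explore on your own, here is a small sketch. The numbers and variable names are invented for the example, and it only scratches the surface of what NumPy offers.

```python
import numpy as np

# A one-dimensional array built from a plain Python list (made-up values)
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])

# Arithmetic applies to every element at once, no loop needed
heights_m = heights_cm / 100

# Basic statistics are built in
mean_height = heights_cm.mean()

# A two-dimensional array: 2 rows and 3 columns
table = np.array([[1, 2, 3], [4, 5, 6]])
column_totals = table.sum(axis=0)  # add up each column

print(heights_m)
print(mean_height)
print(table.shape)
print(column_totals)
```

The thing to notice is that dividing by 100 touched all four values in one go. That element-by-element behavior is what makes arrays so different from regular Python lists.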

What can I say, I love Pandas and also Pandas (the library and the bear). Pandas, the library, is filled with all the tools you need for data manipulation, in other words, you will use this over and over again to clean and prepare your data. Pandas the bear is mostly filled with bamboo.

The most important feature of Pandas, the library, is that it lets you create the data structures you will need to analyze your data, from the one-dimensional Series (great for time series) to the all-powerful DataFrame (I just love table structures). Make sure that, after you are comfortable with NumPy, you take your time using and understanding Pandas. You can also take your time to get to know and understand Pandas, the bear, but apart from giving me personal joy, I have found no use for them in data science, yet.
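Here is a rough sketch of what cleaning and preparing data looks like in practice. The table is entirely made up (bear names, ages and bamboo rations included), and filling a gap with the median is just one of several ways to handle missing values.

```python
import pandas as pd

# A tiny, invented table with one missing age
df = pd.DataFrame({
    "name": ["Bao", "Mei", "Ling", "Tai"],
    "age": [3.0, 7.0, float("nan"), 12.0],
    "bamboo_kg": [12.5, 20.0, 18.0, 25.5],
})

# How many ages are missing before cleaning?
missing_before = int(df["age"].isna().sum())

# Fill the gap with the median of the known ages
df["age"] = df["age"].fillna(df["age"].median())

# Keep only the rows that meet a condition
big_eaters = df[df["bamboo_kg"] > 15]

# Each column is a Series with its own methods
mean_age = df["age"].mean()

print(df)
print(big_eaters["name"].tolist())
print(mean_age)
```

Counting gaps, filling them, and filtering rows are moves you will repeat on almost every dataset you touch.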

Then we move on to SciPy, your one-stop shop for every scientific need. It builds on the NumPy multidimensional array and will add plenty of amazing functions to your toolbox, from statistics to signal processing. You might get along well without SciPy, as it is not strictly needed for many data science tasks, but it is good to have a more varied toolset under your belt.
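As one small example of the statistics side, this sketch compares two groups of measurements with an independent-samples t-test. The numbers are invented for illustration, and a t-test is only one of the many tools in scipy.stats.

```python
import numpy as np
from scipy import stats

# Two made-up groups of measurements stored as NumPy arrays
group_a = np.array([5.1, 4.9, 5.6, 5.8, 6.0])
group_b = np.array([6.2, 6.8, 7.1, 6.5, 7.4])

# Independent-samples t-test: are the two group means different?
result = stats.ttest_ind(group_a, group_b)

print(result.statistic)
print(result.pvalue)
```

Notice that the inputs are plain NumPy arrays, which is exactly why it pays to learn NumPy first.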

In my opinion, these are the three main libraries you will need on your data science journey, so please take the time to learn them well, practice with them and build a few interesting projects that use them. The other reason they matter is that they have become commonplace in the industry, so odds are your future colleagues are already using them and are familiar with them. Remember this the next time you see the now famous import numpy as np and import pandas as pd instructions during your learning.

There are a few other very important libraries, such as TensorFlow, scikit-learn and Matplotlib, but since these are much larger, they will all get their own story in the future (well, Matplotlib comes next time).

Enjoy learning all these libraries and keep moving forward on your Road to Data Science.

And as a reward for reading the whole article, I leave you with some baby Pandas (the bear not the library):

Hope we cross paths through our Journeys…

Jack Raifer Baruch

Follow me on Twitter: @JackRaifer

Follow me on LinkedIn: jackraifer

Next Story: If you can visualize it, you can explain it…

About the Road to Data Science Series

Today, I am working on the first steps of remarkably interesting projects for human development based on Data Science and Machine Learning.

But not that long ago (really, not long at all) I knew extraordinarily little about data science and much less what it all meant (and I am still learning more and more about it every day). In my quest to reinvent myself from Psychologist working in Behavioral Economics to Data Scientist, I went through an incredibly interesting journey and learned a lot. This series is mostly a letter to my past self, to help anyone like me take this amazing road and, with luck, avoid some of the mistakes I made on the way due to lack of knowledge or perspective.

Hope you enjoy my ramblings as much as I found joy on my Road to Data Science.

Need Help on your Journey?

This can be a difficult path alone, so feel free to reach out to me through LinkedIn or Twitter. I started this series because of the #66DaysOfData initiative by Ken Jee. It is a great way to connect and get support, so just check out Ken on Twitter @KenJee_DS and join the #66DaysOfData challenge.

Learning Resources I have Used:

Udemy

A LOT of content, some free, most paid. Check out coupon sites, where you can usually find free coupons for courses on Python, R, data science, machine learning and much more.

Codecademy

Interesting place to learn. They have some free courses and then paid content. Very hands-on coding exercises, few videos, mostly reading.

Coursera

My favorite place to learn. Thousands of courses, a lot of content on programming, Data Science and Machine Learning. The University of Michigan has many courses here for Python programming, from the very basics to complex things. Many courses are free to audit, so you only pay if you want to earn a certificate.

freeCodeCamp.org

The top free place to learn to code. Hundreds of hours of free videos on almost any language. They now also offer certifications, and those are free as well.

YouTube

The place to learn anything. All of it is free, though it might take a while to find the content you want and enjoy.

Kaggle

Top site for data science, and they also run many competitions. They have many free courses, but the programming part is scarce: there are some basic ones, and all of them are focused on Data Science and Machine Learning.

DataQuest

Similar to Codecademy, with many paths and courses. Some free content, the rest is paid. Very focused on Data Science.

Codewars

My favorite place to practice code, with challenges for every level from beginner to advanced. This is a good place to challenge yourself and check your progress.