Simple: most of the existing structured data in the world right now is stored in Relational Databases, you know, the stuff we know as tables that relate to each other, and most of those tables are designed, created and run on Structured Query Language, or SQL (also known as “seequel” for those of us who don’t like acronyms or spelling).
I know, shocking. After so many times converting .csv files into dataframes for learning, suddenly we find out the world is not filled with .csv files ripe for use (although many companies are drowning in Excel files). But it is the way it is. Most of the time, to get the datasets you need, you will have to query some kind of SQL system, and there are plenty of them: MySQL, PostgreSQL, SQLite, Oracle, and many more. In short, learning SQL is not really optional. You don’t need to become a high-end expert, but you should learn enough to query the data you need, join tables the right way and, if possible, integrate your SQL queries into your python code. There are many great ways to do this, especially if you work in Jupyter Notebooks.
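As a rough illustration of running a SQL join from python, here is a minimal sketch using the standard library's sqlite3 module. The `customers` and `orders` tables and their rows are invented for this example; a real project would connect to whatever SQL system holds the data.

```python
import sqlite3

# In-memory SQLite database; the tables and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# Join the two tables and total the order amounts per customer.
query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

If you prefer dataframes, pandas offers `read_sql_query`, which takes the same query string and connection and returns the result as a dataframe.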
There it is, the shortest article in history on why you should learn SQL to become a great Data Scientist.
Yet the Shakespearean question from the title was not “To SQL or not to SQL”; the answer to that one is simple: DO SQL. The actual question was more of a statement: to SQL AND NoSQL (see, I even wrote the AND in caps, the common way to write SQL keywords). Sorry, I digress. Apart from learning SQL, you should also start learning some NoSQL.
NoSQL refers to any and all types of databases that do not rely on Structured Query Language (SQL), mostly because they deal with data structures other than relational tables. Depending on the work you are doing or will be doing as a Data Scientist, you will often find yourself working with many kinds of data that are not tables, and sometimes they are not even structured at all. Imagine, for example, audio, video and image files. They may have some tabular structure around them, like labels, timestamps and similar data, but the actual files, the audio waves, the image pixels and the video frames, are not structured.
Also, there are many very interesting ways to save and manipulate data that do not require the same old boring table structure (and I am a big fan of tables). Here are a few examples:
Key Value Pairs
Remember the python dictionary structure? Did you pay attention to the .json file format? Well, these are both key-value pair structures, and data like this can be managed in a Key-Value Database such as Apache Ignite, DynamoDB and others. Because of how they are structured, they can run faster and more efficiently than a Relational Database, depending on how they are used. If you like working with .json files or dictionaries, this might be a great type of NoSQL for you.
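To make the key-value idea concrete, here is a minimal sketch that uses a plain python dictionary as a stand-in for a key-value store, with values serialized as JSON. The `put` and `get` helpers and the `user:42` style keys are hypothetical names chosen for this example, not the API of any particular database.

```python
import json

# A python dict standing in for a key-value store (keys and values are invented).
store = {}

def put(key, value):
    """Serialize the value to JSON and save it under the key."""
    store[key] = json.dumps(value)

def get(key, default=None):
    """Look the key up directly; no table scan or join is involved."""
    raw = store.get(key)
    return default if raw is None else json.loads(raw)

put("user:42", {"name": "Ana", "languages": ["python", "sql"]})
put("user:43", {"name": "Ben"})

user = get("user:42")
missing = get("user:99", default={})

print(user["name"], missing)
```

The point of the sketch is the access pattern: everything is read and written by key, which is what lets real key-value systems stay fast for this kind of lookup.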
Document-Oriented Databases
These are great systems for managing semi-structured data, like articles, videos, audios, etc. They can store objects together with metadata that helps us understand the type and qualities of the object. The HUGE advantage of these types of databases is that you do not need to define a structure in advance. Each object, with its metadata, is stored in a single instance and can be completely different from every other object stored. This allows for flexibility and ease of use (as long as the metadata is well understood to help retrieve relevant objects). Some examples of Document-Oriented Databases are CouchDB and MongoDB.
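The idea of objects that each carry their own metadata can be sketched with a list of python dictionaries. This is only an illustration: the documents, their fields and the `find` helper are invented here, and real document databases have their own query languages.

```python
# Each document carries its own metadata; no two need the same fields.
documents = [
    {"type": "article", "title": "Why SQL", "tags": ["sql"], "words": 900},
    {"type": "video", "title": "Intro to graphs", "duration_s": 310},
    {"type": "audio", "title": "Data podcast", "tags": ["sql", "nosql"]},
]

def find(docs, **criteria):
    """Return the documents whose metadata matches every given field.

    A list-valued field matches when it contains the wanted value.
    """
    results = []
    for doc in docs:
        matches = True
        for field, wanted in criteria.items():
            value = doc.get(field)
            if isinstance(value, list):
                matches = wanted in value
            else:
                matches = value == wanted
            if not matches:
                break
        if matches:
            results.append(doc)
    return results

videos = find(documents, type="video")
sql_titles = [doc["title"] for doc in find(documents, tags="sql")]

print(videos, sql_titles)
```

Notice that the video has no `tags` field and the audio has no `words` field, and nothing breaks. Retrieval only works as well as the metadata is understood.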
Wide Column Databases
And now come the Wide Column Stores. They use tables, with rows and columns and, wait, what??? How is this different from, you know, a Relational Database table? Hold on, I am not finished. Yes, it may look like the same thing, but it is not. Wide Column Stores allow the names and format of the columns to change from row to row within the same table. I know, it sounds complicated, and I only have a few lines to talk about this (plus I am nowhere near an expert, or even adept at using them), but they are interesting for very large amounts of data coming in at high velocity. If you want to learn more, check out Cassandra and Bigtable, two great examples of Wide Column Databases.
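A toy way to picture rows whose columns differ within the same table is a dictionary of dictionaries, one inner dictionary per row. This is a simplified sketch with invented sensor data and a hypothetical `write` helper; it says nothing about how Cassandra or Bigtable store data internally.

```python
from collections import defaultdict

# row key -> {column name -> value}; each row keeps only the columns it has.
table = defaultdict(dict)

def write(row_key, **columns):
    """Add or update columns for one row without touching any other row."""
    table[row_key].update(columns)

write("sensor-1", temp_c=21.5, humidity=0.4)
write("sensor-2", temp_c=19.0, battery_pct=88)
write("sensor-1", firmware="1.2")  # a new column appears for this row only

columns_by_row = {key: sorted(cols) for key, cols in table.items()}

print(columns_by_row)
```

In a relational table, adding `firmware` would mean altering the table for every row. Here only `sensor-1` knows about that column.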
Graph Databases
The last NoSQL system I want to mention is the Graph Database. This one I like because it helps reshape and break some boundaries about how to manage and use data. Imagine using graphs, you know, that data structure with nodes and edges, as the actual inputs to your Data Science projects and Machine Learning models. You could, for example, understand how social network connections work, predict who will become whose friend on Facebook and even understand relationship patterns between people within a company. The Graph Database lets you do just that. Some great examples of Graph Databases are Neo4j and Oracle Spatial and Graph.
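As a small taste of the friend prediction idea, here is a sketch that stores a social graph as a python dictionary of sets and ranks possible new friends by the number of mutual friends. The people, the friendships and the `suggest_friends` helper are all invented, and counting mutual friends is just one simple heuristic, not how any particular social network does it.

```python
from collections import Counter

# Undirected friendship graph: person (node) -> set of friends (edges).
friends = {
    "Ana": {"Ben", "Cleo"},
    "Ben": {"Ana", "Cleo", "Dev"},
    "Cleo": {"Ana", "Ben", "Dev"},
    "Dev": {"Ben", "Cleo", "Eli"},
    "Eli": {"Dev"},
}

def suggest_friends(graph, person):
    """Rank people who are not yet friends by how many friends they share."""
    mutual = Counter()
    for friend in graph[person]:
        for candidate in graph[friend]:
            if candidate != person and candidate not in graph[person]:
                mutual[candidate] += 1
    # Most mutual friends first, ties broken alphabetically.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))

suggestions = suggest_friends(friends, "Ana")

print(suggestions)
```

Walking from a node to its neighbors and then to their neighbors is exactly the kind of question a Graph Database is built to answer directly.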
Now we are done. Learning SQL is vital to anyone working to become a Data Scientist, but there is a LOT of life beyond SQL, and new NoSQL systems are becoming more and more relevant. Learn your Structured Query Language and learn the NoSQL systems you like and might want to work with in the future. It will be a great experience and make you a better Data Scientist.
Hope we cross paths through our Journeys…
Jack Raifer Baruch
Next Story: The Probability of Learning Probability
About the Road to Data Science Series
Today, I am working on the first steps of remarkably interesting projects for human development based on Data Science and Machine Learning.
But not that long ago (really, not long at all) I knew extraordinarily little about data science and much less what it all meant (and I am still learning more and more about it every day). In my quest to reinvent myself from Psychologist working in Behavioral Economics to Data Scientist, I went through an incredibly interesting journey and learned a lot. This series is mostly a letter to my past self, to help anyone like me take this amazing road and, with luck, avoid some of the mistakes I made on the way due to lack of knowledge or perspective.
Hope you enjoy my ramblings as much as I found joy on my Road to Data Science.
Need Help on your Journey?
Learning Resources I have Used:
A LOT of content, some free, most paid. Check out coupon sites where you can usually find free coupons for courses on python, R, data science, machine learning and much more.
Interesting place to learn, they have some free courses and then paid content. Very hands on coding exercises, few videos, mostly reading.
My favorite place to learn. Thousands of courses, a lot of content on programming, Data Science and Machine Learning. The University of Michigan has many courses here for python programming from the very basics to complex things. All courses are free to audit, you only pay if you want to earn a certificate.
The top free place to learn to code. Hundreds of hours of free videos on almost any language. They now also have certifications, also for free.
The place to learn anything. All of it is free, it might take a while to get to the content you want and enjoy.
Top site for data science; they also run many competitions. They have many free courses, but the programming part is scarce: a few basic ones, all focused on Data Science and Machine Learning.
Similar to Codecademy, with many paths and courses. Some free content, the rest is paid. Very focused on Data Science.
My favorite place to practice code, challenges for every level from beginners to advanced. This is a good place to challenge yourself and check your progress.