Python and Polars: An Introduction

Introduction

Sometimes you reach a point where you connect the dots between different things. I am currently digging into SQL, and for each new concept I learn, I try to illustrate it with a Ruby project. Here is the thing: I love Ruby, it is a cool scripting language. But I have to admit it is not the best language for analytical work.

Now I have two choices: either I continue with Ruby and accept its flaws, or I learn something new and improve my skills in another domain. I chose the second one: I chose to learn Python and Polars to continue my journey into SQL wizardry.

Wait, what? Polars is not for SQL. What am I talking about? How can I learn SQL using Polars and Python?

Here is my point: what matters are the concepts, and a lot of them (not all, I suppose) are common to both SQL and Polars. For example, relational algebra or window functions.
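We will keep things really simple in this article, but just to give you an idea of the overlap, here is roughly what a window function looks like in Polars, using the GitHub dataset we will load below. This is only a sketch, and it assumes the dataset has a stars_count column (adapt the column names to your own data):

import polars as pl

df = pl.read_csv("path/to/github_dataset.csv")

# For each repository, compare its stars to the average number of stars
# of all repositories written in the same language (one window per language).
windowed = df.select([
    pl.col("repositories"),
    pl.col("language"),
    pl.col("stars_count"),
    pl.col("stars_count").mean().over("language").alias("avg_stars_for_language"),
])

print(windowed)

In SQL, that last expression would be something like AVG(stars_count) OVER (PARTITION BY language).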

This article is an introduction to using Polars and Python as a way to learn SQL concepts.

Baby steps

That said, before diving into really complex subjects, we should first learn the ABCs of our API.

So as a first project, I wanted to do some basic analytics on GitHub data that I found on Kaggle. If you are interested, here is the dataset.

Like I said, let's take Polars gently and see what we can do. Let's discover our dataset by reading the CSV and displaying the first five repository names.

import polars as pl

df = pl.read_csv("path/to/github_dataset.csv")

first_five_repo = df.select(
  pl.col("repositories").head(5)
)

print(first_five_repo)

This is the equivalent in SQL, if we suppose that our CSV is, let's say, a table named git_repos:

SELECT repositories FROM git_repos LIMIT 5

And it gives us this result:

shape: (5, 1)
┌───────────────────────────┐
│ repositories              │
│ ---                       │
│ str                       │
╞═══════════════════════════╡
│ octocat/Hello-World       │
│ EddieHubCommunity/support │
│ ethereum/aleth            │
│ localstack/localstack     │
│ education/classroom       │
└───────────────────────────┘
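As an aside, recent versions of Polars also ship a small SQL interface, so we can even run (almost) that exact query directly against our DataFrame and compare. A minimal sketch, assuming a reasonably recent Polars version:

# Register the DataFrame under a table name and query it with SQL.
ctx = pl.SQLContext(git_repos=df)

sql_result = ctx.execute(
    "SELECT repositories FROM git_repos LIMIT 5",
    eager=True,  # return a DataFrame instead of a LazyFrame
)

print(sql_result)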

It is a good beginning, so let's dive a bit deeper into Polars and do a bit more analytics and reporting.

A little more advanced stuff

Let's see what the top 5 languages in our CSV file are; I think you are going to be really surprised:

top_5 = (
    df.group_by("language")
    .agg(pl.len().alias("count"))  # number of repositories per language
    .sort("count", descending=True)
    .head(5)
)

print(top_5)

The equivalent in PostgreSQL would be something like this:

SELECT language, COUNT(*) AS count FROM git_repos GROUP BY language ORDER BY count DESC LIMIT 5;

And we end up with this result:

shape: (5, 2)
┌────────────┬───────┐
│ language   ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ JavaScript ┆ 253   │
│ Python     ┆ 155   │
│ NULL       ┆ 145   │
│ HTML       ┆ 72    │
│ Java       ┆ 44    │
└────────────┴───────┘

Yes, it seems that NULL is the 3rd favorite language of people around the world 😅.
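Out of curiosity, before filtering, we can check how those missing languages are actually stored in the CSV: as real null values or as the literal string "NULL". A quick sketch, reusing the df we already loaded:

# How many rows have no language at all (real nulls)?
print(df.select(pl.col("language").is_null().sum()))

# How many rows contain the literal string "NULL"?
print(df.select((pl.col("language") == "NULL").sum()))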

Let's just filter out the null values to get a better idea of the language distribution:

filtered_language = df.filter(pl.col("language") != "NULL")

top_5 = (
    filtered_language.group_by("language")
    .agg(pl.len().alias("count"))
    .sort("count", descending=True)
    .head(5)
)

print(top_5)

This is, I would say, pretty much (if not exactly) equivalent to one of the following queries.

SELECT language, COUNT(*) AS count FROM git_repos WHERE language IS NOT NULL GROUP BY language ORDER BY count DESC LIMIT 5;

-- OR, a bit closer to what the Polars code does, with a CTE:

WITH filtered_language AS (SELECT * FROM git_repos WHERE language IS NOT NULL)
SELECT language, COUNT(*) AS count FROM filtered_language
GROUP BY language ORDER BY count DESC LIMIT 5;

What is cool about using Python and Polars for all of this is that you get some SQL concepts, or something close to them, for free. For example, here we have declared a filtered_language value which contains all the columns of our DataFrame, but without any NULL language.

It is close to Common Table Expressions (CTEs) in SQL, as seen in the second way to write this query.
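And just like a CTE, the filtered_language DataFrame can be reused in several follow-up queries. Here is a quick sketch, assuming the dataset also has a stars_count column (adjust to the columns you actually have):

# Reuse the same "CTE-like" intermediate DataFrame for another report:
# the languages with the highest average number of stars.
avg_stars = (
    filtered_language.group_by("language")
    .agg(pl.col("stars_count").mean().alias("avg_stars"))
    .sort("avg_stars", descending=True)
    .head(5)
)

print(avg_stars)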

Conclusion

I have to say that I am pleased by the flexibility of Polars, which is of course close to Pandas. It adds a layer of abstraction on top of our relational algebra. I am not throwing Ruby away, but it's true that when you want to do analytics on data, Ruby is less natural to use.

Nevertheless, Python and Polars (or Pandas, for that matter) will never replace SQL, since they do not serve the same purpose, and improving my SQL skills, along with Ruby, is still the way to go for me. I will continue my quest into SQL and data, trying to improve my skills and share all of this with you.

Keep in Touch

On Twitter: @yet_anotherDev