How I Turned My Coffee Addiction into a Ruby and Postgres Adventure


Introduction

As a coffee lover and a bit of a developer, I decided to do some data manipulation with a Coffee Reviews dataset I found on Kaggle.

I created a Rails app and wanted to import the data from the CSV file into my Postgres database. There are plenty of gems available for this task, but I wanted to try using SQL.

What we want to do: take the reviews from the CSV and load them, via a staging table, into our coffees and roasters tables.

SQL is like a superhero tool with incredible power! And I really want to level up my skills, which is why I decided to do it in SQL.

And so my journey began: learning about PL/pgSQL, ETL, and rake tasks.

PL/pgSQL

Procedural languages (PL) come in different forms for different databases. In Postgres, one such procedural language is PL/pgSQL. Postgres also supports other procedural languages like PL/Ruby, PL/Python, and more.
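
To give you an idea of what it looks like, here is a minimal, made-up PL/pgSQL function, unrelated to our coffee data:

-- A tiny PL/pgSQL function, just to show the shape of the language
CREATE FUNCTION add_one(n integer) RETURNS integer AS $$
BEGIN
  RETURN n + 1;
END;
$$ LANGUAGE plpgsql;

SELECT add_one(41); -- => 42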

ETL: what is it?

ETL stands for Extract, Transform, and Load. It is a process that involves extracting data from various sources, transforming it into a format that can be easily analyzed, and loading it into a destination system. In this article, the data source is the CSV we found on Kaggle and the destination system is our Postgres database.

So let's get our hands dirty

Setup

First, create your Rails project:

rails new coffee_app --database=postgresql

Then add the sidekiq gem to your Gemfile:

gem 'sidekiq'

And then run bundle in your app directory:

bundle

Sidekiq needs Redis to work, so you will need to run it. My personal preference is to run it in a Docker container like this:

docker pull redis
docker run -p 6379:6379 --name redis-container redis
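
If you want to check that Redis is actually up before going further, you can ping it from inside the container:

docker exec redis-container redis-cli ping
# => PONG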

Then you can run Sidekiq:

bundle exec sidekiq
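
One piece of configuration worth making explicit: ActiveJob has to be told to use Sidekiq as its queue adapter, otherwise our job would run on Rails' default in-process adapter. Assuming the module name generated by rails new coffee_app, it looks like this:

# config/application.rb
module CoffeeApp
  class Application < Rails::Application
    # Route ActiveJob jobs through Sidekiq (backed by Redis)
    config.active_job.queue_adapter = :sidekiq
  end
end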

Now that we've got our battle gear on, let's dive into the coding trenches and show SQL who's boss!

Asynchronous ETL with Sidekiq

You can easily create a job with the Rails CLI like this:

rails g job CoffeeEtlJob
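
Before the job can do anything, the three tables it touches need to exist. The original schema isn't shown here, so the following migration is only a sketch, with column names inferred from the SQL below; adjust the types and columns to your actual CSV:

# db/migrate/XXXXXXXXXXXXXX_create_etl_tables.rb (hypothetical)
class CreateEtlTables < ActiveRecord::Migration[7.0]
  def change
    create_table :coffees do |t|
      t.string  :name
      t.decimal :price
      t.text    :description
      t.string  :origin
      t.timestamps
    end

    create_table :roasters do |t|
      t.string :name
      t.string :location
      t.timestamps
    end

    # Staging table the CSV is COPYied into; the triggers fan
    # rows out from here into coffees and roasters.
    create_table :import_coffee do |t|
      t.string  :name
      t.decimal :price
      t.text    :review
      t.string  :origin
      t.string  :roaster
      t.string  :loc_country
    end
  end
end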

Then add the following code to your job:

class CoffeeEtlJob < ApplicationJob
  queue_as :default

  def perform
    # Trigger function: for each row landing in import_coffee,
    # copy the coffee-related columns into the coffees table.
    query_insert_into_coffee = <<-SQL
      CREATE FUNCTION insertintocoffees() RETURNS TRIGGER AS $example_table_coffee$
      BEGIN
        INSERT INTO coffees (name,price,description,origin,created_at,updated_at) VALUES (
          NEW.name,
          NEW.price,
          NEW.review,
          NEW.origin,
          NOW(),
          NOW());
        RETURN NEW;
      END;
      $example_table_coffee$ LANGUAGE plpgsql;
    SQL

    # Same idea for the roaster-related columns.
    query_insert_into_roaster = <<-SQL
      CREATE FUNCTION insertintoroasters() RETURNS TRIGGER AS $example_table_roaster$
      BEGIN
        INSERT INTO roasters (name,location,created_at,updated_at) VALUES (
          NEW.roaster,
          NEW.loc_country,
          NOW(),
          NOW());
        RETURN NEW;
      END;
      $example_table_roaster$ LANGUAGE plpgsql;
    SQL

    # Fire both functions after every INSERT on the staging table.
    create_triggers = <<-SQL
      CREATE TRIGGER etl_coffees AFTER INSERT ON import_coffee FOR EACH ROW EXECUTE PROCEDURE insertintocoffees();
      CREATE TRIGGER etl_roasters AFTER INSERT ON import_coffee FOR EACH ROW EXECUTE PROCEDURE insertintoroasters();
    SQL

    # Bulk-load the CSV into the staging table. Listing the columns
    # (in the order they appear in the CSV) keeps the COPY independent
    # of any extra columns, like id, on import_coffee.
    copy_data = <<-SQL
      COPY import_coffee (name, price, review, origin, roaster, loc_country)
        FROM 'path/to/your/csv'
        DELIMITER ',' CSV HEADER
    SQL

    # One transaction: if the COPY fails, the freshly created
    # functions and triggers are rolled back with it.
    ActiveRecord::Base.transaction do
      ActiveRecord::Base.connection.execute(query_insert_into_coffee)
      ActiveRecord::Base.connection.execute(query_insert_into_roaster)
      ActiveRecord::Base.connection.execute(create_triggers)
      ActiveRecord::Base.connection.execute(copy_data)
    end
  end
end
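
One caveat: CREATE FUNCTION and CREATE TRIGGER will fail if the job runs a second time, because the objects already exist. A small cleanup step at the start of perform (my addition, not strictly needed for a one-shot import) makes the job re-runnable:

cleanup = <<-SQL
  DROP TRIGGER IF EXISTS etl_coffees ON import_coffee;
  DROP TRIGGER IF EXISTS etl_roasters ON import_coffee;
  DROP FUNCTION IF EXISTS insertintocoffees();
  DROP FUNCTION IF EXISTS insertintoroasters();
SQL
ActiveRecord::Base.connection.execute(cleanup)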

Break things up

The queries that create functions returning a trigger are the ones that insert the data into the destination tables. Like this one:

    CREATE FUNCTION insertintocoffees() RETURNS TRIGGER AS $example_table_coffee$
    BEGIN
      INSERT INTO coffees (name,price,description,origin,created_at,updated_at) VALUES (
        NEW.name,
        NEW.price,
        NEW.review,
        NEW.origin,
        NOW(),
        NOW());
      RETURN NEW;
    END;
    $example_table_coffee$ LANGUAGE plpgsql;

When the trigger fires in response to an event, this function inserts a row into the database; the values for the columns come from the row that fired the trigger, accessed through the NEW keyword.

But how do we get this executed? Through the trigger definition:

CREATE TRIGGER etl_coffees 
       AFTER INSERT ON import_coffee 
       FOR EACH ROW EXECUTE PROCEDURE insertintocoffees();

This will execute our insertintocoffees function for every row inserted. It only fires AFTER INSERT ON import_coffee, though. So to get the rows into import_coffee in the first place, we need this:

 COPY import_coffee (name, price, review, origin, roaster, loc_country)
     FROM 'path/to/your/csv'
     DELIMITER ',' CSV HEADER

This will copy the data from our CSV into the import_coffee table.
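
Note that COPY ... FROM 'path' runs on the Postgres server, so the file must live on (and be readable by) the database server, which usually also requires superuser rights. If the CSV lives on the app machine instead, one alternative (a sketch using the pg gem's COPY FROM STDIN support, not what this article does) is to stream it from Ruby:

# Stream the CSV through the connection, so the file only needs
# to exist where the Rails app runs, not on the database server.
conn = ActiveRecord::Base.connection.raw_connection
copy_sql = "COPY import_coffee (name, price, review, origin, roaster, loc_country) FROM STDIN CSV HEADER"
conn.copy_data(copy_sql) do
  File.foreach('path/to/your/csv') { |line| conn.put_copy_data(line) }
end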

Putting it all together

  1. We create functions that copy the data to the right place and transform it if necessary.

  2. We create triggers that execute our functions for each row inserted into the staging table.

  3. We launch our COPY, which fires our triggers for every row as it lands in the import_coffee table.

Rake Task

Rake is a build tool written in Ruby. It is often used in Ruby on Rails projects to automate repetitive tasks, such as running tests, migrating databases, or deploying code to a server. It is similar to the rails runner I've written about in another article.

Rake uses a domain-specific language (DSL) to define tasks and dependencies between tasks, and provides a command-line interface for executing those tasks.

In a Rakefile (which is a file that defines tasks for Rake), tasks are defined using the task method.
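
For example, here is a minimal, made-up Rakefile showing the DSL and a dependency between two tasks:

# Rakefile
task :hello do
  puts "Hello!"
end

# :greet depends on :hello, so `rake greet` runs :hello first
task greet: :hello do
  puts "Nice to meet you."
end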

You can generate a rake task with the Rails CLI:

rails g task etl populate

This will generate a task file called etl.rake in the lib/tasks directory of your app. Then you can add the following code to it:

namespace :etl do
  desc "Populate the database with data from the ETL"
  task populate: :environment do
    # perform_later enqueues the job so Sidekiq can pick it up
    CoffeeEtlJob.perform_later
  end
end

And you're good. Finally, you run your rake task like this:

rake etl:populate

When you run the rake etl:populate command, the CoffeeEtlJob will be enqueued and executed as soon as possible. However, make sure that your Sidekiq server is running.
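
If you just want to run the job synchronously while experimenting, without Sidekiq, ActiveJob also provides perform_now:

# Runs the job inline in the current process, handy from rails console
CoffeeEtlJob.perform_now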

Keep in mind that although it's easy to create an ETL process in Ruby using PL/pgSQL and Sidekiq, there are still important steps to take before it's production-ready. (Where are the tests? 🙈)

Conclusion

It is pretty easy to run an ETL in Ruby. SQL and Ruby are really powerful, even if they may not seem like a natural choice for data manipulation. At the end of the day, the most important thing is that you took pleasure along the way.

And I hope you have learned something new or used things that make you happy. Boring technologies or not, Ruby, Rails, and SQL are mighty and genuinely a pleasure to use, in my opinion.

Keep in Touch

On Twitter: [@yet_anotherDev](https://twitter.com/yet_anotherDev)
