How I Turned My Coffee Addiction into a Ruby and Postgres Adventure
Introduction
As a coffee lover and a bit of a developer, I decided to have some fun with data manipulation using a coffee reviews dataset I found on Kaggle.
I created a Rails app and wanted to import the data from that CSV file into my Postgres database. There are plenty of gems available for this task, but I wanted to try doing it in plain SQL.
What we want to do:
SQL is like a superhero tool with incredible power! I really want to level up my skills, which is why I decided to do this in SQL.
And so my journey began: learning about PL/pgSQL, ETL, and rake tasks.
PL/pgSQL
Procedural languages (PL) come in different forms for different databases. In Postgres, one such procedural language is PL/pgSQL, but Postgres also supports other procedural languages like PL/Ruby, PL/Python, and more.
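Before we use it for real, here is a taste of what a PL/pgSQL function looks like. add_tax is a made-up example, not part of the app:
-- A tiny PL/pgSQL function (made-up example): add 20% tax to a price
CREATE FUNCTION add_tax(price numeric) RETURNS numeric AS $$
BEGIN
  RETURN price * 1.2;
END;
$$ LANGUAGE plpgsql;

-- Usage:
SELECT add_tax(10);  -- => 12.0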
ETL: what is it?
ETL stands for Extract, Transform, and Load. It is a process that involves extracting data from various sources, transforming it into a format that can be easily analyzed, and loading it into a destination system. In this article, the data source is the CSV we found on Kaggle and the destination system is our Postgres database.
So let's get our hands dirty
Setup
First, I created my Rails project:
rails new coffee_app --database=postgresql
Then add the sidekiq gem to your Gemfile:
gem 'sidekiq'
And then run bundle in your app directory:
bundle
Sidekiq needs Redis to work, so you will need to run it. My personal preference is to run it in a Docker container, like this:
docker pull redis
docker run -p 6379:6379 --name redis-container redis
Then you can run Sidekiq:
bundle exec sidekiq
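One more piece of wiring that is easy to forget: since our job will be an ActiveJob, Rails has to be told to enqueue it through Sidekiq. Assuming the app module generated by rails new is CoffeeApp, the standard setting goes in config/application.rb:
# config/application.rb
module CoffeeApp
  class Application < Rails::Application
    # Tell ActiveJob to enqueue jobs through Sidekiq
    config.active_job.queue_adapter = :sidekiq
  end
end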
Now that we've got our battle gear on, let's dive into the coding trenches and show SQL who's boss!
Asynchronous ETL with Sidekiq
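Before writing the job itself, the staging and destination tables have to exist. The SQL below implies their shapes, so here is a sketch of a migration; the column types are my assumptions about the Kaggle CSV, and the staging columns must match your file exactly (in order), since COPY maps CSV columns to table columns by position:
# A sketch of the migration; column types and order are assumptions
class CreateEtlTables < ActiveRecord::Migration[7.0]
  def change
    # Staging table: no id column, so COPY can map the CSV columns positionally
    create_table :import_coffee, id: false do |t|
      t.string  :name
      t.decimal :price
      t.text    :review
      t.string  :origin
      t.string  :roaster
      t.string  :loc_country
    end

    create_table :coffees do |t|
      t.string  :name
      t.decimal :price
      t.text    :description
      t.string  :origin
      t.timestamps
    end

    create_table :roasters do |t|
      t.string :name
      t.string :location
      t.timestamps
    end
  end
end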
You can easily create a job with the Rails CLI, like this:
rails g job CoffeeEtlJob
Then add the following code to your job:
class CoffeeEtlJob < ApplicationJob
  queue_as :default

  def perform
    # Trigger function: for each row staged in import_coffee, copy it into
    # coffees (the CSV "review" column becomes our "description").
    query_insert_into_coffee = <<-SQL
      CREATE FUNCTION insertintocoffees() RETURNS TRIGGER AS $example_table_coffee$
      BEGIN
        INSERT INTO coffees (name, price, description, origin, created_at, updated_at) VALUES (
          NEW.name,
          NEW.price,
          NEW.review,
          NEW.origin,
          NOW(),
          NOW());
        RETURN NEW;
      END;
      $example_table_coffee$ LANGUAGE plpgsql;
    SQL

    # Same idea for roasters (the CSV "loc_country" column becomes "location").
    query_insert_into_roaster = <<-SQL
      CREATE FUNCTION insertintoroasters() RETURNS TRIGGER AS $example_table_roaster$
      BEGIN
        INSERT INTO roasters (name, location, created_at, updated_at) VALUES (
          NEW.roaster,
          NEW.loc_country,
          NOW(),
          NOW());
        RETURN NEW;
      END;
      $example_table_roaster$ LANGUAGE plpgsql;
    SQL

    # Fire both functions after every row inserted into import_coffee.
    create_triggers = <<-SQL
      CREATE TRIGGER etl_coffees AFTER INSERT ON import_coffee FOR EACH ROW EXECUTE PROCEDURE insertintocoffees();
      CREATE TRIGGER etl_roasters AFTER INSERT ON import_coffee FOR EACH ROW EXECUTE PROCEDURE insertintoroasters();
    SQL

    # Bulk-load the CSV into the staging table, firing the triggers row by row.
    copy_data = <<-SQL
      COPY import_coffee
      FROM 'path/to/your/csv'
      DELIMITER ',' CSV HEADER
    SQL

    ActiveRecord::Base.transaction do
      ActiveRecord::Base.connection.execute(query_insert_into_coffee)
      ActiveRecord::Base.connection.execute(query_insert_into_roaster)
      ActiveRecord::Base.connection.execute(create_triggers)
      ActiveRecord::Base.connection.execute(copy_data)
    end
  end
end
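One gotcha: CREATE FUNCTION and CREATE TRIGGER fail if the objects already exist, so running the job twice raises an error. If you want to be able to re-run it, drop them first, for example:
DROP TRIGGER IF EXISTS etl_coffees ON import_coffee;
DROP TRIGGER IF EXISTS etl_roasters ON import_coffee;
DROP FUNCTION IF EXISTS insertintocoffees();
DROP FUNCTION IF EXISTS insertintoroasters();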
Break things up
The queries that create the functions (each returning a trigger) are what actually insert the data into the database tables. Like this one:
CREATE FUNCTION insertintocoffees() RETURNS TRIGGER AS $example_table_coffee$
BEGIN
  INSERT INTO coffees (name, price, description, origin, created_at, updated_at) VALUES (
    NEW.name,
    NEW.price,
    NEW.review,
    NEW.origin,
    NOW(),
    NOW());
  RETURN NEW;
END;
$example_table_coffee$ LANGUAGE plpgsql;
When the trigger fires in response to an event, the function inserts a row into the destination table, and the values for the columns are read from the newly inserted row through the NEW keyword.
But how does this function get executed? Through the trigger definition, like this:
CREATE TRIGGER etl_coffees
AFTER INSERT ON import_coffee
FOR EACH ROW EXECUTE PROCEDURE insertintocoffees();
This will execute our insertintocoffees function for each row of our CSV file. Note, though, that it only fires after an INSERT ON import_coffee. To perform the import_coffee insertion, we need this:
COPY import_coffee
FROM 'path/to/your/csv'
DELIMITER ',' CSV HEADER
This will copy the data from our CSV into the import_coffee table.
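A caveat worth knowing: COPY ... FROM 'path' runs inside the Postgres server process, so the path must exist on the database server and be readable by it. If the CSV lives on the machine running Rails instead, one alternative is to stream it over the connection with COPY FROM STDIN. Here is a sketch using the pg gem's copy_data API; csv_path is a placeholder:
# Stream a local CSV into import_coffee via COPY FROM STDIN (sketch)
csv_path = "path/to/your/csv"  # placeholder
conn = ActiveRecord::Base.connection.raw_connection
conn.copy_data("COPY import_coffee FROM STDIN WITH (FORMAT csv, HEADER true)") do
  File.foreach(csv_path) { |line| conn.put_copy_data(line) }
end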
Putting it all together
1. We create functions that copy the data to the right place, transforming it where necessary.
2. We create triggers that execute those functions for each new row after it is inserted.
3. We launch our COPY, which fires the triggers for every row that lands in the import_coffee table.
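Once the job has run, a quick sanity check in psql (or rails dbconsole) shows whether the triggers did their work; the counts should line up:
-- Every staged row should have produced one coffee and one roaster
SELECT count(*) FROM import_coffee;
SELECT count(*) FROM coffees;
SELECT count(*) FROM roasters;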
Rake Task
Rake is a build tool written in Ruby. It is often used in Ruby on Rails projects to automate repetitive tasks, such as running tests, migrating databases, or deploying code to a server. It is similar to the rails runner I've spoken about in another article.
Rake uses a domain-specific language (DSL) to define tasks and dependencies between tasks, and provides a command-line interface for executing those tasks.
In a Rakefile (which is a file that defines tasks for Rake), tasks are defined using the task method.
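For instance, a minimal, made-up Rakefile could look like this (the task names greet and report are just illustrations):
# Rakefile — a minimal, made-up example of the Rake DSL
task :greet do
  puts "Hello!"
end

# report depends on greet, so Rake runs greet first
task report: :greet do
  puts "Report done."
end
Running rake report prints both messages, greet's first.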
You can generate a rake task with the Rails CLI:
rails g task etl populate
This will generate a task file called etl.rake in the lib/tasks directory of your app. Then you can add the following code to it:
namespace :etl do
  desc "Populate the database with data from the ETL"
  task populate: :environment do
    # perform_later enqueues the job through Sidekiq instead of running it inline
    CoffeeEtlJob.perform_later
  end
end
And that's it. Finally, you run your rake task like this:
rake etl:populate
When you run the rake etl:populate command, the CoffeeEtlJob will be enqueued and executed as soon as a worker is free. However, make sure that your Sidekiq server is running.
Keep in mind that although it's easy to create an ETL process in Ruby using PL/pgSQL and Sidekiq, there are still important steps to take before it's production-ready. (Where are the tests?)
Conclusion
It is pretty easy to run an ETL in Ruby. SQL and Ruby are really powerful; I know they may not seem like a natural choice for data manipulation. At the end of the day, the most important thing is that you take pleasure along the way.
And I hope you have learned something new or used things that make you happy. Boring technologies or not, Ruby, Rails, and SQL are mighty and genuinely a pleasure to use, in my opinion.
Keep in Touch
On Twitter: [@yet_anotherDev](https://twitter.com/yet_anotherDev)