Reversing DidYouMean Gem
Introduction
Let's dive into ruby! There is a really useful gem that we do not notice but is kind of the savior for us developers.
There is no suspense by the title it is obvious I talk about the DidYouMean gem which is now shipped into ruby directly.
So I wanted to dig into this Gem since sometimes it can feel a little bit magic. But any enough advanced technologies can feel like magic right?
Which Spell?
So first I was like how can ruby suggests to me some methods or attribute that I maybe mistypes else? I was like did he just guess it, is there a complex AI behind this?
No not at all just good ol' ruby code. But at least how can he guess which methods are in my object or things like that? Classes and object can be introspected you can get a look inside them and see what are inside them at any time.
For example, if you want to know the methods in the Object class you can do this in IRB :
irb(main):001:0> Object.methods
It will give all the methods that the Object class has. In the same way, if you have an object you call methods on it you will have the methods of your object.
Okay, cool now what, how can ruby guess the right methods if I made a typo or even just write methods instead of methods for example?
Black Magic aka Maths
Promise it won't be long and I will not dig into the algorithm it is not the purpose of this article. But I need to talk about it. In mathematics, we have some function
which are called distances. They do what they look to do. They compute distances between "objects".
To compute the distance between 2 "objects" here 2 strings for our DidYouMean Gem there are 2 distances that are used :
JaroWinkler' distance
Levenstein' distance
There are several ways to implement these algorithms but again I will dig into that. Here is a simpler version of the code of the SpellChecker of the DidYouMean gem, I will highlight the most important part and try to decompose it.
That said there is not a lot of simplification done, I have just removed the normalization but it is still understandable.
module DidYouMean
class SpellChecker
def initialize(dictionary:)
@dictionary = dictionary
end
def correct(input)
threshold = input.length > 3 ? 0.834 : 0.77
words = @dictionary.select { |word| JaroWinkler.distance(word, input) >= threshold }
words.reject! { |word| input.to_s == word.to_s }
words.sort_by! { |word| JaroWinkler.distance(word.to_s, normalized_input) }
words.reverse!
# Correct mistypes
threshold = (input.length * 0.25).ceil
corrections = words.select { |c| Levenshtein.distance(normalize(c), normalized_input) <= threshold }
# Correct misspells
if corrections.empty?
corrections = words.select do |word|
length = input.length < word.length ? normalized_input.length : word.length
Levenshtein.distance(word, normalized_input) < length
end.first(1)
end
corrections
end
end
end
Dictionary
First, we have this part :
def initialize(dictionary:)
@dictionary = dictionary
end
We define a dictionary for our spellchecker it is the list of strings that it will search against to 'guess' a correction.
Here the first part of our article take a lot of sense, for example, our dictionary could be the list of methods of our object, we have with a.methods
.
Filtering words from the dictionary
Then we have the correct
methods that I am going to dive into.
We have this first part for the correct methods :
threshold = input.length > 3 ? 0.834 : 0.77
words = @dictionary.select { |word| JaroWinkler.distance(word, input) >= threshold }
words.reject! { |word| input.to_s == word.to_s }
words.sort_by! { |word| JaroWinkler.distance(word.to_s, normalized_input) }
words.reverse!
We have defined a threshold that depends on the length of the input we want to correct. I will come back to these thresholds a bit later and explain why they are here.
Then we have the filtering of the potential corrections of our word. We do that by computing the JaroWinkler distance of the input in the word in the dictionary.
So now why do we need these thresholds? When we compute the distance it will give us a score of 'proximity' of the compared strings. But from what I have seen in this commit with an input of size less than 3 the JaroWinkler distance does behave badly.
Picking the right correction
Once we have found the closest word that could be a good recommendation we have to suggest the best one.
It is the role of this part :
threshold = (input.length * 0.25).ceil
corrections = words.select { |c| Levenshtein.distance(normalize(c), normalized_input) <= threshold }
We use a different threshold since the Levenstein distance is not normalized. And then we put pick the right one with the Levenstein distance.
Maybe you would be like me and ask yourself why we need both distances.
An assumption could be that JaroWinkler is fastest than Leveinstein but less precise. So we wanted to use JaroWinkler to pre-filter the nearest words and then between these words take the best suggestions.
Of course, it is only an assumption and if you have the answer or a better assumption please do not hesitate to share it with us.
Avada Kedavra
For this part, it is only my assumption too but I think it is not much of a guess. The previous computation can eventually give us nothing because the threshold for the Levenstein distance was too small and nothing is as close as the threshold needs.
So we try again with this piece of code :
if corrections.empty?
corrections = words.select do |word|
length = input.length < word.length ?
normalized_input.length : word.length
Levenshtein.distance(word, normalized_input) < length
end.first(1)
end
We have defined a new threshold so we are able to find a suggestion at the end of the day.
Conclusion
This was the first article of I hope a cool series about a deep dive into the Ruby Language.
If you find any things wrong or want to add something please leave a comment.
I will be really happy to read your comments and improve the article if needed.
If you want more explanation about JaroWinkler and Leveinstein distance here are two articles.
Thanks for reading me and see you around!
Keep in Touch
On Twitter : @yet_anotherDev