Getting Started with String Distance Metrics

Sat, Aug 15, 2020 4-minute read

I took a little digression from Python and NLTK when I discovered the R package stringdist, which has a collection of functions for string distance calculation and approximate matching based on those metrics. This is one of the main reasons I started learning NLP in the first place, and stringdist, like a lot of R packages, has good documentation including a paper about the package and underlying metrics. There was no better reason for me to start diving in right away!

So, here I am sharing some information that I learned.

Context

If you are typing in a search box, the worst (working) implementation would search for an exact match of what you are typing: the tiniest typo would totally blow your search query. But search implementations should be “smarter” than that: accounting for some margin of error is essential whenever the user input is free text. Part of the solution is attributed to approximate (aka fuzzy) string matching, and this is where a package like stringdist would come in handy.

Side note: this is not really cutting edge. Algorithms in this package are a level-up from exact matching, but are still simpler than machine learning models that are used in actual search engines and might include other features to improve search results.

Types of distance metrics

According to the paper, there are 3 major types of algorithms for calculating distance. I will glance over them in this section.

Edit-based distance

Algorithms in this category calculate the distance between two strings based on the number of edits of one string that are required to match another string. These edits can be:

  • insertions ("fo" -> "for"),
  • deletions ("care" -> "car"),
  • substitutions ("fuzz" -> "buzz"), or
  • transpositions ("pat" -> "apt").

The algorithms differ in whether they allow only one, some, or all of these edits. For example, the longest common substring (LCS) method only allows insertions and deletions. They also differ in the range of possible values.

stringdist("foo",  "boo",   method = "lcs") # 2

q-gram distance

In the context of such algorithms, q-gram is simply a string of length q. So, a q-gram-based method takes its input strings and computes corresponding q-grams. For example, given q = 2, "fizz" would produce 3 q-grams ("fi", "iz", and "zz"). The calculations are then performed on these q-gram sets.

One q-gram-based method is the Jaccard method which calculates the proportion of uncommon q-grams. You could guess that it ranges from 0 to 1.

stringdist("leia", "leela", method = "jaccard", q = 2) # 0.8333

Heuristic distance

Two methods were mentioned in the paper: the Jaro distance and its extension, the Jaro-Winkler distance. These methods are actually developed to account for typing errors using the logic that typing errors result in character substitutions or transpositions between adjacent characters. The further the matched characters are from each other, the larger the penalty added to the distance metric.

They range from 0 to 1, where 0 is a perfect match.

stringdist("comonality", "commonality", method = "jw") # 0.0303

Application

The use of such methods is mainly a lookup function: input string(s) is searched into another set, the dictionary, for matches. A function like stringdist::amatch() calculates the distance metric of interests and picks up the best match. Also, it is straightforward to write a function that gets the best n matches along with the calculated metric if you wish.

It is worth noting that different metrics produce different results, and such difference might be huge. So, the choice should be consciously made based on the type of text being analyzed. Also, since they don’t have the same ranges of possible values, they are not directly comparable.


That sums up what I have learned about string distance and approximate matching from reading the stringdist package paper. From here I will probably get back to progressing in Python and NLTK to continue learning.