Close Navigation
Learn more about IBKR accounts
Vectorize Fuzzy Matching

Vectorize Fuzzy Matching

Posted December 21, 2023
Andrew Treadway
TheAutomatic.net

One of the best things about R is its ability to vectorize code. This allows you to run code much faster than you would if you were using a for or while loop. In this post, we’re going to show you how to use vectorization to speed up fuzzy matching. First, a little bit of background will be covered. If you’re familiar with vectorization and / or fuzzy matching, feel free to skip further down the post.

What is vectorization?

Vectorization works by performing operations on entire vectors, or by extension, matrices, rather than iterating through each element in a collection of objects one at a time. A basic example is adding two vectors together. This can be done like this:

a <- c(3, 4, 5)
b <- c(6, 7, 8)
 
sums <- a + b

In this example, sums now equals c(9, 11, 13). This is because the addition operator, “+”, was applied pairwise to each vector i.e. the first element of a was added to the first element of b, the second element of a was added to the second element of b, and so on. Vectorization is faster than traditional for loops because it uses parallel operations under the hood. For loops in R, on the other hand, are notoriously slow.

This example may seem simple, but vectorization can be used much more powerfully to speed up a process like fuzzy matching, the topic of this article.

What is fuzzy matching?

Fuzzy matching is the process of finding strings that follow similar patterns. For example, suppose you’re trying to join two data sets together on a city field. Data for these cities may be entered into a system manually, allowing for spelling or formatting differences i.e. “Mt. Hood” might be coded as “Mount Hood” etc. For a human, we can see those are clearly the same, but we need an algorithmic way to determine that so we don’t have to manually go through large numbers of possibilities.

The stringdist package

To perform fuzzy matching, we’re going to use a package called stringdist. This contains a function we need called stringsim which gives a measure of similarity between a pair of strings. This function allows several different algorithms to compare the similarity between two strings, and returns a value between 0 (very dissimilar) and 1 (very similar or equal). For instance:

# load stringdist package
require(stringdist)
 
# compare two sample strings
stringsim("this is a test", "this is the test", method = "jw")

Simplistic Approach

Alright, so to illustrate how much time vectorization saves, let’s do fuzzy matching in a less clever way. We’re going to use the zipcode package to get a sample list of cities across the US. These get stored in the vector, cities. Note, since this data set is actually at a zip code level, we’re going to have some duplicates in our vector, but that’s not a concern for this example because we’re trying to demonstrate performance differences between vectorization versus non-vectorization. Thus, our cities vector contains 44,336 elements.

Also, we create a vector below, called input, containing the misspelled or adjusted spellings of several cities in Florida (“Mt. Dora” vs. “Mount Dora”, “Sun City, FL” vs. “Sun City” etc.).

# load packages
require(stringdist)
require(zipcode)
data("zipcode")
 
cities <- zipcode$city
 
# misspell / change spellings from a few Florida cities
input <- c("Centry", "Boca R.", "Mt. Dora", "winterhaven", "Sun City, FL")

Now, one way of doing fuzzy matching here would be to loop through each city in our input vector, and then loop through each city in the cities vector and check the string similarity between each possible pair of strings i.e. check the similarity between the first element in input against every single element in cities. Since the length of input is 5 and the length of cities is 44,336, this requires 5 * 44,336 = 221,680 calls to stringsim.

# Naive approach
start <- proc.time()
results <- c()
for(city in input)
{
   
    best_dist <- -Inf
    for(check in cities)
    {
        dist <- stringsim(city, check, method = "jw")
        if(dist > best_dist)
        {
            closest_match <- check
            best_dist <- dist
           
        }
       
    }
     
    results <- append(results, closest_match)
   
}
end <- proc.time()

If we run end – start, we find that this process takes just under 26 seconds. However, let’s see what happens when we ditch the for loops, and use vectorization instead.

Vectorizing Fuzzy Matching

# define function to search for best string match in cities
get_best_match <- function(city, cities = cities)
{
     
    max_index <- which.max(stringsim(city, cities, method = "jw"))
     
    return(cities[max_index])
   
}
 
 
vector_start <- proc.time()
vector_results <- sapply(input, function(city) get_best_match(city, cities))
vector_end <- proc.time()

If we run vector_end – vector_start, we see this only takes 0.03 seconds! The main reason for the speed up can be seen in this line:

max_index <- which.max(stringsim(city, cities, method = "jw"))

The inner function, stringsim, gets the string similarity between the input city (e.g. “Mt. Dora”) versus every element in the cities vector. This is done in one line, rather than using a for or while loop construct. The output of this function call is a vector that shows the similarity measures (between 0 and 1) of the input city versus everything in cities.

We then apply the which.max function to get the index of the element in cities that is the closest match to the input city. Using this index, we can get the actual name of the city with the closest match, like below.

return(cities[max_index])

The other reason for the performance increase is using sapply to actually loop over the elements in our input vector, passing element in turn to the get_best_match function. Since sapply is written in C under the hood, it usually runs much faster than a traditional for loop in R.

That’s it for this post! Check out other posts of mine here: http://theautomatic.net/blog/.

Originally posted on TheAutomatic.net blog.

Join The Conversation

If you have a general question, it may already be covered in our FAQs. If you have an account-specific question or concern, please reach out to Client Services.

Leave a Reply

Your email address will not be published. Required fields are marked *

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from TheAutomatic.net and is being posted with its permission. The views expressed in this material are solely those of the author and/or TheAutomatic.net and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

IBKR Campus Newsletters

This website uses cookies to collect usage information in order to offer a better browsing experience. By browsing this site or by clicking on the "ACCEPT COOKIES" button you accept our Cookie Policy.