Running R Code in Parallel

Posted January 8, 2024
Andrew Treadway
TheAutomatic.net

Background

Running R code in parallel can be very useful in speeding up performance. Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than iterating over a list one element at a time, or running a single process at a time. Thankfully, running R code in parallel is relatively simple using the parallel package. This package provides parallelized versions of sapply, lapply, and apply (parSapply, parLapply, and parApply, among others).

Parallelizing code works best when you need to call a function or perform an operation on different elements of a list or vector when doing so on any particular element of the list (or vector) has no impact on the evaluation of any other element. This could be running a large number of models across different elements of a list, scraping data from many webpages, or a host of other activities.
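
As a minimal sketch of this kind of independent, element-by-element work, the snippet below uses parLapply (the parallel counterpart of lapply) to summarize each vector in a list. The list datasets and the helper summarize_one are made up for illustration, and the cluster size of 2 is an arbitrary choice.

```r
library(parallel)

# hypothetical input: three independent numeric vectors
datasets <- list(a = 1:10, b = c(2, 4, 6, 8), c = c(5, 5, 5))

# hypothetical helper: each call depends only on its own input
summarize_one <- function(x) c(mean = mean(x), total = sum(x))

# start two worker processes, farm out the work, then shut them down
cl <- makeCluster(2)
summaries <- parLapply(cl, datasets, summarize_one)
stopCluster(cl)

print(summaries$a)
```

Because no element needs the result of any other, it does not matter which worker handles which vector.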

Testing for Primes in Parallel

In the example below, we’re going to use the parallel package to test whether each of 1 million integers is prime. If you were doing this without the parallel package, you might try to speed up the operation by using sapply (rather than a for loop). This is fine, but the drawback is that sapply can only test the numbers in the set one at a time. Using the parallelized version of sapply, called parSapply in the parallel package, we can test multiple numbers simultaneously for primality.

# load parallel package
require(parallel)
 
# define function to test whether a number is prime
is_prime <- function(num)
{
    # if input equals 2 or 3, then we know it's prime
    if(num == 2 || num == 3) 
      return(TRUE)
    # if input equals 1, then we know it's not prime
    if(num == 1) 
      return(FALSE)
   
    # else if num is greater than 2
    # and divisible by 2, then it's even and can't be prime
    if(num %% 2 == 0) 
      return(FALSE)
   
    # else use algorithm to figure out
    # what factors, if any, input has
     
    # get square root of num, rounded down
    root <- floor(sqrt(num))
     
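    # odd numbers below 9 that reach this point (5 and 7) are prime
    if(root < 3)
      return(TRUE)

    # try to divide each odd number from 3 up to root
    # into num; if any leave a remainder of zero,
    # then we know num is not prime
    for(elt in seq(3, root, by = 2))
    {
        if (num %% elt == 0)
          return(FALSE)
       
    }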
    # otherwise, num has no divisors except 1 and itself
    # thus, num must be prime
    return(TRUE)
   
}
 
# get random sample of 1 million integers from integers between 1 and 
# 10 million
# set seed so the random sample will be the same every time
set.seed(2)
sample_numbers <- sample(10000000, 1000000)
 
 
# do a couple checks of function
is_prime(17) # 17 is prime
 
is_prime(323) # 323 = 17 * 19; not prime
 
# create cluster object
cl <- makeCluster(3)
 
# test each number in sample_numbers for primality
results <- parSapply(cl , sample_numbers , is_prime)
 
# close cluster object
stopCluster(cl)

The main piece of the code above is this:

# create cluster object
cl <- makeCluster(3)
 
# test each number in sample_numbers for primality
results <- parSapply(cl , sample_numbers , is_prime)
 
# close cluster object
stopCluster(cl)

The makeCluster function creates a cluster of R engines to run code in parallel. In other words, calling makeCluster creates multiple instances of R. Passing the number 3 as input to this function means three separate instances of R will be created. If you’re running on Windows, you can see these instances by looking at running processes in the Task Manager.
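
If you would rather confirm this from inside R, one possible check is to ask each worker for its process ID with clusterCall and compare the answers with the ID of your own session. This is only a sketch; the variable name worker_pids and the cluster size of 2 are illustrative choices.

```r
library(parallel)

cl <- makeCluster(2)

# ask every worker for its operating-system process ID
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

stopCluster(cl)

print(Sys.getpid())   # the R session you are working in
print(worker_pids)    # the two worker sessions
```

The worker IDs should differ from each other and from the ID of the main session, since each one is a separate R process.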

After this cluster is created, we call parSapply, which works almost exactly like sapply, except that instead of looping over each element in the vector, sample_numbers, one at a time, it uses the cluster of R instances to test multiple numbers in the vector for primality simultaneously. As you’ll see a little bit later, this saves a nice chunk of time.

Once our operation is done, we close the cluster object using the stopCluster function. This is important to do each time you use the parallel package; otherwise you could end up with lots of R instances on your machine.
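
One way to make sure the cluster is always closed, even if the computation stops with an error, is to wrap the work in a function and register stopCluster with on.exit. The sketch below assumes a made-up wrapper called par_sapply_safely with an arbitrary default of two workers; it is not part of the parallel package.

```r
library(parallel)

# hypothetical wrapper: the name and the default of 2 workers are illustrative
par_sapply_safely <- function(x, fun, n_workers = 2)
{
    cl <- makeCluster(n_workers)
    # on.exit runs when the function exits, even after an error
    on.exit(stopCluster(cl))
    parSapply(cl, x, fun)
}

squares <- par_sapply_safely(1:5, function(n) n^2)
print(squares)
```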

How fast is running R code in parallel?

Alright, so let’s test how much time we can save by parallelizing our code. We’ll start by running the same is_prime function above on the same list of 1 million integers using regular sapply, so no parallelization. We will time the run by calling R’s built-in function, proc.time, before and after we run sapply; this gives us a time stamp at the start of the code run and at the end, so we can subtract these to see how long our code took to run.

start <- proc.time()
results <- sapply(sample_numbers , is_prime)
end <- proc.time()
 
print(end - start) # 125.34

So the code takes 125.34 seconds to run. Next, we time the same computation using parSapply on a cluster of two R instances.

start <- proc.time()
cl <- makeCluster(2)
results <- parSapply(cl , sample_numbers , is_prime)
stopCluster(cl)
 
end <- proc.time()
 
print(end - start) # 70.01

As you can see, using just two cores has cut the run time to 70.01 seconds! What if we use three cores, like in our initial example?

start <- proc.time()
cl <- makeCluster(3)
results <- parSapply(cl , sample_numbers , is_prime)
stopCluster(cl)
 
end <- proc.time()
 
print(end - start) # 47.81

With three cores, the run takes 47.81 seconds, well under half of the 125.34 seconds that regular sapply needed. The exact amount of time you’ll save using parallelization will vary depending upon what operations you’re performing, and on the processor speed of the machine you’re working on, but in general, parallelization can definitely increase efficiency in your code.

Creating a cluster of R processes, as well as merging together results from those instances, does take some amount of time. This means that parallelizing code over a small list or vector may not be worth it if the computation involved is not very intensive. However, in the case above, which involves a large vector of numbers, parallelization helps immensely.
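
To see that overhead for yourself, a rough sketch like the one below times a cheap operation on a short vector both ways. It uses base R’s system.time, which wraps the same proc.time before-and-after pattern used earlier. The vector small_input and the function add_one are made up for illustration, and no timings are shown because they depend on the machine.

```r
library(parallel)

# hypothetical small workload: a short vector and a trivial function
small_input <- 1:100
add_one <- function(n) n + 1

# sequential version
seq_time <- system.time(seq_result <- sapply(small_input, add_one))

# parallel version, including cluster start-up and shutdown
par_time <- system.time({
    cl <- makeCluster(2)
    par_result <- parSapply(cl, small_input, add_one)
    stopCluster(cl)
})

# check that both approaches return the same values
print(isTRUE(all.equal(seq_result, par_result)))

# compare wall-clock seconds
print(seq_time["elapsed"])
print(par_time["elapsed"])
```

On a workload this small, starting and stopping the cluster is likely to dominate the parallel timing.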

How many parallelized instances should we use?

Above, we tested using 2 and 3 cores, respectively. But why not some other amount? The number of R instances you should create is related to the number of cores on your machine. A good rule of thumb is to generally not exceed this number. There are exceptions, but often, creating more processes than cores will end up slowing down a computation rather than speeding it up, because the operating system then has to share the cores among competing processes. For a more detailed explanation, see this link.

To figure out how many cores your machine has, you can run the detectCores function from the parallel package:

detectCores()

You may also want to balance the number of cores you use with other computations or applications you have running on your machine simultaneously.
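
As a sketch of one common way to strike that balance, the snippet below leaves one core free for everything else on the machine. The "cores minus one" rule and the variable names n_cores and n_workers are assumptions for illustration, not a recommendation from the parallel package itself.

```r
library(parallel)

n_cores <- detectCores()

# detectCores() can return NA when the count cannot be determined,
# so fall back to a single core in that case
if (is.na(n_cores)) n_cores <- 1

# leave one core free for other work, but always use at least one worker
n_workers <- max(1, n_cores - 1)

print(n_workers)
```

You could then pass n_workers to makeCluster in place of a hard-coded number.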

That’s the end for this post. Have fun coding!

Originally posted on TheAutomatic.net blog.
