How to Fill Gaps in Large Stock Data Universes Using tidyr and dplyr – Part I

When you’re working with large universes of stock data you’ll come across a lot of challenges:

Stocks pay dividends and other distributions that have to be accounted for.
Stocks are subject to splits and other corporate actions which also have to be accounted for.
New stocks are listed all the time – you won’t have as much history for these stocks as for other stocks.
Stocks are delisted, and many datasets do not include the price history of delisted stocks.
Stocks can be suspended or halted for a period of time, leading to trading gaps.
Companies grow and shrink: the “top 100 stocks by market cap” in 1990 looks very different to the same group in 2020; “growth stocks” in 1990 look very different to “growth stocks” in 2020 etc.

The challenges are well understood, but dealing with them is not always straightforward.

One significant challenge is gaps in data.

Quant analysis gets very hard if you have missing or misaligned data.

If you’re working with a universe of 1,000 stocks life is a lot easier if you have an observation for each stock for each trading date, regardless of whether it actually traded that day. That way:

you can always do look-ups by date
any grouped aggregations or rolling window aggregations will operate on the same date range for every ticker
you can easily sanity-check that your data has trading_days * number_of_stocks rows.
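
The size check in the last point can be sketched in a few lines. This is a minimal illustration only: the `prices` tibble, its tickers and its `close` column are made up for the example, and it assumes every trading date appears for at least one ticker.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 3 dates and 3 tickers, but only 7 of the 9 possible rows
prices <- tibble(
  date   = c(1, 1, 2, 2, 2, 3, 3),
  ticker = c("AAA", "BBB", "AAA", "BBB", "CCC", "AAA", "CCC"),
  close  = c(10, 20, 11, 21, 30, 12, 31)
)

# A fully populated panel has one row per date per ticker
expected_rows <- n_distinct(prices$date) * n_distinct(prices$ticker)
actual_rows   <- nrow(prices)

# A non-zero difference means some date/ticker pairs are implicitly missing
missing_rows <- expected_rows - actual_rows
missing_rows
```

If `missing_rows` is greater than zero, the data has gaps that need to be made explicit before doing anything that relies on alignment.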

If you work with “wide” matrix-like data, gaps are obvious: you have one row for every date in your data set and one column for each ticker, so a missing observation shows up as an empty cell.

We usually work with long or “tidy” data, where each row is an observation for one stock on a given day. Here a missing observation is simply an absent row, which is much easier to overlook.
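
To make the contrast concrete, here is a small sketch that reshapes a long table into wide form with `tidyr::pivot_wider`. The `long_data` tibble and its tickers are hypothetical. In the long table the gaps are invisible because the rows simply do not exist, whereas in the wide table they appear as NA cells.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long data: CCC has no row for date 1, BBB has no row for date 3
long_data <- tibble(
  date    = c(1, 1, 2, 2, 2, 3, 3),
  ticker  = c("AAA", "BBB", "AAA", "BBB", "CCC", "AAA", "CCC"),
  returns = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07)
)

# One row per date, one column per ticker; absent pairs become NA
wide_data <- long_data %>%
  pivot_wider(names_from = ticker, values_from = returns)
wide_data
```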

How do we work productively with long data, whilst still ensuring that any gaps are filled in with NAs?

The tidyverse makes this very straightforward. Let me show you!

First, here’s some dummy data to illustrate the problem:

library(tidyverse)

# Dummy data: three dates and three tickers, with two date/ticker pairs absent
testdata <- tibble(date = c(1, 1, 2, 2, 2, 3, 3),
                   ticker = c('AMZN', 'FB', 'AMZN', 'FB', 'TSLA', 'AMZN', 'TSLA'),
                   returns = 1:7 / 100)
testdata

## # A tibble: 7 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     2 AMZN      0.03
## 4     2 FB        0.04
## 5     2 TSLA      0.05
## 6     3 AMZN      0.06
## 7     3 TSLA      0.07

TSLA is missing from date 1 as it only started trading after the others.
FB is missing from date 3 as it was put on trading halt after Citron Research hacked into Zuck’s memory banks.

Ideally we want a row for every date for every stock – with returns set to NA in the case where data is missing.

That way we can always look up a value by date, and we can always be sure that any grouped operations by ticker return the same size data set.

Turns out that the tidyr::complete function is exactly what we’re looking for. It turns implicit missing values – like the returns for TSLA on date 1 and FB on date 3 – into explicit missing values:

tidydata <- testdata %>%
  complete(date, ticker)
tidydata

## # A tibble: 9 x 3
##    date ticker returns
##   <dbl> <chr>    <dbl>
## 1     1 AMZN      0.01
## 2     1 FB        0.02
## 3     1 TSLA     NA   
## 4     2 AMZN      0.03
## 5     2 FB        0.04
## 6     2 TSLA      0.05
## 7     3 AMZN      0.06
## 8     3 FB       NA   
## 9     3 TSLA      0.07

Easy!

Now we have a row for every date for every stock.
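
One caveat: `complete(date, ticker)` can only expand dates that appear somewhere in the data. If a date is missing for every ticker, it stays missing. The sketch below, on a hypothetical `sparse` tibble with made-up tickers and a made-up `volume` column, uses `tidyr::full_seq` to build the full date sequence, and the `fill` argument to give a chosen default to selected columns. It assumes dates are evenly spaced numbers. With real calendar dates, `full_seq` would also generate weekends and holidays, so you would likely supply your own vector of trading dates instead.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: no ticker has a row for date 3
sparse <- tibble(
  date    = c(1, 1, 2, 4, 4),
  ticker  = c("AAA", "BBB", "AAA", "AAA", "BBB"),
  returns = c(0.01, 0.02, 0.03, 0.04, 0.05),
  volume  = c(100, 200, 150, 120, 210)
)

# Only the observed dates (1, 2, 4) are expanded: 3 dates x 2 tickers
by_observed <- sparse %>%
  complete(date, ticker)

# full_seq() adds the unobserved date 3: 4 dates x 2 tickers.
# returns stays NA for new rows; volume is filled with 0 instead.
by_full <- sparse %>%
  complete(date = full_seq(date, period = 1), ticker,
           fill = list(volume = 0))
by_full
```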

Now we can safely do grouped aggregations by ticker, on the understanding that the data is the same size for all tickers, and we’ve removed one large source of potential analysis mishap…

tidydata %>%
  group_by(ticker) %>%
  summarise(count = n())

## # A tibble: 3 x 2
##   ticker count
##   <chr>  <int>
## 1 AMZN       3
## 2 FB         3
## 3 TSLA       3
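
Row counts are only part of the story. Alignment matters most for lagged and rolling calculations, where "the previous row" is silently assumed to mean "the previous trading date". The following sketch uses a hypothetical `gappy` tibble in which ticker BBB has no row for date 2, and compares `dplyr::lag` before and after `complete`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: BBB has an interior gap on date 2
gappy <- tibble(
  date    = c(1, 2, 3, 1, 3),
  ticker  = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  returns = c(0.01, 0.02, 0.03, 0.04, 0.05)
)

# Without completing: BBB's "previous" return on date 3 actually comes from date 1
lag_gappy <- gappy %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(prev_return = lag(returns)) %>%
  ungroup()

# After completing: BBB's previous row on date 3 is date 2, whose return is NA
lag_complete <- gappy %>%
  complete(date, ticker) %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(prev_return = lag(returns)) %>%
  ungroup()

lag_complete
```

In the first version the stale value looks like a valid lag and nothing warns you about it. In the second the NA propagates, which makes the gap visible downstream.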

Visit the Robot Wealth website for additional insight on this topic and to download the complete set of scripts: https://robotwealth.com/how-to-fill-gaps-in-large-stock-data-universes-using-tidyr-and-dplyr/

Past performance is not indicative of future results.

Any stock, options or futures symbols displayed are for illustrative purposes only and are not intended to portray recommendations.

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from Robot Wealth and is being posted with its permission. The views expressed in this material are solely those of the author and/or Robot Wealth and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

Posted August 10, 2021
