Working with Tidy Financial Data in tidyr

Articles From: Robot Wealth
Website: Robot Wealth

Holding data in a tidy format works wonders for one’s productivity.

Here we will explore the tidyr package, which is all about creating tidy data.

In particular, let’s develop an understanding of the tidyr::pivot_longer and tidyr::pivot_wider functions for switching between different formats of tidy data.

What’s tidy data?

Tidy data is data where:

  • Every column is variable.
  • Every row is an observation.
  • Every cell is a single value.

Why do we care?

It turns out there are huge benefits to thinking about the “shape” of your data and the best way to structure and manipulate it for your problem.

Tidy data is a standard way of shaping data that facilitates analysis. In particular, tidy data works very well with the tidyverse tools. Which means less time spent transforming and cleaning data and more time spent solving problems. In short, structuring data to facilitate analysis is an enormous productivity hack.

Thinking in these terms has had a MASSIVE impact on the effectiveness and speed of our research. We’re going to cover this in some detail in the Armageddon bootcamp, along with some specific patterns for doing trading analysis.

Wide vs Long Data

Let’s take a look at some long-format data. This code loads a long dataframe of daily returns for various indexes and prints the observations from the beginning of March 2020:

if (!require(“pacman”)) install.packages(“pacman”)
pacman::p_load(tidyverse, here, knitr, kableExtra)

load(here(‘data’, ‘indexreturns.RData’))
dailyindex_df <- dailyindex_df %>%
filter(date >= ‘2020-03-01’)

dailyindex_df %>%
kable() %>%
kable_styling(full_width = FALSE, position = ‘center’) %>%
scroll_box(width = ‘800px’, height = ‘300px’)

Note: the below is a snapshot. Visit Robot Wealth website to see the full set of long data.

Long data is presented with one or more columns containing a key and another containing all the values.

In this example, the key, or so-called “unit of analysis: is date-ticker. That is, each value (in the returns column) is uniquely associated with a date-ticker joint key.

The joint key date-ticker would be the starting point of any analysis we’d want to do on this data set.

This is often easier to manage and process. However, if you’re used to looking at spreadsheets, it can be harder to understand intuitively. (I think that this difficulty evaporates fairly quickly once you start using the tools).

While structuring data as key-value pairs might seem odd if you’re not used to it, it does actually facilitate your conceptual clarity of the problem at hand.

For example, in the example above, it is clear that the unique identifier of each return is the date-ticker joint key.

With that clarity, it becomes much simpler to imagine the steps in an analysis workflow. You get quite productive and effective at this with a little practice.

Let’s compare this with the same data in wide format:

dailyindex_df %>%
pivot_wider(names_from = ticker, values_from = returns) %>%
kable() %>%
kable_styling(position = ‘center’) %>%
scroll_box(width = ‘800px’, height = ‘300px’)

Note: the below is a snapshot. Visit Robot Wealth website to see the full set of wide data.

This might look more familiar. Here we have a row for each date and a column for the return corresponding to each index. The unique values in the ticker column are actual columns in this wide format.

Data in this format is probably more amenable human consumption.

So which is better – wide or long format?

It depends!

You’ll find that storing your data in long format facilitates exploration and analysis, particularly if you use the tidyverse tools. We highly recommend that you do all your tidy analysis in long format unless you have a good reason not to.

Long format data is also easy to maintain – adding a new variable (a new ticker, say) is as simple as appending rows to the bottom of the existing data frame (and maybe sorting it by date, if you wanted to).

One use case that you see all the time is using ggplot to visualise a variable for more than one member of some organising category, for example, a time series plot of a bunch of different price curves where the organising category is ticker.

On the other hand, wide-format data might be a better choice if you intend for a human to consume the data.

You will also find certain functions and algorithms that expect data in this format, for example stats::cor.

Visit Robot Wealth website to read the full article and watch the instructional video:

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from Robot Wealth and is being posted with its permission. The views expressed in this material are solely those of the author and/or Robot Wealth and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.