Working with High-Frequency Tick Data – Cleaning the Data


Tick data is the most granular high-frequency data available, and so is the most useful in market microstructure analysis. Unfortunately, tick data is also the most susceptible to data corruption and so must be cleaned and conditioned prior to being used for any type of analysis.  

This article, written by Ryan Maxwell, examines how to handle and identify corrupt tick data (for analysts unfamiliar with tick data, please try an intro to tick data first).

Causes of data-corruption?

Tick data is especially vulnerable to data-corruption due to the high-volume of data – a high-volume stock tick data set such for MSFT (Microsoft)  can easily amount to 100,000 ticks per day, making error detection very challenging. Typically it is signal interruptions or signal delays that cause either corrupted or out-of-sequence data.

Defining ‘Bad’ Data

Before generating data filters, we first need to designate what constitutes a bad tick. It is a common error to make the test too restrictive and therefore eliminate valid data merely because it is not consistent with the data points close by it (in fact these ticks are often the most useful in trading simulations as they provide information on the market direction or they are trading opportunities themselves).

Thus, there is a need to balance the tradeoff between data completeness and data integrity based on how sensitive the analysis is to bad data.

What tools to use for data checking/cleaning

Unfortunately, there are very few off-the-shelf tools for cleaning time-series data and Excel is not suitable due to its memory requirements (on most systems Excel cannot efficiently work with spreadsheets over 1 million rows which may only be several weeks of tick data). Tools such as OpenRefine (formerly GoogleRefine) are typically more suited to structured data such as customer data. 

Custom Python scripts are probably the most flexible and efficient method and are the most commonly used method in machine learning on time-series datasets. 

Types of corrupt data and tests

There are numerous types of bad ticks, and each type will require a different test: 

Zero or Negative Prices or Volumes
This is the simplest test – ticks with zero or negative prices or volumes are clearly errors and can be immediately discarded.

Simultaneous Observations 
Multiple ticks can often be observed for the same timestamp. Since ultra high-frequency models for the modelling tick data typically require a single observation for each timestamp, some form of aggregation needs to be performed. In the case of bid/ask tick data we would use the highest bid and lowest offer (provided the bid is still less than or equal to the offer) and aggregating the volumes for both the bid and the ask price.
Trade data is more problematic as it cannot be easily aggregated. We would normally favour aggregating the volumes and then using a single volume-weighted price. 

Bid/Ask Bounce
This is the phenomena of price appearing to ‘bounce’ around when in fact all that is happening is the bid/ask quote remaining the same and traders selling at the bid and buying at the offer giving the impression of price movement on the trade tick data. 

The bid/ask bounce is the major reason why many analysts only use the bid/ask tick sequence and ignore the trade tick data. However, if using the trade tick data is essential, one method of eliminating ‘bounces’ is to only accept trade ticks which move the price by more than the bid/ask spread from the prior tick (this is a major reason why it is necessary to have both bid/ask tick as well as trade tick data).

Visit Quantpedia to read the full article:

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties and not by Interactive Brokers does NOT constitute a recommendation by Interactive Brokers that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from Quantpedia and is being posted with permission from Quantpedia. The views expressed in this material are solely those of the author and/or Quantpedia and IBKR is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

In accordance with EU regulation: The statements in this document shall not be considered as an objective or independent explanation of the matters. Please note that this document (a) has not been prepared in accordance with legal requirements designed to promote the independence of investment research, and (b) is not subject to any prohibition on dealing ahead of the dissemination or publication of investment research.

Any trading symbols displayed are for illustrative purposes only and are not intended to portray recommendations.