How to Use Lexical Density of Company Filings

Posted September 15, 2021 at 12:17 pm

Daniela Hanicová, Quant Analyst,
Filip Kalus, IT Developer / QuantConnect code builder,
Radovan Vojtko, CEO & Head of Research



The invention of the steam engine in 1698 marks the beginning of the first industrial revolution. Since then, we have made significant progress, and it seems we are not slowing down. Some say that Artificial Intelligence (AI) marks the start of the most recent industrial revolution.

Artificial Intelligence has become a hot topic in recent years because of its variety of functions, including speech and language recognition. Natural language processing, or NLP for short, is the ability of a program to understand human language. You might ask how this is useful in the financial sphere. Numerous research papers (Banker et al., 2021 and Joenväärä et al., 2019) analyze the connection between investors’ vocabulary and the profitability of their strategies.

Specifically, the research by Joenväärä et al. (2019) inspired us to analyze various lexical metrics in 10-K & 10-Q reports. After adjusting for risk, the authors found that lexically diverse hedge funds outperform lexically homogeneous hedge funds. Furthermore, they explain that investors react correctly but not fully to the information on fund manager skill embedded in lexical diversity. Their results support the notion that linguistic skills are helpful for investment performance.

Moreover, alternative data is becoming a mainstream topic in investment management and algorithmic trading. For example, the textual analysis of 10-K & 10-Q filings can be used as a profitable part of investment portfolios (Padysak, 2020). All publicly traded companies have to file 10-K & 10-Q reports periodically. These reports contain relevant information about financial performance. Nowadays, there is a gradual shift from numerical to text-based information, making the reports harder to analyze (Cohen, 2010). Still, the 10-K & 10-Q reports rightfully receive great interest from academics, investors and analysts.


BRAIN is one of the companies that analyze the 10-K & 10-Q reports using NLP. The main objective of the Brain Language Metrics on Company Filings (BLMCF) dataset is to monitor numerous language metrics on 10-K and 10-Q company reports for 6000+ US stocks. The BLMCF dataset consists of two parts. The first part contains the language metrics of the most recent 10-K or 10-Q report for each firm, such as:

  1. Financial sentiment
  2. Percentage of words belonging to financial domain classified by language types:
     • “Constraining” language
     • “Interesting” language
     • “Litigious” language
     • “Uncertainty” language
  3. Readability score
  4. Lexical metrics such as lexical density and richness
  5. Text statistics such as the report length and the average sentence length

The second part includes the differences between the two most recent 10-K or 10-Q reports of the same period for each company.

This article focuses on the first part of the BLMCF dataset, specifically the lexical metrics: lexical richness, lexical density, and specific density.

Put simply, lexical richness says how many unique words the author uses. The idea is that the more varied the author's vocabulary, the more complex the text. Lexical richness is measured by the Type-Token Ratio (TTR), which is defined as the number of unique words divided by the total number of words. As a result, the higher the TTR, the higher the lexical complexity.
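The TTR definition above can be sketched in a few lines of Python. This is only an illustration: the lowercase, letters-only tokenizer is an assumption of this sketch, and the source does not describe how BRAIN tokenizes or normalizes filings.

```python
import re


def type_token_ratio(text: str) -> float:
    """Number of unique words divided by the total number of words."""
    # Assumption: words are runs of letters, compared case-insensitively.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "The company reported revenue and the company reported costs"
# 9 words in total, 6 of them unique.
print(round(type_token_ratio(sample), 4))  # 0.6667
```

Note that a plain TTR tends to fall as a text gets longer, so values for reports of very different lengths are not directly comparable without some normalization.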

Secondly, lexical density measures the structure and complexity of human communication in a text. A high lexical density indicates a large share of information-carrying words, and a low lexical density indicates relatively few information-carrying words. Lexical density is calculated as the number of so-called lexical tokens (nouns, adjectives, adverbs, and verbs except auxiliary verbs) divided by the total number of tokens.
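As a rough sketch of this calculation, the function below takes tokens that have already been part-of-speech tagged and counts the lexical ones. The Universal POS tag names (NOUN, VERB, ADJ, ADV, AUX, PUNCT) and the choice to treat adverbs as lexical and to ignore punctuation are assumptions of this example, not a description of BRAIN's pipeline. In practice the tags would come from an NLP tagger.

```python
from typing import List, Tuple

# Assumption: Universal POS tags, where auxiliary verbs are tagged AUX
# and are therefore excluded from the lexical set automatically.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens: List[Tuple[str, str]]) -> float:
    """Lexical tokens divided by all tokens (punctuation ignored)."""
    words = [(w, tag) for w, tag in tagged_tokens if tag != "PUNCT"]
    if not words:
        return 0.0
    lexical = sum(1 for _, tag in words if tag in LEXICAL_TAGS)
    return lexical / len(words)


# Hand-tagged toy sentence: "The company has reported strong revenue growth."
tagged = [
    ("The", "DET"), ("company", "NOUN"), ("has", "AUX"),
    ("reported", "VERB"), ("strong", "ADJ"), ("revenue", "NOUN"),
    ("growth", "NOUN"), (".", "PUNCT"),
]
# 5 lexical tokens out of 7 words.
print(round(lexical_density(tagged), 4))  # 0.7143
```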

Lastly, specific density measures how dense the report’s language is from a financial point of view. BRAIN uses a dictionary of financially relevant words as a reference. Specific density is then calculated as the number of dictionary words present in the report divided by the total number of words.
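The same ratio can be illustrated with a toy dictionary. The word list `FINANCE_WORDS` below is purely hypothetical and stands in for BRAIN's dictionary, whose contents are not given here; the tokenizer is likewise an assumption of this sketch.

```python
import re
from typing import FrozenSet

# Hypothetical stand-in for a dictionary of financially relevant words.
FINANCE_WORDS: FrozenSet[str] = frozenset(
    {"revenue", "debt", "earnings", "liquidity", "dividend"}
)


def specific_density(text: str, dictionary: FrozenSet[str] = FINANCE_WORDS) -> float:
    """Dictionary-word occurrences divided by the total number of words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


report = "Revenue grew while debt fell and earnings improved"
# 3 dictionary words out of 8 words.
print(specific_density(report))  # 0.375
```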


This article analyses how lexical richness, lexical density, specific density, and their combinations affect the strategy returns. We created two investment universes: the first contains the top 500 stocks by market capitalization from the NYSE, NASDAQ and AMEX exchanges, and the second contains the top 3000 stocks. The first investment universe is highly liquid and contains only large-cap stocks. The second investment universe is made of large-cap, mid-cap and small-cap stocks. Our process for building an investment factor portfolio is to sort the stocks into deciles (or quintiles) and create a long-short equity factor strategy (long top decile, short bottom decile). All the backtests are done on the QuantConnect platform, and the data is integrated into the platform itself. Additionally, it can be found here:

Suggested factor strategies are rebalanced on a monthly basis, and we use real historical bid-ask spreads (slippage). Trading costs (transaction fees) are omitted; however, they do not have a high impact on the resultant strategy, as the usual asset manager can achieve trading costs in the range of 1-2 bps per trade.
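The sorting step of the long-short construction described above can be sketched as follows. This is a minimal illustration, not the QuantConnect implementation used for the backtests: the function name, the equal weighting within each leg, and the toy scores are all assumptions of this example, and the function would be called once per monthly rebalance.

```python
from typing import Dict


def long_short_weights(scores: Dict[str, float], n_buckets: int = 10) -> Dict[str, float]:
    """Long the top bucket and short the bottom bucket of a score ranking."""
    ranked = sorted(scores, key=scores.get)  # ascending by metric value
    bucket_size = len(ranked) // n_buckets
    if bucket_size == 0:
        raise ValueError("need at least n_buckets stocks to form a bucket")
    shorts = ranked[:bucket_size]
    longs = ranked[-bucket_size:]
    # Assumption: equal weights, each leg summing to 100% of capital.
    weights = {symbol: 1.0 / bucket_size for symbol in longs}
    weights.update({symbol: -1.0 / bucket_size for symbol in shorts})
    return weights


# Toy universe of 20 hypothetical tickers with made-up metric values.
toy_scores = {f"S{i:02d}": i / 100 for i in range(20)}
weights = long_short_weights(toy_scores)
print(weights)  # {'S18': 0.5, 'S19': 0.5, 'S00': -0.5, 'S01': -0.5}
```

Passing `n_buckets=5` gives the quintile variant of the same sort.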

We expect lexical density and specific density to have the greatest effect on returns. This would mean that the more information-carrying words and the more finance-related words the report has, the better the company performs.

Visit Quantpedia to find out what the resultant factor strategy looks like:

Past performance is not indicative of future results.

Any stock, options or futures symbols displayed are for illustrative purposes only and are not intended to portray recommendations.

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from Quantpedia and is being posted with its permission. The views expressed in this material are solely those of the author and/or Quantpedia and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

Disclosure: Hedge Funds

Hedge Funds are highly speculative, and investors may lose their entire investment.

Disclosure: Margin Trading

Trading on margin is only for experienced investors with high risk tolerance. You may lose more than your initial investment. For additional information regarding margin loan rates, see
