Bag of Words: Approach, Python Code, Limitations

In this blog, we will study the Bag of Words method for creating vectorized representations of text data. These representations can then be used to perform Natural Language Processing tasks such as Sentiment Analysis. We’ll cover the relevant terms, look at the method’s limitations, and highlight its advantages. The topics covered are:

Bag of Words Approach
Limitations of Bag of Words
Bag of Words vs Word2Vec
Advantages of Bag of Words

Bag of Words is a simple feature extraction method for text data that is easy to implement. It involves maintaining a vocabulary and counting how often each word occurs, while ignoring aspects of natural language such as grammar and word order.
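To make the point about ignoring word order concrete, here is a minimal sketch using Python's collections.Counter. The two sentences are made up for illustration: they contain the same words in a different order, so their word counts are identical even though their meanings differ.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# Counter maps each token to the number of times it occurs
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)           # word counts for the first sentence
print(bag_a == bag_b)  # True: the counts match, so word order is lost
```

A model that only sees these counts cannot tell the two sentences apart, which is the trade-off this representation makes for simplicity.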

Bag of Words Approach

The Bag of Words approach takes a document as input and breaks it into words. These words are known as tokens, and the process is called tokenization.
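As a rough illustration of tokenization, the sketch below compares a plain whitespace split with a simple regular-expression tokenizer. The sample sentence and the helper name simple_tokenize are made up for this example, and the pattern is only one of many reasonable choices.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: return runs of letters, digits, or apostrophes as tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)

sample = "Bag of Words is simple, isn't it?"

whitespace_tokens = sample.split()
regex_tokens = simple_tokenize(sample)

print(whitespace_tokens)  # punctuation stays attached, e.g. 'simple,' and 'it?'
print(regex_tokens)       # punctuation is dropped, e.g. 'simple' and 'it'
```

How tokens are split matters because 'simple,' and 'simple' would otherwise be counted as two different words.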

The unique tokens collected from all processed documents form an ordered vocabulary. Finally, a vector whose length equals the size of the vocabulary is created for each document, with each value giving the frequency of the corresponding token in that document.

Note that we ignore the order in which these words appear in the document. Hence the name ‘Bag of Words’, which signifies an unordered collection of items in a bag. We can easily implement this approach in Python. Below is an example demonstrating the same.

# corpus is a collection of documents, here sentences
corpus = ['This is the first sentence in our corpus followed by one more sentence to demonstrate Bag of words',
          'This is the second sentence in our corpus with a FEW UPPER CASE WORDS and Few Title Case Words']

vocab = []  # empty list for vocabulary
total_words = 0  # to count total words in corpus

for doc in corpus:  # iterating through documents in corpus
    token_temp = doc.split()  # create tokens
    total_words = total_words + len(token_temp)
    for token in token_temp:
        if token not in vocab:  # to check if word is already in vocab
            vocab.append(token)

vocab.sort()

print(vocab)  # Print all the words in vocabulary
print('There are {} words in vocabulary.'.format(len(vocab)))
print('A total of {} words is used in documents.'.format(total_words))

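Note the difference between the total number of words and the size of the vocabulary: the corpus contains 37 words in total but only 29 unique tokens. We’ll now calculate the frequency of each word in each document, store the counts in a dictionary, and use them to build one vector per document.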

from collections import Counter

bow_vec = []  # list to store bag of words vectors

for doc in corpus:
    freq = Counter(doc.split())  # dictionary of token -> frequency for this doc
    doc_vec = []  # empty list for each doc

    for word in vocab:  # iterate over vocab
        # Counter returns the frequency if present, else zero
        doc_vec.append(freq[word])
    bow_vec.append(doc_vec)

import pandas as pd
pd.set_option("display.max_columns", None)  # show every vocabulary column
df = pd.DataFrame(bow_vec, columns=vocab)
df  # bag of words vectorized representation
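The corpus above deliberately mixes upper case and title case words, and a plain split() treats 'FEW' and 'Few' as different tokens. The sketch below is an illustrative variation rather than part of the original code. It shows how lowercasing the text before tokenization shrinks the vocabulary. The function name build_vocab and its lowercase parameter are assumptions made for this example.

```python
corpus = [
    'This is the first sentence in our corpus followed by one more sentence to demonstrate Bag of words',
    'This is the second sentence in our corpus with a FEW UPPER CASE WORDS and Few Title Case Words',
]

def build_vocab(docs, lowercase=False):
    """Hypothetical helper: return the sorted list of unique whitespace tokens."""
    tokens = set()
    for doc in docs:
        if lowercase:
            doc = doc.lower()  # fold case so 'FEW' and 'Few' become 'few'
        tokens.update(doc.split())
    return sorted(tokens)

vocab_cased = build_vocab(corpus)
vocab_lower = build_vocab(corpus, lowercase=True)

print(len(vocab_cased))  # 29 tokens when case is preserved
print(len(vocab_lower))  # 25 tokens after lowercasing
```

Whether to fold case depends on the task. It reduces the vector length, but it also discards distinctions that may occasionally carry meaning.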

Stay tuned for the next installment in this series, in which the author will discuss Limitations of Bag of Words.

To download the complete Python code, visit QuantInsti: https://blog.quantinsti.com/bag-of-words/

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from QuantInsti and is being posted with its permission. The views expressed in this material are solely those of the author and/or QuantInsti and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

Posted August 17, 2020
