The Boruta-Shap Algorithm: A CPU and GPU Version

Posted July 23, 2024 at 1:00 pm
José Carlos Gonzáles Tanaka
QuantInsti

After you do feature engineering, assessing feature importance is a key step before you deploy a strategy backtesting code. Boruta-Shap is a viable tool for that purpose. However, the algorithm can take a long time to run on large datasets. This article provides an implementation of the algorithm that uses CPU parallelism and a GPU to make it run faster. The code is implemented with the XGBoost library and the concurrent.futures module for CPU parallelism.

We will cover:

  • What is the Boruta-Shap algorithm?
  • How the Boruta-Shap algorithm works
  • Significance of Boruta-Shap
  • Accelerating Boruta-Shap Algorithm
  • A CPU-and-GPU-based algorithm to run the Boruta-Shap algorithm faster

What is the Boruta-Shap algorithm?

The Boruta-Shap algorithm is a powerful technique for feature selection, especially in machine learning and data science applications. It combines the Boruta feature selection process with Shapley values to enhance the feature importance assessment.


How the Boruta-Shap algorithm works

The Boruta-Shap algorithm works in the following way:

  • First, we create shuffled versions of all the input features.
  • Second, Boruta identifies a tentative set of important features with a machine learning model, by comparing each feature's usefulness against that of its shuffled version.
  • Then, Shapley values are calculated for these tentative features using the same model (often a tree-based model such as a Random Forest or a Gradient Boosting Machine).
  • The Shapley values provide a more nuanced understanding of feature importance, capturing interactions between features and their impact on model predictions.
  • Finally, features are ranked based on their Shapley values, helping to prioritize the most influential features for model training and interpretation.
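
For intuition, here is a minimal single-trial sketch of the idea above (the data, column names, and model settings are illustrative assumptions, not the article's final implementation):

# A single Boruta-Shap trial, sketched with synthetic data
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRFClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=['f1', 'f2', 'f3'])
y = (X['f1'] + 0.1 * rng.normal(size=500) > 0).astype(int)  # only f1 is informative

# Create shuffled ("shadow") copies of every feature and train on both sets
X_shuffle = X.apply(np.random.permutation)
X_shuffle.columns = [col + '_shuffle' for col in X.columns]
X_boruta = pd.concat([X, X_shuffle], axis=1)

model = XGBRFClassifier(n_estimators=100, random_state=0).fit(X_boruta, y)
shap_values = shap.TreeExplainer(model).shap_values(X_boruta)

# Mean absolute SHAP value per column; a real feature scores a "hit" when it
# beats the best shuffled feature
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X_boruta.columns)
best_shuffled = importance[X_shuffle.columns].max()
print(importance[X.columns] > best_shuffled)

Repeating this trial many times and counting the hits per feature is what the accelerated implementation below does.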

Significance of Boruta-Shap

The Boruta-Shap algorithm has the following benefits.

  • Robustness – it can produce accurate feature importance rankings even for noisy, high-dimensional datasets.
  • Interpretability is aided by the use of Shapley values, which provide information on how each feature affects model predictions.
  • Boruta-Shap considers feature interactions and the value of individual features, which is important in complex datasets.
  • This algorithm is used after feature engineering and before you backtest or optimize a strategy.

Visit the QuantInsti website to watch the clip in which industry expert and renowned author Dr. Ernest Chan talks about Financial Data Science & Feature Engineering and shares his knowledge: https://blog.quantinsti.com/boruta-shap-gpu-python/

Accelerating Boruta-Shap Algorithm

Despite Boruta-Shap's strength, its computational cost can be high, particularly for large datasets with many features. To address this, I've included Boruta-Shap code that uses the CPU and GPU in tandem to speed up its execution. Cool, right?

This approach drastically cuts computation time by effectively allocating the workload and utilizing the parallel processing powers of both CPUs and GPUs.


A CPU-and-GPU-based algorithm to run the Boruta-Shap algorithm faster

Let's dissect the code. Depending on the number of cores available in your CPU, the code groups the trials into buckets, and the trials in each bucket run in parallel. We use a modified version of the code provided by Moosa Ali (2022), who implements the CPU-based algorithm.

Let’s code!

# Import the necessary libraries
import scipy as sp
import scipy.stats  # ensure sp.stats is available (SciPy subpackages aren't always auto-imported)
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRFClassifier
from xgboost import XGBRFRegressor
from sklearn.preprocessing import LabelEncoder
from concurrent import futures


The following function is responsible for computing the minimum number of trials needed as a threshold to accept an input feature as a selected feature based on the probability mass function (pmf) and a significance level. It iterates over the pmf and accumulates the probabilities until the cumulative probability exceeds the significance level.

# Set the minimum number of trials as a threshold number to accept an input feature as a selected feature
def get_tail_items(pmf, significance_level=0.05):
   # Set total to zero
   total = 0
   # Create a loop based on the probability mass function
   for i, x in enumerate(pmf):
       # Increment the total variable with the probability “x” of i
       total += x
       # If total is higher than the significance level
       if total >= significance_level:
           # Break the code
           break
   # Return i
   return i

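As a quick worked example (assuming the imports above), with 20 trials and a 5% significance level the threshold comes out as follows:

# Worked example: lower tail of a Binomial(20, 0.5) distribution
pmf = [sp.stats.binom.pmf(x, 20, 0.5) for x in range(20 + 1)]
thresh = get_tail_items(pmf, significance_level=0.05)
print(thresh)  # 6: the cumulative probability first reaches 5% at 6 hits

With a threshold of 6, a feature would need at least 20 - 6 = 14 hits to be accepted by the next function.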

The next function selects features based on the number of hits they receive during the trials. It categorizes features into two zones:

  • green zone (features whose hits reach or exceed the upper threshold), and
  • blue zone (features whose hits fall between the lower and upper thresholds).

# select features from n number of trials
def choose_features(feature_hits, TRIALS, thresh):
   # Define the  boundaries for the green zone
   # Define the green zone threshold
   green_zone_thresh = TRIALS - thresh
   # Define the blue zone upper threshold
   blue_zone_upper = green_zone_thresh
   # Define the blue zone lower threshold
   blue_zone_lower = thresh


   # Select the input features as green whenever their hits are higher than the green zone threshold
   green_zone = [key for key, value in feature_hits.items() if value >= green_zone_thresh]
   # Select the input features as blue whenever their hits are between the blue zone lower threshold and the blue zone upper threshold 
   blue_zone = [key for key, value in feature_hits.items() if (value >= blue_zone_lower and value < blue_zone_upper)]
   return green_zone, blue_zone

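For instance, with the threshold of 6 obtained above and some hypothetical hit counts (the feature names are made up for illustration):

# Hypothetical hit counts after 20 trials, using thresh = 6
feature_hits = {'rsi': 18, 'macd': 9, 'volume': 3}
green, blue = choose_features(feature_hits, TRIALS=20, thresh=6)
print(green)  # ['rsi']  -> 18 >= 20 - 6 = 14, accepted
print(blue)   # ['macd'] -> 6 <= 9 < 14, tentative; 'volume' (3 hits) is rejected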

The last function below is the main function implementing the Boruta-Shap algorithm. It takes the input data X and the target variable y, along with optional parameters such as trials, workers, significance_level, and seed.

Find below what the function does:

  1. Set the seed
  2. It initializes a dictionary features_hits to track the number of hits for each feature.
  3. Shuffled column names are generated for feature shuffling.
  4. The data is split into training and testing sets.
  5. Label encoding is applied to the target variable y.
  6. A classification model (XGBRFClassifier, a random-forest model from the XGBoost library) is defined. To make the classifier run on a GPU, you just need to set tree_method to ‘gpu_hist’. Writing a GPU-based model from scratch would be quite complex; alternatively, you could build one with the RAPIDS libraries.
  7. The features_hits_func function is defined to perform feature shuffling, model fitting, and Shapley value computation for each trial. This function can be run within a loop for each trial or all the trials can be computed in parallel with the CPU.
  8. Multi-threading combined with a loop is used to run multiple trials concurrently. We group the full range of trials into buckets according to the number of workers (threads) available; see the short sketch after this list. For example, if we have 25 trials and 10 threads to use:
  9. We define params_list_for_loop as the first 20 trials and last_params_list as the last 5 trials. We will run the features_hits_func function for the first 10 trials in parallel.
  10. Once this is run, we iterate to the next 10 trials, which will be run in parallel, too.
  11. Once we’re done with that, we finally run the last 5 trials in parallel.
  12. After all trials, the probability mass function is calculated, and the minimum number of trials as a threshold is determined.
  13. Features are classified into green, blue, or rejected based on the thresholds and hits received.
  14. The function returns the selected features. In case no features were selected, we select all.
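
To make the bucketing in steps 8 to 11 concrete, here is a small illustrative calculation that mirrors what the multithreading_loop function below does with 25 trials and 10 workers:

# Illustrative bucketing of 25 trials across 10 workers
import numpy as np

trials, workers = 25, 10
params_list = [*range(trials)]
num_lists = int(np.floor(len(params_list) / workers))     # 2 full buckets
params_list_for_loop = params_list[:num_lists * workers]  # trials 0 to 19
last_params_list = params_list[num_lists * workers:]      # trials 20 to 24
buckets = [params_list_for_loop[workers * i: workers * (i + 1)] for i in range(num_lists)]
print(buckets)           # [[0, ..., 9], [10, ..., 19]]: each bucket runs in parallel
print(last_params_list)  # [20, 21, 22, 23, 24]: the final, smaller parallel batch
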
def boruta_shap_algorithm(X, y, trials=20, workers=2, significance_level=0.05, seed=2024):
   # Set the seed
   np.random.seed(seed)       


   # Assert that the number of samples of both data match
   assert X.shape[0] == y.shape[0], "X and y dimensions don't coincide"


   # Set a dictionary to save the number of hits for each feature
   features_hits = {feature:0 for feature in X.columns}


   # Create the names of all the features shuffled
   shuffled_col_names = [str(column+'_shuffle') for column in X.columns]


   # Set the train and test X data
   X_train, X_test = X.iloc[:int(0.8*len(X))], X.iloc[int(0.8*len(X)):]


   # Set the label encoder object
   le = LabelEncoder()


   # Transform the y series into encoded labels usable by the machine-learning model
   label_encoded = le.fit_transform(y)


   # Transform the encoded label into a Pandas series
   y = pd.Series(data=label_encoded, index=y.index, name='y')


   # Set the y train data
   y_train = y.iloc[:int(0.8*len(y))]
  
   # define the model
   classifier = XGBRFClassifier(n_estimators=100, subsample=1, colsample_bynode=1, tree_method='gpu_hist', random_state=seed) 


   # Define a function to count, for one trial, the hits of each feature (a hit means the feature's SHAP importance beats the best shuffled feature)
   def features_hits_func(trial):
       # Set the seed for the trial
       np.random.seed(seed+trial)


       # Set the X train data for the shuffled features
       X_shuffle_train = X_train.apply(np.random.permutation)
       # Set the names for the X train shuffled features
       X_shuffle_train.columns = shuffled_col_names
       # Set the X-test data for the shuffled features
       X_shuffle_test = X_test.apply(np.random.permutation)
       # Set the names for the X-test shuffled features
       X_shuffle_test.columns = shuffled_col_names


       # Set the whole input features for the Boruta-Shap algorithm training
       X_boruta_train = pd.concat([X_train, X_shuffle_train], axis=1)
       # Set the whole input features for the Boruta-Shap algorithm test data
       X_boruta_test = pd.concat([X_test, X_shuffle_test], axis=1)


       # Fit the model
       model = classifier.fit(X_boruta_train, y_train)


       # Set the explainer object
       explainer = shap.TreeExplainer(model)


       # Get the Shap values for each feature
       shap_values = explainer.shap_values(X_boruta_test)


       # Set the mean value of each feature's Shap values
       features_importance = np.array(np.abs(shap_values).mean(0))
       # Set a dataframe with the above features' importance
       features_importance_df = pd.DataFrame(data=features_importance, index=X_boruta_test.columns, columns=['Values'])


       # Subset the feature importance dataframe with the non-shuffled features
       feature_imp_X = features_importance_df.iloc[:len(X.columns)]
       # Subset the feature importance dataframe with the shuffled features
       feature_imp_shuffled = features_importance_df.iloc[len(X.columns):]


       # Add one hit in case the feature is better than the best Shap value of all the shuffled features
       for feature in feature_imp_X.index:
           features_hits[feature] += int(feature_imp_X.loc[feature,'Values'] > feature_imp_shuffled['Values'].max())


   # Define a function to run multiple trials as per the maximum number of cores available in your CPU
   def multithreading_loop(function, params_list):
       # Set the number of lists we'll have as per the number of cores
       num_lists = int(np.floor(len(params_list)/workers))
       # Set the params list to be used to loop
       params_list_for_loop = params_list[:int(num_lists*workers)]
       # If the number of trials in the above list is higher than the num_lists
       if len(params_list)>int(num_lists*workers):
           # Create the last params list to be used to multithread the computations
           last_params_list = params_list[int(num_lists*workers):]


       # For each list of trials
       for i in range(0,num_lists):
           # Use the number of cores for the futures library executor
           with futures.ThreadPoolExecutor(workers) as executor:
               # Run the features_hits_func function to compute the hits in parallel
               list(executor.map(function, params_list_for_loop[int(workers*i):int(workers*(i+1))]))
       # Once you finish the above, run the last trials to be computed in parallel
       if len(params_list)>int(num_lists*workers):
           # Use the number of cores for the futures library executor
           with futures.ThreadPoolExecutor(len(last_params_list)) as executor:
               # Run the features_hits_func function to compute the hits in parallel
               list(executor.map(function, last_params_list))


   # Set the range for the number of trials as a list
   trials_list = [*range(trials)]


   # Run the loop to compute the trials in parallel in buckets
   multithreading_loop(features_hits_func, trials_list)
  
   # Calculate the probability mass function: Get the Binomial distribution in "trials" number of buckets
   pmf = [sp.stats.binom.pmf(x, trials, .5) for x in range(trials + 1)]


   # Set the minimum number of trials as the threshold to classify an input feature as a selected feature
   thresh = get_tail_items(pmf, significance_level)
  
   # green are the accepted features, blue are the tentative features
   green, blue = choose_features(features_hits, trials, thresh)


   # If there are green features
   if len(green) != 0:
       # Return the green features
       return green
   # If there aren't green features
   else:
       # If there are blue features
       if len(blue) != 0:
           # Return the blue features
           return blue
       # If there aren't blue features
       else:
           # Return all the features
           return X.columns.tolist()

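As a quick usage sketch (the feature names and the synthetic data are illustrative assumptions; a CUDA-capable GPU is needed because the classifier uses tree_method='gpu_hist'):

# Illustrative call of boruta_shap_algorithm on synthetic data
import numpy as np
import pandas as pd

rng = np.random.default_rng(2024)
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=['rsi', 'macd', 'atr', 'volatility', 'volume_change'])
y = pd.Series(np.sign(X['rsi'] + rng.normal(size=1000)), name='signal')  # hypothetical binary target

selected_features = boruta_shap_algorithm(X, y, trials=20, workers=4)
print(selected_features)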


Conclusion

You have learned how to create the Boruta-Shap algorithm using both the CPU and GPU. You’ll see a great difference, compared with using only the CPU, if you use a dataframe with many observations. Besides, the higher the number of threads and cores, the better the parallelism and the quicker the loop will run.

What's next, you may ask?
Well, you can use the above code to get the feature importance before you backtest a strategy. We suggest you run the Boruta-Shap algorithm before you optimize a strategy's parameters. You can find the source file below.

In case you want to learn more about machine learning, check out this learning track! You'll learn the basics of machine learning in finance.

Now that you’ve grasped the power of Boruta Shap for identifying key features, you might be wondering how to put it into practice for real-world problems. Here’s where things get exciting! This Machine Learning & Deep Learning for Trading course by Quantra helps you learn these techniques for building advanced trading strategies. You’ll not only learn the theory behind Boruta Shap but also gain hands-on experience implementing it to select the most impactful features for your own trading algorithms.

It’s the perfect next step to turn your newfound knowledge into action!
Happy Learning!

Originally posted on QuantInsti blog.


Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from QuantInsti and is being posted with its permission. The views expressed in this material are solely those of the author and/or QuantInsti and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
