Machine Learning for Algorithmic Trading in Python: A Complete Guide – Part III

See Part I for an overview and Part II for creating hyperparameters.

Splitting the data into test and train sets

First, let us split the data into the input features and the target values. Here we pass the OHLC data with a one-day lag as the data frame X and the Close values of the current day as y. Note that the feature column names below are in lower-case.

X = df[['open','high','low','close']]
y = df['Close']

In this example, to keep the tutorial short and relevant, I have chosen not to create any polynomial features and to use only the raw data.

If you are interested in other combinations of the input parameters, including higher-degree polynomial features, you can transform the data using the PolynomialFeatures class from the preprocessing module of scikit-learn.

You can find detailed information in the Quantra course on Python for Machine Learning in Finance.
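As a minimal sketch of that transformation, the snippet below expands four raw columns into second-degree polynomial features. The array X_demo and its values are hypothetical and stand in for the lagged OHLC data; the degree is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical lagged OHLC rows (open, high, low, close), for illustration only
X_demo = np.array([
    [100.0, 102.0, 99.0, 101.0],
    [101.0, 103.0, 100.0, 102.5],
])

# degree=2 adds squares and pairwise products;
# include_bias=False drops the constant column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

# 4 raw columns + 10 second-degree terms = 14 columns
print(X_poly.shape)
```

The first four output columns are the raw inputs, so the original data is still available to the regression alongside the new terms.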

Now, let us also create a dictionary that holds the size of the train data set and its corresponding average prediction error.

# Maps each train data set size to its average prediction error
avg_err = {}

Getting the best-fit parameters to create a new function

I want to measure the performance of the regression function as a function of the size of the input dataset. In other words, I want to see whether increasing the amount of input data reduces the error. For this, I used a for loop to iterate over the same data set but with different lengths.

If you are interested, explore the ‘reset’ command and how it can help us make a more reliable prediction.

(Hint: It is one of the IPython magic commands.)

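import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Fill missing values once, before any model is fit
imp = SimpleImputer()
X = imp.fit_transform(X)

# Candidate values for the grid search
param_grid = {'alpha': [0.1, 0.5, 1.0], 'max_iter': [1000, 2000, 5000]}

for t in np.arange(50, 97, 3):
    # Use the first t percent of the rows as the train set
    split = int(t * len(X) / 100)

    # Cross-validated search for the best alpha and max_iter
    reg = GridSearchCV(estimator=Lasso(), param_grid=param_grid)
    reg.fit(X[:split], y[:split])
    best_alpha = reg.best_params_['alpha']
    best_iter = reg.best_params_['max_iter']

    # Fit a plain Lasso regression with the best parameters
    reg1 = Lasso(alpha=best_alpha, max_iter=best_iter)
    reg1.fit(X[:split], y[:split])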

Let me explain what I did in a few steps.

First, I created a set of evenly spaced numbers ‘t’ starting from 50 and increasing in steps of 3 up to, but not including, 97, so the last value is 95. The purpose of these numbers is to choose the percentage of the dataset that will be used as the train data set.

Second, for a given value of ‘t’, I computed the split index as this percentage of the length of the data set, rounded down to an integer. Then I divided the total data into train data, which includes the data from the beginning till the split, and test data, which includes the data from the split till the end. The reason for adopting this approach and not using a random split is to maintain the continuity of the time series.
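The two steps above can be sketched in isolation as follows. The helper chronological_split, the array data, and the row count n_rows are hypothetical names introduced only for this illustration; they are not part of the original code.

```python
import numpy as np

n_rows = 200  # hypothetical number of rows in the data set
data = np.arange(n_rows)  # stand-in for time-ordered observations

def chronological_split(values, train_pct):
    """Return (train, test) slices, keeping the original time order."""
    split = int(train_pct * len(values) / 100)  # int() rounds down
    return values[:split], values[split:]

# 50, 53, ..., 95: the stop value 97 is excluded by np.arange
train_pcts = np.arange(50, 97, 3)

# Split using the smallest train percentage
train, test = chronological_split(data, train_pcts[0])
```

Because the slices are contiguous, every train observation precedes every test observation, which is the property a random split would not preserve.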

After this, I pull the parameters that gave the lowest cross-validation error and use them to create a new model, reg1, which is a simple Lasso regression fit with the best parameters.

Stay tuned for the next installment in this series for more details on how to test and train the algo.
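To see this step on its own, here is a minimal sketch on synthetic data. The arrays X_demo and y_demo, the random seed, and the explicit cv=5 are assumptions made for illustration; the parameter grid mirrors the one used in the loop above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only: y is linear in four features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

param_grid = {'alpha': [0.1, 0.5, 1.0], 'max_iter': [1000, 2000, 5000]}
search = GridSearchCV(estimator=Lasso(), param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)

# Rebuild a plain Lasso model from the winning parameter combination
best = search.best_params_
model = Lasso(alpha=best['alpha'], max_iter=best['max_iter'])
model.fit(X_demo, y_demo)
```

With the default refit=True, GridSearchCV also exposes the refitted model as search.best_estimator_, so the manual refit is mainly useful when you want a standalone estimator object.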

Originally posted on QuantInsti Blog.
