Recently, I have been interested in applying machine learning to trading. This post contains some of my thoughts regarding a framework for thinking about trading as a machine learning problem, treating trading as a classification or regression problem, and transforming the output of a machine learning model into a trading signal.

# 1. Introduction to Machine Learning Applications to Trading

Machine learning refers to the construction of algorithms that can learn from data and make predictions from it. Fundamentally, the purpose of any trader is to predict the future and enter into positions in financial securities that reflect this view of the future, so how can we make trading a machine learning problem?

First, let us define machine learning with more precision. Suppose we observe a response variable Y and p different predictor variables, X_{1}, X_{2}, …, X_{p}. We assume that there is some relationship between the response variable and the predictor variables in the following form:

Y = f(X) + ε

The interpretation is that Y is some unknown function of X plus a random error term that is independent of X and has mean zero. Machine learning refers to the methods used for estimating f. Why do we want to estimate f? Often we have the predictor variables X available but not the response variable Y. If we have a good estimate of f, then we can make good predictions of Y, the variable that we care about. Other times we are interested in learning more about the relationship between X and Y.
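To make the idea of estimating f concrete, here is a minimal sketch in plain Python. It assumes a single predictor and a linear f, and it fits the estimate by ordinary least squares on synthetic data. The "true" function, the noise level, and the names `fit_linear` and `f_hat` are all invented for illustration; nothing here is specific to trading yet.

```python
import random


def fit_linear(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data: Y = f(X) + error, with f(x) = 2 + 0.5 * x chosen arbitrarily.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 0.1) for x in xs]

intercept, slope = fit_linear(xs, ys)


def f_hat(x):
    """Estimated f: predicts Y when only X is observed."""
    return intercept + slope * x


print(f"estimated f(x) = {intercept:.3f} + {slope:.3f} * x")
```

With enough observations the fitted intercept and slope land close to the values used to generate the data, which is what "a good estimate of f" means in this toy setting.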

Back to trading. As I said before, fundamentally, the purpose of any trader is to predict the future and specifically the future returns of some financial security. So to make trading a machine learning problem, all we have to do is let the response variable Y represent future returns. The future returns can be on any time scale depending on your trading strategy, from the next tick (for high frequency strategies) to the next 12 months or beyond (for long-term strategies).

In the context of trading, what is X? X could be anything that is available to us today that we suspect has some relationship with future returns. X could be macroeconomic indicators, fundamental data, market data, or anything else that you can think of.

Suppose you have a set of historical training observations where each observation contains predictor variables that are known at that point in time as well as the return of some financial security over the next period (the future returns of the security relative to that point in time).

This historical training set can be used to train a machine learning algorithm, and the hope is that the predictions from this algorithm will perform well not only on the training data but also on test observations that were not used to train it.
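A rough sketch of how such a training set might be assembled is shown below. It assumes, purely for illustration, that the predictors are the last few period returns of the security itself and that the test set is simply the most recent slice of the data. The price series, the helper names `make_dataset` and `chronological_split`, and the 70/30 split are all made up.

```python
def make_dataset(prices, n_lags=3):
    """Pair returns known at time t with the return over the next period."""
    returns = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]
    X, y = [], []
    for t in range(n_lags, len(returns)):
        X.append(returns[t - n_lags:t])  # predictors: the n_lags most recent returns
        y.append(returns[t])             # response: the return that follows them
    return X, y


def chronological_split(X, y, train_fraction=0.7):
    """Hold out the most recent observations as the test set (no shuffling)."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]


# Hypothetical closing prices, invented for the example.
prices = [100.0, 101.0, 100.5, 102.0, 101.5, 103.0,
          102.0, 104.0, 103.5, 105.0, 104.0, 106.0]

X, y = make_dataset(prices)
X_train, y_train, X_test, y_test = chronological_split(X, y)
print(len(X_train), "training observations,", len(X_test), "test observations")
```

Splitting by time rather than at random is a deliberate choice here: every predictor in a row is known before the return it is paired with, and every test observation comes after every training observation.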

# 2. Trading as a Classification Versus Regression Problem

Variables can either be quantitative or qualitative. Quantitative variables take on numerical values while qualitative variables take on values from one or more categories. A person’s age is an example of a quantitative variable while a person’s gender is an example of a qualitative variable.

In machine learning terminology, predicting a quantitative variable is called a *regression* problem, while predicting a qualitative variable is called a *classification* problem.

Given that we want to use machine learning to predict future returns of some security, should we treat trading as a regression problem or a classification problem? I have seen implementations using both approaches and each approach has a different set of learning algorithms available.

# 2.1 Trading as a Classification Problem

Let us examine trading as a classification problem first. In this case, the learning method simply predicts whether the future return of some security is positive or negative. The future return can be encoded as a 0 or 1, where 0 represents a negative future return and 1 represents a positive future return.
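The encoding described above might look like the following sketch. The function name `direction_labels` is hypothetical, and treating a return of exactly zero as class 0 is an assumption, since the text does not say how to handle that case.

```python
def direction_labels(future_returns):
    """Encode each future return as 1 (positive) or 0 (not positive)."""
    # Assumption: a return of exactly zero is grouped with the negative class.
    return [1 if r > 0 else 0 for r in future_returns]


labels = direction_labels([0.012, -0.004, 0.0, 0.031, -0.020])
print(labels)
```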

Predicting the future is hard, so sometimes it can be better to simplify the problem to just predicting the direction of future returns instead of the direction and magnitude. This may be sufficient if you are holding for short time periods (one day or less) where the range of potential returns is narrow.

Another consideration for treating trading as a classification problem is that several classification machine learning methods output the probability that the future return will be positive or negative. This can be extremely helpful when it is time to transform the predictions of the machine learning method into a trading signal because probability represents confidence.

A high probability that the future return will be positive or negative implies a strong long or short trading signal. You can even implement logic where the trading signal will be flat (take no position) when the confidence of choosing either positive or negative returns is too low.
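As a sketch of that logic, the function below maps a predicted probability of a positive return to a long, short, or flat signal. The name `signal_from_probability` and the 0.6 confidence threshold are illustrative assumptions, not recommended values.

```python
def signal_from_probability(p_up, threshold=0.6):
    """Return +1 (long), -1 (short), or 0 (flat) from P(future return > 0)."""
    if not 0.0 <= p_up <= 1.0:
        raise ValueError("p_up must be a probability between 0 and 1")
    if p_up >= threshold:
        return 1    # confident that the return will be positive
    if p_up <= 1.0 - threshold:
        return -1   # confident that the return will be negative
    return 0        # not confident either way: take no position


for p in (0.75, 0.50, 0.20):
    print(p, signal_from_probability(p))
```

Raising the threshold widens the flat zone, so the strategy trades less often but only when the model is more confident.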

# 2.2 Trading as a Regression Problem

The argument for treating trading as a regression problem is that traders want to optimize the overall return of the strategy and not optimize the percentage of profitable trades. A strategy that sustains small losses often but has a small chance of an extremely large gain can still be profitable. So clearly predicting the direction and magnitude of future returns is important, and that implies treating trading as a regression problem.

This can be more appropriate for strategies with longer holding periods where the range of potential returns is wide.
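The point about win rate versus overall return can be checked with a little arithmetic. The numbers below are invented: a strategy that loses 1% on 90% of its trades and gains 15% on the remaining 10%.

```python
# Hypothetical trade outcomes, chosen only to illustrate the argument.
p_loss, loss = 0.9, -0.01   # frequent small loss
p_gain, gain = 0.1, 0.15    # rare large gain

win_rate = p_gain
expected_return = p_loss * loss + p_gain * gain

print(f"win rate: {win_rate:.0%}")
print(f"expected return per trade: {expected_return:.2%}")
```

Only one trade in ten is profitable, yet the expected return per trade is positive. A classifier that predicts direction alone cannot see this, because it ignores the magnitudes.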

Predicting future returns as a numerical value also allows for ranking across securities. Suppose a machine learning model predicts the future returns of 100 securities. These 100 securities can then be ranked and a portfolio can be constructed that goes long the top 10 securities and goes short the bottom 10 securities.
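The ranking idea can be sketched as follows. The tickers and predicted returns are synthetic, the helper name `long_short_portfolio` is hypothetical, and equal weighting within the long and short legs is an assumption made for simplicity.

```python
def long_short_portfolio(predicted_returns, n_each=10):
    """Go long the n_each highest predictions and short the n_each lowest.

    predicted_returns maps ticker -> predicted future return.
    Returns a dict mapping ticker -> portfolio weight.
    """
    if 2 * n_each > len(predicted_returns):
        raise ValueError("not enough securities for the requested legs")
    ranked = sorted(predicted_returns, key=predicted_returns.get, reverse=True)
    weights = {ticker: 0.0 for ticker in predicted_returns}
    for ticker in ranked[:n_each]:
        weights[ticker] = 1.0 / n_each    # equal-weighted long leg
    for ticker in ranked[-n_each:]:
        weights[ticker] = -1.0 / n_each   # equal-weighted short leg
    return weights


# 100 made-up securities with made-up predictions from -5.0% to +4.9%.
predictions = {f"SEC{i:03d}": (i - 50) / 1000.0 for i in range(100)}
weights = long_short_portfolio(predictions)
longs = sorted(t for t, w in weights.items() if w > 0)
shorts = sorted(t for t, w in weights.items() if w < 0)
print("long:", longs)
print("short:", shorts)
```

Because the long and short legs have equal size, the weights sum to zero, so the sketch produces a dollar-neutral portfolio.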

# 3. Creating the Trading Signal

Once the machine learning model has generated predictions for the future returns of one or more securities, transforming that output into a trading signal is straightforward. There are two simple principles to keep in mind:

- If the model predicts a positive future return, you should go long the security; if it predicts a negative future return, you should go short.
- The more confidence the model has in its prediction or the greater the magnitude of the predicted future return, the stronger the signal should be. I briefly described what a trading signal is in my previous post.

For a classification problem, a simple example is to go long the security when the model predicts a greater than 50% probability of a positive future return and to go short the security when the model predicts a less than 50% probability.

The signal can be continuous (the signal can take values like +0.5 which means enter into a long position with only 50% of allocated capital) or discrete (the signal can only take values -1, 0, and +1, for example). Below is a contrived example of a plot of a trading signal.

The specific implementation can vary depending on the strategy, but the logic that is applied to create the trading signal should conform to these two principles.
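For a regression model, the two principles might translate into code along these lines. Both functions are sketches: the names, the 2% return at which the continuous signal saturates, and the 0.5% dead band for the discrete signal are arbitrary assumptions.

```python
def continuous_signal(predicted_return, scale=0.02):
    """Map a predicted return to a signal in [-1, 1].

    The sign follows the prediction and the size grows with its magnitude,
    reaching full allocation at +/- scale.
    """
    raw = predicted_return / scale
    return max(-1.0, min(1.0, raw))


def discrete_signal(predicted_return, min_abs_return=0.005):
    """Return +1, -1, or 0; stay flat when the prediction is too small."""
    if predicted_return > min_abs_return:
        return 1
    if predicted_return < -min_abs_return:
        return -1
    return 0


for r in (0.05, 0.01, 0.001, -0.01, -0.05):
    print(r, continuous_signal(r), discrete_signal(r))
```

A predicted return of 1% gives a continuous signal of +0.5 here, meaning a long position with half of the allocated capital, while the discrete version simply reports +1.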

# 4. Future Research

My cursory review of the available literature on quantitative trading and machine learning suggests there is a large amount of common ground between the two fields. But the material available on quantitative trading doesn’t place sufficient emphasis on some of the most fundamental techniques in machine learning, such as strict separation of data into training and test sets, cross-validation, attention to the bias-variance tradeoff, and assessment of the quality of model fit.

In my next post, I will implement a simple learning algorithm in R for trading the SPY. The motivation is to slowly build out a suite of R code and functions for use in actual quantitative trading strategies.

This is very interesting. I am looking for a machine learning approach to predict peaks and troughs in ranging data. Is this a classification problem?

Keep up the fantastic work, Kevin! My favourite R quant blog! I've been dabbling a lot with different machine learning strategies in R, most recently strategies that make use of COT reports. There are some really solid libraries in R. Looking forward to future posts! 🙂

Thanks for the kind words. The COT reports look interesting. Seems like a good and clean dataset to explore.

Are you thinking of genetic programming? How about sampling methods?

Nick de Peyster

http://undervaluedstocks.info/

Sorry, I don’t have any first-hand experience in those fields. But thank you for mentioning them; they're something I can add to my list of things to look into.