The fundamental idea behind predictive modeling is that indicators may contain information that can be used to predict a forward-looking variable, called a target. The task of a predictive model is to find and exploit any such information.
Date Trend Volatility Day_return
19950214 0.251 1.572 0.144
19950215 0.101 1.778 0.055
19950216 -0.167 2.004 -0.013
...
Suppose we provide several years of this data to a model and ask it to learn how to predict day_return, the one day forward return, from two indicators, one called trend and the other volatility. In the lingo of machine-learning this process is called model training. Then, we may at a later date calculate from recent prices that trend=0.225 and volatility=1.244 as of that day. The trained model may then make a prediction that the target variable day_return will be 0.152. (These are all made-up numbers.) Based on this prediction that the market is about to rise substantially, we may choose to take a long position.
Converting Predictions to Trade Decisions
Intuition tells us that we should put more faith in extreme predictions than in more common predictions near the center of the model’s prediction range. If a model predicts that the market will rise by 0.001 percent tomorrow, we would not be nearly as inclined to take a long position as if the model predicts a 5.8 percent rise. This intuition is correct, because our research has shown that, in general, there is a large correspondence between the magnitude of a prediction and the likelihood of success of the associated trade. Predictions of large magnitude are more likely to signal profitable market moves than predictions of small magnitude.
The standard method for making trade decisions based on predicted market moves is to compare the prediction to a fixed threshold. If the prediction is greater than or equal to an upper threshold (usually positive), take a long position. If the prediction is less than or equal to a lower threshold (usually negative), take a short position. The holding period for a position is implicit in the definition of the target. This will be discussed in detail in our book Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments (SSML).
It should be obvious that the threshold determines a tradeoff between the number of trades and the accuracy rate of those trades. If we set a threshold near zero, the magnitude of the predictions will frequently exceed the threshold, and a position will be taken often. Such trades carry a relatively high rate of failure. Conversely, if we set a threshold that is far from zero, predicted market moves will only rarely lie beyond the threshold, so trades will be rare but have a relatively high success rate. We already noted that there is a large correspondence between the magnitude of a prediction and the likelihood of a trade’s success.
Thus, by choosing an appropriate threshold, we can control whether we have a system that trades often but with only mediocre accuracy, or a system that trades rarely but with excellent accuracy.
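The threshold rule described above is simple enough to sketch in a few lines of Python. This is only an illustration: the function name and the threshold values in the usage lines are made up for this example and are not taken from TSSB.

```python
def trade_signal(prediction, long_threshold, short_threshold):
    """Return +1 (take a long position), -1 (take a short position), or 0 (stay out).

    Hypothetical helper for illustration only; not part of TSSB.
    """
    if prediction >= long_threshold:
        return 1
    if prediction <= short_threshold:
        return -1
    return 0


# Made-up thresholds: a wide pair trades rarely, a narrow pair trades often.
print(trade_signal(0.152, long_threshold=0.10, short_threshold=-0.10))  # 1
print(trade_signal(0.001, long_threshold=0.10, short_threshold=-0.10))  # 0
```

Moving the two thresholds away from zero makes the function return 0 more often, which is exactly the tradeoff between trade frequency and accuracy discussed above.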
TSSB automatically chooses optimal long and short thresholds by choosing them so as to maximize the profit factor for long systems and short systems separately. Profit factor, a common metric of trading system performance, is the ratio of total gains on successful trades to total losses on failed trades. In order to prevent degenerate situations in which there is only one trade or very few trades, the user specifies a minimum number of trades that must be taken, either as an absolute number or as a minimum fraction of bars. In addition, TSSB has an option for using two thresholds on each side (long and short) so as to produce two sets of signals, one set for ‘normal reliability’ trades, and a more conservative set for ‘high reliability’ trades. Finally, in many applications, TSSB prints tables that show performance figures that would be obtained with varying thresholds.
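To make the idea concrete, here is a minimal sketch of profit factor and a brute-force search for a long threshold, subject to a minimum trade count. It is not TSSB's actual algorithm, whose details are not given here. The function names, the candidate-threshold scheme (every observed prediction is tried), and the tiny dataset are all assumptions made for illustration.

```python
def profit_factor(trade_returns):
    """Total gains on winning trades divided by total losses on losing trades."""
    gains = sum(r for r in trade_returns if r > 0)
    losses = -sum(r for r in trade_returns if r < 0)
    if losses == 0:
        # Assumed convention for the no-losing-trades case.
        return float("inf") if gains > 0 else 0.0
    return gains / losses


def best_long_threshold(predictions, targets, min_trades):
    """Try each observed prediction as a long threshold and keep the one
    with the highest profit factor among those yielding >= min_trades trades.

    Returns (profit_factor, threshold), or None if no threshold qualifies.
    """
    best = None
    for threshold in sorted(set(predictions)):
        trades = [t for p, t in zip(predictions, targets) if p >= threshold]
        if len(trades) < min_trades:
            continue  # guard against degenerate systems with too few trades
        pf = profit_factor(trades)
        if best is None or pf > best[0]:
            best = (pf, threshold)
    return best


# Made-up predictions and realized target returns.
preds = [0.1, 0.2, 0.3, 0.4]
rets = [-1.0, 0.5, -0.5, 2.0]
print(best_long_threshold(preds, rets, min_trades=2))  # (5.0, 0.2)
```

Without the `min_trades` guard, the search would pick the threshold 0.4, which produces a single winning trade and an infinite profit factor. That is the degenerate situation the minimum trade count is meant to prevent. A short threshold could be searched the same way with the comparison and the sign of the returns reversed.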
Computation of thresholds and interpretation of trade results based on predictions relative to these thresholds are advanced topics that will be discussed in detail in SSML. For now, the reader needs to understand only the following concepts:
- The user specifies indicator variables based on recent observed history and target variables that portray future price movement.
- TSSB is given raw historical market data (prices and perhaps other data, such as volume) and it generates an extensive database of indicator and target variables. One or more models are trained to predict the target given a set of indicators. In other words, the model learns to use the predictive information contained in the indicators in order to predict the future as exemplified by the target.
- Every time a prediction is made, the numerical value of this prediction is compared to a long or upper threshold. If the prediction is greater than or equal to the long threshold, a long position is taken. Similarly, the prediction is compared to a short or lower threshold, which will nearly always be less than the long threshold. If the prediction is less than or equal to the short threshold, a short position is taken.
- The holding period for a position is inherent in the target variable. This is discussed in detail in SSML.
- TSSB will report results for long and short systems separately, as well as net results for the combined systems.
Testing the Trading System
TSSB provides the ability to perform many tests of a predictive model trading or filtering system. The available testing methodologies will be discussed in detail in SSML. However, so that the reader may understand the elementary trading/filtering system development and evaluation presented in the next chapter, we now discuss two general testing methodologies: cross validation and walkforward testing. These are the primary standards in many prediction applications, and both are available in TSSB in a variety of forms.
The principle underlying the vast majority of testing methodologies, including those included in TSSB, is that the complete historical dataset available to the developer is split into separate subsets. One subset, called the training set or the development set, is used to train the predictive model. The other subset, called the test set or the validation set, is used to evaluate performance of the trained model. (Note that the distinction between the terms test set and validation set is not consistent among experts, so the increasingly common convention is to use them interchangeably. The same is true of training set and development set.)
The key here is that no data that takes part in the training of the model is permitted to take part in its performance evaluation. Under fairly general conditions, this mutually exclusive separation guarantees that the performance measured in the test set is an unbiased estimate of future performance. In other words, although the observed performance will almost certainly not exactly equal the performance that will be seen in the future, it does not have a systematic bias toward optimistic or pessimistic values. Having an unbiased estimate of future performance is one of the two main goals of a trading system development and testing operation. The other goal is being able to perform a statistical significance test to estimate the probability that the performance level achieved could have been due to good luck. This advanced concept is beyond the scope of this brief overview, but it is discussed in depth in SSML.
In the earliest days of model building and testing, when high speed computers were not readily available, splitting of the data into a training set and a test set was done exactly once. The developer would typically train the model using data through a date several years prior to the current date, and then test the model on subsequent data, ending with the most recent data available. This is an extremely inefficient use of the data. TSSB makes available both cross validation and walkforward testing. These techniques split the available data into training sets and test sets many times, and pool the performance statistics into a single unbiased estimate of the model-based trading system’s true capability. This extensive reuse of the data for both training and testing makes efficient use of precious and limited market history.
Walkforward Testing
Walkforward testing is straightforward, intuitive, and widely used. The principle is that we train the model on a relatively long block of data that ends a considerable time in the past. We test the trained model on a relatively short section of data that immediately follows the training block. Then we shift the training and testing blocks forward in time by an amount equal to the length of the test block and repeat the prior steps. Walkforward testing ends when we reach the end of the dataset. We compute the net performance figure by pooling all of the test block trades. Here is a simple example of walkforward testing:
1) Train the model using data from 1990 through 2007. Test the model on 2008 data.
2) Train the model using data from 1991 through 2008. Test the model on 2009 data.
3) Train the model using data from 1992 through 2009. Test the model on 2010 data.
Pool all trades from the tests of 2008, 2009, and 2010. These trades are used to compute an unbiased estimate of the performance of the model.
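The stepping pattern in this example can be generated mechanically. The following Python sketch produces the train and test year ranges for a walkforward test. The function name and its parameters are hypothetical and only mirror the example above; they are not TSSB commands.

```python
def walkforward_splits(first_year, last_year, train_years, test_years=1):
    """Return a list of ((train_start, train_end), (test_start, test_end)) tuples.

    Each fold trains on `train_years` consecutive years and tests on the
    `test_years` that immediately follow; the window then advances by the
    length of the test block.
    """
    splits = []
    train_start = first_year
    while train_start + train_years + test_years - 1 <= last_year:
        train_end = train_start + train_years - 1
        test_start = train_end + 1
        test_end = test_start + test_years - 1
        splits.append(((train_start, train_end), (test_start, test_end)))
        train_start += test_years  # shift forward by one test block
    return splits


for train, test in walkforward_splits(1990, 2010, train_years=18):
    print("train", train, "test", test)
# train (1990, 2007) test (2008, 2008)
# train (1991, 2008) test (2009, 2009)
# train (1992, 2009) test (2010, 2010)
```

Changing `test_years` lengthens or shortens the test block, which is one way to set up walkforward runs with different test block lengths.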
The primary advantage of walkforward testing is that it mimics real life. Most developers of automated trading systems periodically retrain or otherwise refine their model. Thus, the results of a walkforward test simulate the results that would have been obtained if the system had been actually traded. This is a compelling argument in favor of this testing methodology.
Another advantage of walkforward testing is that it correctly reflects the response of the model to nonstationarity in the market. All markets evolve and change their behavior over time, sometimes rotating through a number of different regimes. Loosely speaking, this change in market dynamics, and hence in relationships between indicator and target variables, is called nonstationarity. The best predictive models have a significant degree of robustness against such changes, and walkforward testing allows us to judge the robustness of a model.
TSSB’s ability to use a variety of testing block lengths makes it easy to evaluate the robustness of a model against nonstationarity. Suppose a model achieves excellent walkforward results when the test block is very short. In other words, the model is never asked to make predictions for data that is far past the date on which its training block ended. Now suppose the walkforward performance deteriorates if the test block is made longer. This indicates that the market is rapidly changing in ways that the model is not capable of handling. Such a model is risky and will require frequent retraining if it is to keep abreast of current market conditions. On the other hand, if walkforward performance holds up well as the length of the test block is increased, the model is robust against nonstationarity. This is a valuable attribute of a predictive-model based approach to trading system development. Figure 1 depicts the placement of the training and testing blocks (periods) along the time axis in two situations.

The top section of the figure depicts walkforward with very short test blocks. The bottom section depicts very long test blocks. It can be useful to perform several walkforward tests of varying test block lengths in order to evaluate the degree to which the prediction model is robust against nonstationarity.
Walkforward testing has only one disadvantage relative to alternative testing methods such as cross validation: it is relatively inefficient when it comes to use of the available data. Only cases past the end of the first training block are ever used for testing. If you are willing to believe that the indicators and targets are reasonably stationary, this is a tragic waste of data. Cross validation, discussed in the next section, addresses this weakness.
Cross Validation
Rather than segregating all test cases at the end of the historical data block, as is done with walkforward testing, we can evenly distribute them throughout the available history. This is called cross validation. For example, we may test as follows:
1) Train using data from 2006 through 2008. Test the model on 2005 data.
2) Train using data from 2005 through 2008, excluding 2006. Test the model on 2006 data.
3) Train using data from 2005 through 2008, excluding 2007. Test the model on 2007 data.
4) Train using data from 2005 through 2008, excluding 2008. Test the model on 2008 data.
This idea of withholding interior ‘test’ blocks of data while training with the surrounding data is illustrated in Figure 2 below. In cross validation, each step is commonly called a fold.
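As a rough sketch, the four folds listed above can be produced by holding out one year at a time. The function below is a hypothetical illustration using whole years as blocks; it is not how TSSB is configured.

```python
def cross_validation_folds(years):
    """Return a list of (training_years, test_year) pairs, one fold per year.

    Each year is withheld once as the test block while all remaining
    years, earlier and later, are used for training.
    """
    folds = []
    for test_year in years:
        training_years = [y for y in years if y != test_year]
        folds.append((training_years, test_year))
    return folds


for training_years, test_year in cross_validation_folds([2005, 2006, 2007, 2008]):
    print("train", training_years, "test", test_year)
# train [2006, 2007, 2008] test 2005
# train [2005, 2007, 2008] test 2006
# train [2005, 2006, 2008] test 2007
# train [2005, 2006, 2007] test 2008
```

Every year appears exactly once as a test block, which is the property that distinguishes this scheme from walkforward testing.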

The obvious advantage of cross validation over walkforward testing is that every available case becomes a test case at some point. However, there are several disadvantages to note. The most serious potential problem is that cross validation is sensitive to nonstationarity. In a walkforward test, only relatively recent cases serve as test subjects. But in cross validation, cases all the way back to the beginning of the dataset contribute to test performance results. If the behavior of the market in the early days was so different from that of later days that the relationship between indicators and the target has seriously changed, incorporating test results from those early days may not be advisable.
Another disadvantage is more philosophical than practical, but it is worthy of note. Unlike a walkforward test, cross validation does not mimic the real-life behavior of a trading system. In cross validation, except for the last fold, we are using data from the future to train the model being tested. In real life this data would not be known at the time that test cases are processed. Some skeptics will raise their eyebrows at this, even though when done correctly it is legitimate, providing nearly unbiased performance estimates. Finally, overlap problems, discussed in the next section, are more troublesome in cross validation than in walkforward tests.
Overlap Considerations
The discussions of cross validation and walkforward testing just presented assume that each case is independent of other cases. In other words, the assumption is that the values of variables for a case are not related to the values of other cases in the dataset. Unfortunately, this is almost never the situation. Cases that are near one another in time will tend to have similar values of indicators and/or targets. This generally comes about in one or both of the following ways:
- Many of the targets available in TSSB look further ahead than just the next bar. For example, suppose our target is the market trend over the next ten bars. This is the quantity we wish to predict in order to make trade decisions. If this value is high on a particular day, indicating that the market trends strongly upward over the subsequent ten days, then in all likelihood this value will also be high the following day, and it was probably high the prior day. Shifting ahead or back one day still leaves an overlap of nine days in that ten-day target window. Such case-to-case correlation in time series data is called serial correlation.
- In most trading systems, the indicators look back over a considerable time block. For example, an indicator may be the market trend over the prior 50 days, or a measure of volatility over the prior 100 days. As a result, indicators change very slowly over time. The values of indicators for a particular day are almost identical to the values in nearby days, before and after.
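The nine-day overlap mentioned in the first point can be checked with a little arithmetic. The Python sketch below counts how many bars two cases share in their look-ahead windows, under the assumption that the target for the case at bar i covers bars i+1 through i+lookahead. The function name and this window convention are illustrative choices, not definitions taken from TSSB.

```python
def window_overlap(case_a, case_b, lookahead):
    """Number of bars shared by the look-ahead target windows of two cases.

    Assumes the target for a case at bar i spans bars i+1 .. i+lookahead.
    """
    start_a, end_a = case_a + 1, case_a + lookahead
    start_b, end_b = case_b + 1, case_b + lookahead
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


print(window_overlap(100, 101, lookahead=10))  # 9: adjacent cases share nine bars
print(window_overlap(100, 110, lookahead=10))  # 0: ten bars apart, no shared bars
```

The same counting applies in mirror image to look-back indicators: two adjacent days with a 50-day look-back share 49 of their 50 bars, which is why such indicators change so slowly.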
These facts have several important implications. Because indicators change only slowly, the model’s predictions also change slowly. Hence market positions change slowly; if a prediction is above a threshold, it will tend to remain above the threshold for multiple bars. Conversely, if a prediction is below a threshold, it will tend to remain below that threshold for some time. If the target is looking ahead more than one bar, which results in serial correlation as discussed above, then the result of serial correlation in both positions and targets is serial correlation in returns for the trading system. This immediately invalidates most common statistical significance tests such as the t-test, ordinary bootstrap, and Monte-Carlo permutation test. TSSB does include several statistical significance tests that can lessen the impact of serial correlation. In particular, the stationary bootstrap and tapered block bootstrap will be discussed elsewhere in SSML. Unfortunately, both of these tests rely on assumptions that are often shaky. We’ll return to this issue in more detail later when statistical tests are discussed. For the moment, understand that targets that look ahead more than one bar usually preclude tests of significance or force one to rely on tests having questionable validity.
Lack of independence in indicators and targets has another implication, this one potentially more serious than just invalidating significance tests. The legitimacy of the test results themselves can be undermined by bias. Luckily, this problem is easily solved with a TSSB option called OVERLAP. Its details are discussed in SSML. For now we will simply explore the nature of the problem.
The problem occurs near the boundaries between training data and test data. The simplest situation is for walkforward testing, because there is only one (moving) boundary. Suppose the target involves market movement ten days into the future. Consider the last case in the training block. Its target involves the first ten days after the test block begins. This case, like all training set cases, plays a role in the development of the predictive model. Now consider the case that immediately follows it, the first case in the test block. As has already been noted, its indicator values will be very similar to the indicator values of the prior case. Thus, the model’s prediction will also be similar to that of the prior case. Because the target looks ahead ten days and we have moved ahead only one day, leaving a nine-day overlap, the target for this test case will be similar to the target for the prior case. But the prior case, which is practically identical to this test case, took part in the training of the model! So we have a strong prejudice for the model to do a good job of predicting this case, whose indicators and target are similar to the training case. The result is optimistic bias, the worst sort. Our test results will exceed the results that would have been obtained from an honest test.
This boundary effect manifests itself in an additional fashion in cross validation. Of course, we still have the effect just described when we are near the end of the early section of the training set and the start of the test set. This is the left edge of the red regions in Figure 2. But we also have a boundary effect when we are near the end of the test set and the start of the later part of the training set. This is the right edge of each red region. As before, cases near each other but on opposite sides of the training set / test set boundary have similar values for indicators and the target, which results in optimistic bias in the performance estimate. The bottom line is that bias due to overlap at the boundary between training data and test data is a serious problem for both cross validation and walkforward testing. Fortunately, the user can invoke the OVERLAP option to alleviate this problem.
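One common way to reduce this kind of boundary bias is to drop training cases that lie too close to the test block. The sketch below illustrates that general idea only. It is not a description of how TSSB's OVERLAP option works internally, and the function name and the `gap` parameter are hypothetical. A plausible choice for `gap` is something on the order of the target's look-ahead length, though the right value also depends on how far back the indicators look.

```python
def purge_training_cases(train_indices, test_start, test_end, gap):
    """Keep only training cases at least `gap` bars away from the test block.

    The test block is the inclusive bar range [test_start, test_end].
    Cases are removed on both sides, which covers the walkforward boundary
    (left edge) and the extra cross-validation boundary (right edge).
    """
    return [
        i for i in train_indices
        if i < test_start - gap or i > test_end + gap
    ]


# Bars 0..19, with bars 10..12 withheld as an interior test block.
train = [i for i in range(20) if not 10 <= i <= 12]
print(purge_training_cases(train, test_start=10, test_end=12, gap=2))
# [0, 1, 2, 3, 4, 5, 6, 7, 15, 16, 17, 18, 19]
```

The cost is a slightly smaller training set for each fold. For a walkforward test only the cases just before the test block are affected, since no training data follows it.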
-- By David Aronson
Part 1 of this series can be found here, Predictive-Model Based Trading Systems, Part 1.
David Aronson is a pioneer in machine learning and nonlinear trading system development and signal boosting/filtering. Aronson is co-designer of TSSB (Trading System Synthesis and Boosting), a software platform for the automated development of statistically sound predictive-model-based trading systems. He has worked in this field since 1979 and has been a Chartered Market Technician certified by The Market Technicians Association since 1992. He was an adjunct professor of finance, and regularly taught MBA and financial engineering students a graduate-level course in technical analysis, data mining and predictive analytics. His recently released book, Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments, is an in-depth look at developing predictive-model-based trading systems using TSSB.
I think this is what should be proven, not assumed. I saw no proof of this.
“Conversely, if we set a threshold that is far from zero, predicted market moves will only rarely lie beyond the threshold, so trades will be rare but have a relatively high success rate.”
If this is by design of the algorithm, then this may be fitting. But again this must be proven, not assumed.
Unless the above are proven there is a lot of wishful thinking involved here. I see a lot of hand waving.
Adding to your comment Bob,
What I’d like to know is: what exactly is tested here? Is it:
a) an entry signal?
=> Whether or not a “large” move typically follows that signal.
b) a trading system’s rule set?
=> Not only the entry matters, but also the price path until the exit as it influences the movement of stops as well as profit taking exit(s).
If it is b) then it could very well be that smaller moves allow for tighter initial stops thus actually providing a higher risk multiple & average profit per trade than big moves, that may be a lot more volatile (i.e., max. adverse excursion) and thus requiring larger initial stops.
So coming back to the “large correspondence between the magnitude of a prediction and the likelihood of success of the associated trade. Predictions of large magnitude are more likely to signal profitable market moves than predictions of small magnitude.”
To me “likelihood of success” sounds like “win-rate”, but that’s different from expectancy, which is probably meant by “profitable market moves”. But maybe I am misinterpreting those words.
Thanks for any further insights!
Cheers,
TK
Hi TK
What is being tested is a trading system based on a predictive model. In contrast to a rules-based system where the rules are proposed by a human analyst, trading systems developed with TSSB are based on a statistical model that uses one or more indicators to predict a forward-looking target variable. The target can be defined in many different ways: forward return, hitting a price objective before a stop-loss point, etc. The trading system uses a simple threshold logic where larger predictions are presumed to have a greater likelihood of being accurate. Our research shows this to be the case. Thus a signal to go long or go short is triggered when the prediction exceeds an upper or lower threshold. During TSSB’s development of the trading system, both the predictive model and the signal triggering threshold are designed to maximize the Profit Factor of the trading system.
So what is being tested is the profit factor produced by the combination of the predictive model and the signal thresholds.
“likelihood of success” refers to the likelihood of getting a profitable trading system.
best
David Aronson
Hi David
Thanks for a great blog. See my comments on Part 1. Did you read the interesting paper by Lopez de Prado, and what is your view on his test for data snooping/overfitting?
Hi David and others
I’m interested in developing forex trading systems and my questions to you are the following:
1. Is TSSB ideal for forex too? Are there any contraindications to using it for experimenting with forex systems?
2. What about your book (SSML)? Does it hold the same value for forex as for stock trading?
Thanks in advance!
Bela, it’s my opinion that TSSB and SSML (which is a textbook on using TSSB) are relevant to any market. TSSB is a tool used to help discover trading models, so it’s not limited in what it can be applied to.
Hello All,
I recently bought the book and downloaded the TSSB software. I managed to load a EUR/USD dataset and train a model. Even though I know about machine learning and am a programmer myself, I failed to understand the simplest thing of all: how TSSB prompts for a trading decision. Unfortunately the book is not clear on this, and more work is warranted so that a novice can get a grasp on the simplest of things before going on to more difficult model implementations. If anyone knows more about this please send me an email: hgwelec [at] y a h o o . c o m
Hello all,
I think TSSB is a really good academic tool. I write ‘academic’ because although I have found promising trading systems I am unable to convert them into EasyLanguage or C#. It would be really helpful if some guidance could be given about this otherwise TSSB is ‘just pie in the sky’, that is, you discover promising systems which can never be deployed.
Thanks