When you see the performance of a trading system, how do you know it's okay? How do you know it's the right system for you?

Many people look at net profit, assuming the more profitable system must be the better one. This is often a mistake. When comparing trading systems during development, or before buying one, it helps to have a few metrics on hand that let you compare a system against a hypothetical benchmark or against another system.

There is no single score that will work for everyone, since we each have unique risk tolerances and our own definitions of what we consider tradable. Likewise, not all scoring methods are equal, and none works under all circumstances. In this article, however, I will describe my favorite methods for scoring and ranking trading systems. These are the key performance metrics I use during the system development process.

### Number of Trades

Any trading system should have a "significant" number of trades. What is significant? Well, that varies. For a swing system that takes no more than ten trades a year, 100 trades is good; that represents about ten years of historical testing. The more trades a system produces per year, the more trades I expect to see in its backtest.

Often you will read that 30 trades is the minimum for statistical significance. Is that true? I'm not so sure, as it depends on when those trades took place. Market conditions change, and if those 30 trades did not occur across a variety of market conditions, they may not be a good representation of how the system will perform.

In the end, the more trades, the better. Intraday systems should have hundreds of trades. Longer-term swing systems should have over a hundred trades.

### Profit Factor

While net profit can be a factor in your decision about a particular trading system, profit factor is, in my opinion, often even more important. Profit factor measures the efficiency of your trading system. It is calculated by dividing the gross profit by the gross loss. A profit factor of 1.5 means three dollars are gained for every two dollars lost ($3 won / $2 lost = 1.5). A value above 1.0 means you are making money. I like to see a profit factor of 1.5 or higher.
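As a minimal sketch, assuming each closed trade's net P/L is available as a plain list of dollar amounts (the `profit_factor` name and the sample trades are hypothetical), the calculation looks like this:

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss (losses taken as a positive number)."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        return float("inf")  # no losing trades, so the ratio is unbounded
    return gross_profit / gross_loss


# Hypothetical trades: $3 won in total, $2 lost in total
print(profit_factor([2.0, 1.0, -1.5, -0.5]))  # 1.5
```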

### Average Profit Per Trade

Like the profit factor, the average profit per trade tells me whether a system is making enough money on each trade. When designing a trading system, I like to see an average profit per trade above $50 before commissions and slippage are deducted, at an absolute minimum. If the average net profit is above $75 after commissions and slippage are deducted, I feel even better. The higher the average profit per trade, the better.
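A sketch of the before-and-after-costs check, assuming a flat cost per round turn. The $30 figure below is the ES round-turn estimate discussed in the comments; the function name and trade values are hypothetical.

```python
def average_profit_per_trade(trade_pnls, cost_per_round_turn=0.0):
    """Average P/L per trade after subtracting a fixed round-turn cost from every trade."""
    if not trade_pnls:
        raise ValueError("need at least one trade")
    net = [p - cost_per_round_turn for p in trade_pnls]
    return sum(net) / len(net)


trades = [400.0, -150.0, 250.0, -100.0, 100.0]  # hypothetical gross P/L per trade
gross_avg = average_profit_per_trade(trades)
net_avg = average_profit_per_trade(trades, cost_per_round_turn=30.0)
print(gross_avg, net_avg)  # 100.0 70.0
```

With these made-up trades the system clears the $50 gross minimum but falls just short of the $75 net level.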

### Percent Winning Trades

I don't follow this one too closely. I make a note of it, but it's not all that important to me. Percent winning trades is simply the number of trades that generated a positive net profit divided by the total number of trades taken. This metric can matter if you don't like long strings of losers. For example, longer-term trend-following systems can be very profitable yet have a win rate of only 40% or less. Can you handle many losing trades? Maybe you are only comfortable with systems that produce more winning trades than losing trades. If so, a system with a win rate of 60% or higher would be better for you. Percent winning trades is a psychological tolerance indicator that will vary from person to person.
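To see how long the losing runs actually get, a sketch like the following computes both the win rate and the longest streak of non-winning trades. It assumes a chronological list of net P/L per trade; breakeven trades count as non-winners, matching the "positive net profit" definition above. Names and data are hypothetical.

```python
def win_rate(trade_pnls):
    """Fraction of trades with a positive net profit."""
    if not trade_pnls:
        raise ValueError("need at least one trade")
    winners = sum(1 for p in trade_pnls if p > 0)
    return winners / len(trade_pnls)


def longest_losing_streak(trade_pnls):
    """Longest run of consecutive trades that did not make money."""
    longest = current = 0
    for p in trade_pnls:
        current = current + 1 if p <= 0 else 0
        longest = max(longest, current)
    return longest


trades = [500, -100, -80, -120, 900, -60, -90, -110, -70, 300]  # hypothetical
print(win_rate(trades))               # 0.3
print(longest_losing_streak(trades))  # 4
```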

### Compounded Annual Growth Rate (CAGR)

This describes growth as if it were a steady, fixed rate of return. We all know trading never delivers such a smooth ride, but CAGR is a way to smooth your return over the trading period. Let's say your trading system produces a 5% CAGR over ten years. Over the same period, a bank CD also yields 5%. Does that make the CD the better investment? Maybe. Keep this in mind: the CAGR calculation does not consider the time your money is at risk. While the trading system may be returning a 5% CAGR over ten years, your money is active in the market for only a fraction of that time. Most of the time it sits idle in your brokerage or futures account, waiting for the next trading signal. Remember, the CD's 5% return is realized only if your money is locked away 100% of the time. With our example trading system, by contrast, the cash is free to be put to use in other instruments while it waits.
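CAGR is the constant annual rate that turns starting equity into ending equity over the test period, i.e. (ending / starting)^(1 / years) - 1. A minimal sketch, with hypothetical account values chosen so the result comes out at roughly the 5% of the example:

```python
def cagr(start_equity, end_equity, years):
    """Compounded annual growth rate: the constant yearly rate that turns start into end."""
    if start_equity <= 0 or years <= 0:
        raise ValueError("start_equity and years must be positive")
    return (end_equity / start_equity) ** (1.0 / years) - 1.0


# Hypothetical account: $100,000 growing to $162,889.46 over ten years
print(round(cagr(100_000, 162_889.46, 10), 4))  # 0.05
```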

### Risk-Adjusted Return (RAR)

This calculation takes into account the time your money is at risk in the market. It is computed by dividing the CAGR by exposure, where exposure is the percentage of time (over the test period) that your money was actively in the market. I like to see a value of 50% or better.
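A minimal sketch of the RAR arithmetic, assuming exposure is measured as the fraction of bars with an open position (one reasonable definition; your platform may measure it differently). Function names and numbers are hypothetical.

```python
def exposure(bars_in_market, total_bars):
    """Fraction of the test period with an open position (bar-count definition)."""
    return bars_in_market / total_bars


def risk_adjusted_return(cagr, exposure_fraction):
    """CAGR divided by exposure; both expressed as fractions."""
    if not 0 < exposure_fraction <= 1:
        raise ValueError("exposure must be a fraction in (0, 1]")
    return cagr / exposure_fraction


# Hypothetical: in the market on 252 of 2,520 daily bars (about one year out of ten)
exposure_frac = exposure(252, 2520)
print(exposure_frac, risk_adjusted_return(0.05, exposure_frac))  # 0.1 0.5
```

A 5% CAGR earned with only 10% exposure works out to a 50% RAR, right at the threshold mentioned above.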

### Maximum Intraday Drawdown and The Equity Curve

How large are those drawdowns? Can I mentally handle such a drawdown? Along these lines, I also look at the shape of the equity curve, which I think gives an excellent feel for a trading system. When I look at an equity curve, I ask: does it climb with shallow pullbacks, or does it have steep ones? Are there extended periods with no new equity highs? Ideally, the equity curve should rise over time, making new equity highs with only shallow pullbacks. I always try to imagine what it would be like to trade that system. Could I handle it?
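A sketch of the two questions above (how deep was the worst drawdown, and how long did the curve go without a new high), assuming a chronological list of equity values. For a true intraday drawdown the series would need bar-by-bar equity including open positions; with closed-trade equity alone the result may understate it. Names and numbers are hypothetical.

```python
def drawdown_stats(equity):
    """Return (max drawdown in dollars, longest stretch of points without a new equity high)."""
    if not equity:
        raise ValueError("need at least one equity value")
    peak = equity[0]
    max_dd = 0.0
    points_since_high = longest_flat = 0
    for value in equity[1:]:
        if value > peak:
            peak = value
            points_since_high = 0
        else:
            points_since_high += 1
            longest_flat = max(longest_flat, points_since_high)
        max_dd = max(max_dd, peak - value)  # distance below the running peak
    return max_dd, longest_flat


curve = [100_000, 102_000, 101_000, 98_500, 99_000, 103_000, 104_500, 103_800]  # hypothetical
print(drawdown_stats(curve))  # (3500, 3)
```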

### Net Profit / Drawdown

I like this metric, and I'll often use it as a fitness function as well, because it incorporates both profit and drawdown. We all want to make the most money with the smallest drawdown, and this metric captures that trade-off. The higher the number, the better.
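A minimal sketch, assuming drawdown is measured on the closed-trade equity curve starting from zero (one of several possible conventions). The helper name and trades are hypothetical.

```python
def net_profit_to_drawdown(trade_pnls):
    """Net profit divided by the largest peak-to-trough drop of closed-trade equity."""
    equity = peak = max_dd = 0.0
    for p in trade_pnls:
        equity += p
        peak = max(peak, equity)
        max_dd = max(max_dd, peak - equity)
    if max_dd == 0:
        return float("inf")  # no drawdown at all
    return equity / max_dd


trades = [500, -200, -300, 800, -100, 400]  # hypothetical
print(net_profit_to_drawdown(trades))  # 2.2
```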

### t-Test

This is one you don't see much of. The t-Test is a statistical test used to gauge how likely it is that your trading system's results occurred by chance alone. You would like to see a value greater than 1.6, which indicates the trading results are more likely not to be based on luck. A value below that suggests the results might be due to chance. The t-Test value should be calculated from no fewer than 30 trades. Below is the t-Test calculation.

t = square root(number of trades) * (average profit per trade / standard deviation of trade profits)
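A sketch of the formula above in Python. It assumes the sample standard deviation (dividing by N - 1), since the article does not say which one to use; the function name and trade list are hypothetical.

```python
import math
import statistics


def trade_t_stat(trade_pnls):
    """t = sqrt(N) * (average P/L per trade / standard deviation of trade P/L)."""
    n = len(trade_pnls)
    if n < 2:
        raise ValueError("need at least two trades to estimate a standard deviation")
    mean = statistics.mean(trade_pnls)
    sd = statistics.stdev(trade_pnls)  # sample standard deviation (divides by N - 1)
    return math.sqrt(n) * mean / sd


# Only 10 hypothetical trades for brevity; the text recommends at least 30.
trades = [250, -100, 400, -150, 300, -50, 100, -200, 350, 150]
t = trade_t_stat(trades)
print(round(t, 2), "looks better than chance" if t > 1.6 else "could be chance")
```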

### Expectancy

Expectancy is a concept described in Van Tharp's book "Trade Your Way to Financial Freedom." Expectancy tells you how much you can expect to make, on average, per dollar at risk. It is also a value you might optimize when testing different strategy input combinations. While computing the true expectancy of a trading system is beyond the scope of this article, it can be estimated with the following simple formula.

Expectancy = Average Net Profit Per Trade / | Average losing trade in dollars |

For those not too familiar with mathematics, the vertical lines around the “Average losing trade in dollars” indicate that the absolute value should be used. This means if the number is a negative value, we drop the negative sign, thus making the value positive.
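The estimate can be sketched as follows, assuming a list of net P/L per trade. The function name and numbers are hypothetical, and this is only the simple approximation above, not Van Tharp's full treatment.

```python
def expectancy(trade_pnls):
    """Average net profit per trade divided by the absolute size of the average losing trade."""
    losers = [p for p in trade_pnls if p < 0]
    if not losers:
        raise ValueError("no losing trades; this formula is undefined")
    avg_trade = sum(trade_pnls) / len(trade_pnls)
    avg_loss = abs(sum(losers) / len(losers))  # the |...| in the formula
    return avg_trade / avg_loss


trades = [600, -200, -200, 400, -200]  # hypothetical
print(expectancy(trades))  # 0.4
```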

### Expectancy Score

This value is an annualized expectancy that produces an objective number for comparing trading systems. In essence, the Expectancy Score factors “opportunity” into the value by taking into account how frequently the trading system produces trades. This allows you to compare very different trading systems. The higher the Expectancy Score, the more profitable the system.

Expectancy Score = Expectancy * Number of Trades * 365 / Number of strategy trading days
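A minimal sketch, assuming "number of strategy trading days" means the calendar days spanned by the test (which is what the 365 in the formula suggests). The two systems below are hypothetical and show how a lower-expectancy system that trades far more often can score higher.

```python
def expectancy_score(expectancy, num_trades, trading_days):
    """Expectancy * trades * 365 / days: expectancy scaled by how often the system trades."""
    if trading_days <= 0:
        raise ValueError("trading_days must be positive")
    return expectancy * num_trades * 365 / trading_days


# Hypothetical System A: higher expectancy, 100 trades over ten years (3,650 days)
score_a = expectancy_score(0.4, 100, 3650)
# Hypothetical System B: lower expectancy, 500 trades over five years (1,825 days)
score_b = expectancy_score(0.2, 500, 1825)
print(round(score_a, 2), round(score_b, 2))  # 4.0 20.0
```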

### Conclusion

With the above values, we can get a decent picture of how the system will perform. Of course, there are other values you could evaluate, and there is even more you can do, such as passing the historical trades through a Monte Carlo simulator. But the values discussed in this article are the essential ones I use when designing a system or evaluating a third-party trading system.

Hi Jeff,

You correctly put “number of trades” at the top of your list, because all the other metrics depend on the validity of your sample.

Unfortunately, you appear to follow a line of reasoning (as many others do) that I believe to be false: namely, giving importance to the length of the time span the sample is drawn from. You argued that 100 trades are good if they cover 10 years.

Well, 100 trades are 100 trades, regardless of the time span. So the important question is whether a sample of 100 is valid.

For an eye-opening read, I strongly recommend Kahneman, “Thinking, Fast and Slow”, p. 109ff.

The point he makes is that small samples can easily be impacted by (rare) outliers. This effect is in no way reduced simply because your sample covers a longer time period!

Just as an example: How valid are system results based on 1000 years of trades? Very valid?

Well, what if the rules are to buy the major stock index at the close of the last day of each century and sell at the open the following day? That would give you a whopping 10 trades to base your analysis on. Now how valid is that?

So how can we deal with the outlier problem?

My suggestion is to cut away the top x% of your winning trades (5%-10% seems reasonable to me) and examine how much your performance degrades (measured by whatever metrics you want to apply). This is called a sensitivity analysis, and it shows how much your total performance relies on a few great trades (likely outliers that will not repeat as often, or at all, in the future).
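A sketch of that sensitivity analysis, assuming a list of net P/L per trade. How many winners to drop for a given x% is a judgment call; here it rounds to the nearest whole trade but always drops at least one. Names and numbers are hypothetical, and net profit stands in for whatever metric you prefer.

```python
def drop_top_winners(trade_pnls, fraction):
    """Return the trades with the largest `fraction` of winners removed.

    Rounds to the nearest whole trade but drops at least one winner when any exist.
    """
    winners = sorted((p for p in trade_pnls if p > 0), reverse=True)
    if not winners or fraction <= 0:
        return list(trade_pnls)
    n_drop = max(1, round(len(winners) * fraction))
    remaining = list(trade_pnls)
    for p in winners[:n_drop]:
        remaining.remove(p)  # removes one occurrence of that value
    return remaining


trades = [1200, 150, 90, -80, 110, -60, 130, -100, 70, -50]  # hypothetical
trimmed = drop_top_winners(trades, 0.10)
print(sum(trades), sum(trimmed))  # 1460 260
```

Here a single outlier accounts for most of the profit: dropping it cuts net profit from $1,460 to $260.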

A second method is the t-test you mentioned. The only change I would make is to apply it to the risk-adjusted profit (or loss) per trade rather than to the absolute P/L. Basically, you divide each trade's result (in dollars) by its risk as defined by the initial stop (also in dollars). From those numbers you then calculate the average and standard deviation that go into the formula you show above.
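A sketch of TK's variant, assuming the initial-stop risk for each trade is known in dollars. Dividing P/L by that risk gives what Van Tharp calls R-multiples; the t-statistic is then computed on those values with the same formula as in the article (sample standard deviation assumed). Names and data are hypothetical.

```python
import math
import statistics


def r_multiples(trade_pnls, initial_risks):
    """Divide each trade's P/L by the dollar risk defined by its initial stop."""
    if len(trade_pnls) != len(initial_risks):
        raise ValueError("need one initial risk per trade")
    return [pnl / risk for pnl, risk in zip(trade_pnls, initial_risks)]


def t_stat(values):
    """Same t formula as in the article, applied to any list of per-trade values."""
    return math.sqrt(len(values)) * statistics.mean(values) / statistics.stdev(values)


pnls = [400, -200, 300, -100, 600, -250]  # hypothetical dollar P/L per trade
risks = [200, 200, 150, 100, 300, 250]    # hypothetical dollar risk at the initial stop
r = r_multiples(pnls, risks)
print(r)  # [2.0, -1.0, 2.0, -1.0, 2.0, -1.0]
print(round(t_stat(r), 2))
```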

Anyway, try to get a copy of that book, which I strongly recommend to all traders!

Cheers,

TK

@TK

I would generally agree with what you are saying about sample size. One question I was going to put to Jeff was whether, as with most statistical procedures carried over to the analysis of market data, a sample size larger than 10 is not necessary for the t-test?

Regarding your suggestion above, though (“cut away the top x% of your winning trades”), is this not dependent on the nature of the strategy in question? For instance, given a sample size of 100 and an extreme outlier-dependent strategy with a 10% win rate, then cutting the top 10% means disregarding one single trade. Though, over an average of 10,000 trades, the top X% may on average be only slightly more profitable than the top X+n%, in the case of the sample size of 100 the top X% trade (i.e. the one single trade selected), may be many multiples more profitable than the X+n% remaining.

In other words, what you describe, with certain types of systems and without a very large sample size, risks creating a counter-productive “black swan” style exclusion that is not representative of the intended effect of this process. Instead of mitigating the impact of outliers, you risk introducing a further outlier counter-measure.

Would be interested to hear your thoughts on this if I have explained myself well enough.

Regards,

BlueHorseshoe

Of course, above should have read X-n% – there’s no “edit” button like on the forums!

TK,

Thanks again for the thoughtful reply. So sorry for getting back to this thread days later. It was a hectic week last week. I’ve heard a lot of good things about “Thinking Fast and Slow”. I’m currently reading “The Big Short” and will add your recommendation to my reading list.

@BH

I am not sure if I understood your comment correctly, so forgive me if my answer should not match what you meant.

The purpose of the “cutting procedure” is to find out to what degree the test results were impacted by outliers. If there is a big impact then you have a high risk that the results of the SAMPLE (test) are NOT representative of the real life performance later on.

> “…is this not dependent on the nature of the strategy in question?”

> “For instance, given a sample size of 100 and an extreme outlier-dependent strategy with a 10% win rate, then cutting the top 10% means disregarding one single trade. Though, over an average of 10,000 trades, the top X% may on average be only slightly more profitable than the top X+n%, in the case of the sample size of 100 the top X% trade (i.e. the one single trade selected), may be many multiples more profitable than the X+n% remaining.”

> “In other words, what you describe, with certain types of systems and without a very large sample size, risks creating a counter-productive “black swan” style exclusion that is not representative of the intended effect of this process. Instead of mitigating the impact of outliers, you risk introducing a further outlier counter-measure.”

I am not sure what you meant by this paragraph. What I am talking about is eliminating those trades from the statistical evaluation. Of course, you should NOT introduce any rules into your system that cut "home runs" short if they really do happen. My point is to see whether the system holds up if excellent trades turn out to be a lot less frequent (in real life) than the sample may lead you to believe.

Looking forward to your reply,

TK

@Jeff

Any comment regarding the lookback period vs. sample size? By the way: the system I described is called "Millenium Bull" and available for a few Galactic Credits at http://www.JabbaTheHut.com =:p

Sorry, there appears to be a problem. I'll try to post the missing section now:

” …is this not dependent on the nature of the strategy in question?”

Actually, this procedure REVEALS the nature of the system. Is it made up of evenly sized profits (small impact of outliers => smooth equity curve with little variance) or a few “home runs” (big impact of outliers => rugged equity curve with high variance)?

TK

P.S. Jeff, feel free to delete the first repost.

What about assessing OOS (out-of-sample) performance, or verifying whether any OOS testing was done at all?

While this article does not talk specifically about out-of-sample vs. in-sample, the same metrics apply. OOS performance won't always be available when you are looking to buy a system; however, it is an important step in testing. If you are developing a system, you should always test on OOS data, as it gives you a better idea of how your system will perform. The next step is to test it on live data. Many people are shocked to see their system fail to perform in the live market as bars form in real time. Often this is due to an incomplete understanding of how bars are built tick by tick and how the trading code is executed against that data. This is especially important for intraday trading systems.
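A minimal sketch of a chronological in-sample/out-of-sample split, assuming trades are available as (exit date, net P/L) pairs. The 75/25 split, function names and trades are hypothetical; the point is only that out-of-sample trades come strictly after the in-sample ones and are evaluated with the same metrics.

```python
from datetime import date


def split_in_out_of_sample(trades, in_sample_fraction=0.75):
    """Chronological split of (exit_date, pnl) pairs: earliest trades in-sample, rest out-of-sample."""
    ordered = sorted(trades, key=lambda trade: trade[0])
    cut = round(len(ordered) * in_sample_fraction)
    return ordered[:cut], ordered[cut:]


def profit_factor(pnls):
    gains = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    return gains / losses if losses else float("inf")


trades = [  # hypothetical (exit date, net P/L) pairs
    (date(2016, 2, 3), 450.0), (date(2016, 6, 14), -200.0),
    (date(2017, 1, 9), 300.0), (date(2017, 8, 22), -150.0),
    (date(2018, 3, 5), 500.0), (date(2018, 11, 19), -250.0),
    (date(2019, 4, 2), 200.0), (date(2019, 10, 7), -300.0),
]
in_sample, out_of_sample = split_in_out_of_sample(trades)
print(round(profit_factor([p for _, p in in_sample]), 2))      # 2.08
print(round(profit_factor([p for _, p in out_of_sample]), 2))  # 0.67
```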

Why do you assume $30 round-turn execution costs? That seems too high to me; please excuse my ignorance if I missed something.

For ES, you add two ticks of slippage per trade (one on the buy and one on the sell); at $12.50 per tick, that's $25 right there. Then you add $5.00 for commission. That's $30 per trade in execution costs.
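The same arithmetic as a tiny sketch, so other contracts or cost assumptions can be plugged in. The only fact added here is the ES tick value (0.25 index points at $50 per point, i.e. $12.50); the function name is hypothetical.

```python
ES_TICK_VALUE = 12.50  # one ES tick: 0.25 index points x $50 per point


def round_turn_cost(slippage_ticks, tick_value, commission):
    """Slippage (in ticks) priced at tick_value plus a flat round-turn commission."""
    return slippage_ticks * tick_value + commission


print(round_turn_cost(slippage_ticks=2, tick_value=ES_TICK_VALUE, commission=5.00))  # 30.0
```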