The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality

Source: Bailey, D. H. & López de Prado, M. (2014) · Journal of Portfolio Management 40(5), 94–107

TL;DR

Proposes the Deflated Sharpe Ratio (DSR), which adjusts an observed Sharpe ratio for three things standard tests ignore: the number of strategy configurations tried (selection bias / multiple testing), the length of the backtest, and non-normality (skewness and kurtosis) of returns. The DSR asks whether a Sharpe is genuinely significant or merely the maximum of many lucky trials, and returns the probability that the true Sharpe exceeds the level expected from the best of many zero-skill trials, rather than merely exceeding zero.

Problem it solves

With big data and cheap compute, researchers can backtest millions of strategy variants and report only the best. The winning Sharpe is then an order statistic (a maximum), not a single random draw, so conventional significance tests, which take one Sharpe at face value, vastly overstate confidence. This is the quantitative mechanism behind backtest overfitting, and it is compounded by selection bias (only positive results get reported).

The method

Builds on the Probabilistic Sharpe Ratio (PSR) (Bailey & López de Prado 2012), which accounts for the sample length T and the first four moments (incorporating skewness γ̂₃ and kurtosis γ̂₄ into the Sharpe's standard error).
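
A minimal sketch of the PSR statistic, assuming the standard form from the 2012 paper: PSR(SR*) = Z[(SR̂ − SR*)·√(T−1) / √(1 − γ̂₃·SR̂ + (γ̂₄−1)/4·SR̂²)]. The function and parameter names are illustrative, the Sharpe ratios are assumed to be per-period (non-annualized, at the same frequency as the T observations), and kurtosis is raw kurtosis (3 for a normal distribution).

```python
from math import sqrt
from statistics import NormalDist


def probabilistic_sharpe_ratio(sr_hat, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds sr_benchmark.

    sr_hat, sr_benchmark: per-period (non-annualized) Sharpe ratios.
    n_obs: number of return observations T.
    skew, kurt: skewness and raw kurtosis of returns (kurt=3.0 is normal).
    """
    # Variance term of the Sharpe estimator, adjusted for skewness and kurtosis.
    variance_term = 1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2
    if n_obs < 2 or variance_term <= 0.0:
        raise ValueError("need n_obs >= 2 and a positive variance term")
    z = (sr_hat - sr_benchmark) * sqrt(n_obs - 1) / sqrt(variance_term)
    return NormalDist().cdf(z)
```

For a positive Sharpe, negative skewness and fat tails both enlarge the variance term, so the same observed Sharpe earns a lower PSR than it would under normal returns.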

Derives the expected maximum Sharpe under the null of zero skill across N independent trials via extreme value theory. The benchmark grows with N and with the dispersion of the trials, approximately:

E[max SR] ≈ √Var{SR} · [ (1−γ)·Z⁻¹(1 − 1/N) + γ·Z⁻¹(1 − 1/(N·e)) ],

where γ ≈ 0.5772 is the Euler–Mascheroni constant, Z⁻¹ the inverse standard-normal CDF, e Euler's number, and Var{SR} the variance of the Sharpe ratios across the trials.
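
The approximation above can be sketched directly in code. This is an illustrative implementation under the independence assumption, using only the standard library; the function name and argument names are assumptions, not taken from the paper.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials, sr_variance):
    """Approximate E[max SR] across n_trials independent zero-skill trials.

    sr_variance is the variance of the Sharpe ratios across the trials.
    """
    if n_trials < 2:
        # With N = 1 the quantile 1 - 1/N is 0 and the inverse CDF is undefined.
        raise ValueError("n_trials must be at least 2")
    if sr_variance < 0.0:
        raise ValueError("sr_variance must be non-negative")
    inv_cdf = NormalDist().inv_cdf
    g = EULER_MASCHERONI
    return sqrt(sr_variance) * (
        (1.0 - g) * inv_cdf(1.0 - 1.0 / n_trials)
        + g * inv_cdf(1.0 - 1.0 / (n_trials * e))
    )


if __name__ == "__main__":
    # The benchmark rises with the number of trials for a fixed dispersion.
    for n in (10, 100, 1000):
        print(n, expected_max_sharpe(n, sr_variance=1.0))
```

The result scales linearly with the standard deviation of the trial Sharpe ratios, so understating either N or Var{SR} lowers the hurdle.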

DSR = PSR evaluated against the raised threshold SR₀ = E[max SR] instead of zero: the probability that the true Sharpe beats the level achievable by chance after N trials. The observed SR is thus deflated using five inputs beyond the Sharpe itself: N, Var{SR}, T, skewness, and kurtosis.
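
Putting the two pieces together, a hedged end-to-end sketch of the DSR might look as follows. It assumes per-period Sharpe ratios, raw kurtosis (3 for normal returns), and independent trials; all names are illustrative rather than the authors' code.

```python
from math import e, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials, sr_variance):
    """Approximate E[max SR] across n_trials independent zero-skill trials."""
    if n_trials < 2 or sr_variance < 0.0:
        raise ValueError("need n_trials >= 2 and sr_variance >= 0")
    g = EULER_MASCHERONI
    return sqrt(sr_variance) * (
        (1.0 - g) * _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + g * _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    )


def deflated_sharpe_ratio(sr_hat, n_obs, skew, kurt, n_trials, sr_variance):
    """PSR evaluated against the threshold SR0 = E[max SR] instead of zero."""
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    variance_term = 1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2
    if n_obs < 2 or variance_term <= 0.0:
        raise ValueError("need n_obs >= 2 and a positive variance term")
    z = (sr_hat - sr0) * sqrt(n_obs - 1) / sqrt(variance_term)
    return _NORMAL.cdf(z)


if __name__ == "__main__":
    # Hypothetical inputs: daily Sharpe 0.1 over 1250 days, best of 100 trials
    # whose Sharpe ratios have a standard deviation of 0.05 (variance 0.0025).
    print(deflated_sharpe_ratio(0.1, 1250, 0.0, 3.0, 100, 0.0025))
```

With these hypothetical inputs the observed Sharpe sits below the chance threshold SR₀, so the DSR falls below 0.5 even though the same Sharpe tested against zero would look highly significant.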

Also yields a Minimum Track Record Length (MinTRL) / minimum backtest length: the number of observations needed to establish that the Sharpe exceeds a benchmark at a chosen confidence level.
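
MinTRL follows from inverting the PSR for T. A sketch, assuming the form MinTRL = 1 + [1 − γ̂₃·SR̂ + (γ̂₄−1)/4·SR̂²]·(Z_α / (SR̂ − SR*))², with per-period Sharpe ratios and raw kurtosis; the names are illustrative. To account for multiple trials, the benchmark SR* can be set to the expected maximum Sharpe rather than zero.

```python
from statistics import NormalDist


def min_track_record_length(sr_hat, sr_benchmark, skew=0.0, kurt=3.0,
                            confidence=0.95):
    """Observations needed for PSR(sr_benchmark) to reach `confidence`.

    sr_hat, sr_benchmark: per-period (non-annualized) Sharpe ratios.
    skew, kurt: skewness and raw kurtosis of returns (kurt=3.0 is normal).
    """
    if sr_hat <= sr_benchmark:
        # No track record is long enough if the estimate does not beat the benchmark.
        raise ValueError("sr_hat must exceed sr_benchmark")
    if not 0.5 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0.5 and 1")
    z_alpha = NormalDist().inv_cdf(confidence)
    variance_term = 1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2
    return 1.0 + variance_term * (z_alpha / (sr_hat - sr_benchmark)) ** 2
```

The required length grows with the inverse square of the gap between the observed Sharpe and the benchmark, which is why a raised benchmark after many trials demands a much longer backtest.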

Assumptions & inputs

Inputs: the candidate Sharpe, return series length T, return skewness and kurtosis, the number of (effectively independent) trials N, and the variance of the trial Sharpe ratios.
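
The return-based inputs can be estimated from the candidate's return series. The sketch below is one simple way to do so, assuming per-period excess returns and population (biased) moment estimators; both choices are assumptions made for illustration, and the helper name is hypothetical.

```python
from math import sqrt


def sharpe_inputs(returns):
    """Return (sr_hat, n_obs, skew, kurt) from per-period excess returns.

    sr_hat is the per-period (non-annualized) Sharpe ratio; kurt is raw
    kurtosis (3.0 for a normal distribution), not excess kurtosis.
    """
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(returns) / n
    deviations = [r - mean for r in returns]
    # Population central moments (divide by n), a simplifying assumption.
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    if m2 == 0.0:
        raise ValueError("returns have zero variance")
    sr_hat = mean / sqrt(m2)
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return sr_hat, n, skew, kurt
```

Keeping the Sharpe at the sampling frequency of the returns matters: an annualized Sharpe combined with a daily observation count T would overstate significance.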

The expected-max formula assumes independent trials; an appendix extends N to the non-independent case (clustering correlated trials into an effective count).
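
To illustrate why correlation shrinks the trial count, here is a deliberately crude heuristic that interpolates linearly between one effective trial (perfect correlation) and all trials (zero correlation), using the average pairwise correlation. This is an assumption chosen for illustration, much simpler than clustering, and should not be read as the procedure in the paper's appendix.

```python
def effective_trials(n_trials, avg_correlation):
    """Crude effective number of independent trials (illustrative heuristic).

    Interpolates between 1 (avg_correlation = 1) and n_trials
    (avg_correlation = 0). Negative average correlations are not handled.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if not 0.0 <= avg_correlation <= 1.0:
        raise ValueError("avg_correlation must lie in [0, 1]")
    return avg_correlation + (1.0 - avg_correlation) * n_trials
```

The resulting count would replace N in the expected-maximum formula; because that formula grows slowly in N, even a rough estimate changes the threshold less than an error in Var{SR} of similar relative size.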

How to use it

Report N — the number of configurations tried — and deflate the Sharpe accordingly; the winner of a large search needs a far higher raw Sharpe to be credible.

Prefer longer backtests; be skeptical of short, heavily optimized ones (use MinTRL).

The deflation threshold plays a role analogous to the multiple-testing cutoff of Harvey & Liu (2014) / Harvey, Liu & Zhu (2016); the two are complementary.

Limitations & pitfalls

Requires honest knowledge of N and the trial Sharpe variance — usually unobserved and easy to understate, biasing DSR upward.

Estimating the effective number of independent trials under correlated strategies is itself nontrivial.

Corrects for selection bias and non-normality, but not for other backtest pathologies (look-ahead, survivorship, costs).

Key references

Bailey, D. & López de Prado, M. (2014) — The Deflated Sharpe Ratio — Journal of Portfolio Management

Bailey, D. & López de Prado, M. (2012) — The Sharpe Ratio Efficient Frontier (Probabilistic Sharpe Ratio) — Journal of Risk

Harvey, C., Liu, Y. & Zhu, H. (2016) — … and the Cross-Section of Expected Returns — Review of Financial Studies

Provenance: verified/generated from the paper's full text.