The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality
Source: Bailey, D. H. & López de Prado, M. (2014) · Journal of Portfolio Management 40(5), 94–107
TL;DR
Proposes the Deflated Sharpe Ratio (DSR), which adjusts an observed Sharpe ratio for three things standard tests ignore: the number of strategy configurations tried (selection bias under multiple testing), the length of the backtest, and non-normality of returns (skewness and kurtosis). The DSR asks whether a Sharpe is genuinely significant or merely the maximum of many lucky trials, and returns the probability that the true Sharpe exceeds the best Sharpe expected by chance from that many zero-skill trials.
Problem it solves
With big data and cheap compute, researchers backtest millions of strategy variants and report the best. The winning Sharpe is then an order statistic (a maximum), not a single random draw, so conventional significance tests, which take one Sharpe at face value, vastly overstate confidence. This is the quantitative mechanism behind backtest overfitting, and selection bias compounds it because only positive results get reported.
The method
Builds on the Probabilistic Sharpe Ratio (PSR) (Bailey & López de Prado 2012), which accounts for the sample length T and the first four moments (incorporating skewness γ̂₃ and kurtosis γ̂₄ into the Sharpe's standard error).
Derives the expected maximum Sharpe under the null of zero skill across N independent trials via extreme value theory. The benchmark grows with N and with the dispersion of the trials, approximately:
E[max SR] ≈ √Var{SR} · ( (1 − γ) · Z⁻¹[1 − 1/N] + γ · Z⁻¹[1 − 1/(N·e)] )
where γ ≈ 0.5772 is the Euler–Mascheroni constant, Z⁻¹ the inverse standard-normal CDF, e Euler's number, and Var{SR} the variance of the Sharpe ratios across the trials.
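A minimal sketch of the expected-maximum benchmark described above, using only the Python standard library. The function and parameter names are illustrative (not from the paper), and the sketch assumes per-period (non-annualized) Sharpe ratios, independent trials, and N ≥ 2.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def expected_max_sharpe(n_trials: int, var_sr: float) -> float:
    """Approximate E[max SR] over n_trials independent zero-skill trials.

    Implements sqrt(Var{SR}) * ((1 - g) * Zinv[1 - 1/N] + g * Zinv[1 - 1/(N*e)]).
    var_sr is the variance of the trial Sharpe ratios (assumed per-period).
    """
    if n_trials < 2:
        # With N = 1 the first quantile is Zinv[0], which is undefined.
        raise ValueError("n_trials must be at least 2")
    if var_sr < 0:
        raise ValueError("var_sr must be non-negative")
    z_inv = NormalDist().inv_cdf  # inverse standard-normal CDF
    return sqrt(var_sr) * (
        (1 - EULER_GAMMA) * z_inv(1 - 1 / n_trials)
        + EULER_GAMMA * z_inv(1 - 1 / (n_trials * e))
    )


if __name__ == "__main__":
    # Hypothetical dispersion of trial Sharpe ratios: std 0.02 per period.
    for n in (10, 100, 1000):
        print(n, expected_max_sharpe(n, var_sr=0.02**2))
```

The threshold scales linearly with the standard deviation of the trial Sharpe ratios and grows slowly (roughly like the square root of log N) with the number of trials.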
DSR = PSR evaluated against this deflated threshold SR₀ = E[max SR] instead of zero: the probability that the true Sharpe beats the level achievable by chance after N trials. It thus deflates SR using five extra inputs: N, Var{SR}, T, skewness, and kurtosis.
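The relationship between PSR and DSR can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the names are hypothetical, Sharpe ratios are per-period, `kurt` is the raw (non-excess) kurtosis so a normal distribution has `kurt=3`, and trials are treated as independent.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, var_sr):
    """Approximate E[max SR] across n_trials independent zero-skill trials."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    return sqrt(var_sr) * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """P(true SR > sr_benchmark), given an observed SR over n_obs returns."""
    # Skewness and kurtosis enter through the Sharpe ratio's standard error.
    variance_term = 1 - skew * sr + (kurt - 1) / 4 * sr**2
    if n_obs < 2 or variance_term <= 0:
        raise ValueError("need n_obs >= 2 and a positive variance term")
    z = (sr - sr_benchmark) * sqrt(n_obs - 1) / sqrt(variance_term)
    return _NORMAL.cdf(z)


def deflated_sharpe(sr, n_obs, skew, kurt, n_trials, var_sr):
    """PSR evaluated at SR0 = E[max SR] instead of at zero."""
    sr0 = expected_max_sharpe(n_trials, var_sr)
    return probabilistic_sharpe(sr, sr0, n_obs, skew, kurt)


if __name__ == "__main__":
    # Hypothetical inputs: daily SR of 0.1 over 1250 days, best of 100 trials.
    print(deflated_sharpe(0.1, 1250, skew=0.0, kurt=3.0,
                          n_trials=100, var_sr=0.02**2))
```

Keeping `probabilistic_sharpe` separate makes the deflation explicit: the only difference between PSR against zero and DSR is the benchmark passed in.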
Also yields a Minimum Track Record Length (MinTRL): the minimum backtest length needed to establish significance at a chosen confidence level.
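A hedged sketch of the track-record-length calculation, obtained by solving the PSR expression for the number of observations at which PSR reaches the chosen confidence. The function name and defaults are assumptions for illustration; Sharpe ratios are per-period and `kurt` is raw kurtosis (3 for a normal distribution).

```python
from math import sqrt
from statistics import NormalDist


def min_track_record_length(sr, sr_benchmark, skew=0.0, kurt=3.0,
                            confidence=0.95):
    """Observations needed for PSR(sr_benchmark) to reach `confidence`.

    MinTRL = 1 + [1 - skew*SR + (kurt - 1)/4 * SR^2] * (z / (SR - SR*))^2,
    where z is the standard-normal quantile at `confidence`.
    Returns a float; round up to get a whole number of observations.
    """
    if sr <= sr_benchmark:
        raise ValueError("observed SR must exceed the benchmark")
    if not 0.5 < confidence < 1.0:
        raise ValueError("confidence must lie in (0.5, 1)")
    variance_term = 1 - skew * sr + (kurt - 1) / 4 * sr**2
    z_crit = NormalDist().inv_cdf(confidence)
    return 1 + variance_term * (z_crit / (sr - sr_benchmark)) ** 2


if __name__ == "__main__":
    # Hypothetical daily SR of 0.1 tested against zero and against 0.05.
    print(min_track_record_length(0.1, 0.0))
    print(min_track_record_length(0.1, 0.05))
```

Passing the deflated threshold SR₀ as `sr_benchmark` gives the length needed to clear the multiple-testing hurdle; the required length grows quickly as the observed Sharpe approaches the benchmark.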
Assumptions & inputs
Inputs: the candidate Sharpe, return series length T, return skewness and kurtosis, the number of (effectively independent) trials N, and the variance of the trial Sharpe ratios.
The expected-max formula assumes independent trials; an appendix extends N to the non-independent case (clustering correlated trials into an effective count).
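The inputs listed above can be estimated from data along these lines. This is a sketch under assumptions not stated in the summary: returns are already excess returns per period, the Sharpe ratio uses the sample standard deviation (n − 1 divisor), skewness and kurtosis use simple population moments, and the trial variance is the sample variance of the trial Sharpe ratios. All names are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance
from typing import NamedTuple, Sequence


class SharpeInputs(NamedTuple):
    sharpe: float  # per-period (non-annualized) Sharpe ratio
    n_obs: int     # return series length T
    skew: float    # third standardized moment
    kurt: float    # fourth standardized moment (3 for a normal distribution)


def sharpe_inputs(excess_returns: Sequence[float]) -> SharpeInputs:
    """Estimate the candidate Sharpe, T, skewness, and kurtosis."""
    n = len(excess_returns)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = fmean(excess_returns)
    dev = [r - mean for r in excess_returns]
    m2 = fmean([d**2 for d in dev])  # population central moments
    m3 = fmean([d**3 for d in dev])
    m4 = fmean([d**4 for d in dev])
    if m2 == 0:
        raise ValueError("returns have zero variance")
    sample_std = sqrt(m2 * n / (n - 1))
    return SharpeInputs(mean / sample_std, n, m3 / m2**1.5, m4 / m2**2)


def trial_sharpe_variance(trial_sharpes: Sequence[float]) -> float:
    """Sample variance of the Sharpe ratios across all trials run."""
    return variance(trial_sharpes)


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape.
    print(sharpe_inputs([-0.01, 0.0, 0.01, 0.02, 0.03]))
    print(trial_sharpe_variance([0.0, 0.1, 0.2]))
```

The trial variance must be computed over every configuration tried, not only the ones that survived, otherwise the deflation threshold is understated.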
How to use it
Report N — the number of configurations tried — and deflate the Sharpe accordingly; the winner of a large search needs a far higher raw Sharpe to be credible.
Prefer longer backtests; be skeptical of short, heavily optimized ones (use MinTRL).
The deflation threshold plays a role analogous to the multiple-testing cutoff of Harvey & Liu (2014) / Harvey, Liu & Zhu (2016); the two are complementary.
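To make the first point above concrete, the following sketch searches for the smallest observed Sharpe whose DSR reaches a chosen confidence, as a function of N. It is an illustration only: it assumes normal returns (skewness 0, kurtosis 3), independent trials, and per-period Sharpe ratios, and uses plain bisection; the names and the example inputs are hypothetical.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, var_sr):
    """Approximate E[max SR] across n_trials independent zero-skill trials."""
    return sqrt(var_sr) * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def psr_normal(sr, sr_benchmark, n_obs):
    """PSR under normal returns: the variance term reduces to 1 + SR^2 / 2."""
    z = (sr - sr_benchmark) * sqrt(n_obs - 1) / sqrt(1 + sr**2 / 2)
    return _NORMAL.cdf(z)


def min_credible_sharpe(n_trials, var_sr, n_obs, confidence=0.95):
    """Smallest observed SR with DSR >= confidence, found by bisection."""
    sr0 = expected_max_sharpe(n_trials, var_sr)
    lo, hi = sr0, sr0 + 10.0  # PSR is 0.5 at lo; hi is a generous upper bracket
    if psr_normal(hi, sr0, n_obs) < confidence:
        raise ValueError("backtest too short to reach this confidence")
    for _ in range(100):
        mid = (lo + hi) / 2
        if psr_normal(mid, sr0, n_obs) < confidence:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Hypothetical setting: 1250 daily returns, trial Sharpe std of 0.02.
    for n in (10, 100, 1000, 10000):
        print(n, min_credible_sharpe(n, var_sr=0.02**2, n_obs=1250))
```

Bisection is used here because, under the normal-returns assumption with a non-negative threshold, PSR is increasing in the observed Sharpe; with skewed returns that monotonicity is not guaranteed and the search would need more care.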
Limitations & pitfalls
Requires honest knowledge of N and the trial Sharpe variance — usually unobserved and easy to understate, biasing DSR upward.
Estimating the effective number of independent trials under correlated strategies is itself nontrivial.
Corrects for selection bias and non-normality, but not for other backtest pathologies (look-ahead, survivorship, costs).
Key references
Bailey, D. & López de Prado, M. (2014) — The Deflated Sharpe Ratio — Journal of Portfolio Management
Bailey, D. & López de Prado, M. (2012) — The Sharpe Ratio Efficient Frontier (Probabilistic Sharpe Ratio) — Journal of Risk
Harvey, C., Liu, Y. & Zhu, H. (2016) — … and the Cross-Section of Expected Returns — Review of Financial Studies
Provenance: verified/generated from the paper's full text.