Source: Harvey, Liu & Zhu (2016) · Review of Financial Studies · DOI: 10.1093/rfs/hhv059
TL;DR
With 300+ published factors, the conventional significance hurdle (t-statistic > 2.0) is far too lenient: at that cutoff, a large number of false discoveries is to be expected. Accounting for multiple testing, the authors argue that a newly proposed factor should clear a t-statistic of roughly 3.0 (a two-sided p-value of about 0.27% under a normal approximation), and that the bar should keep rising as more factors are tested. Much of the "factor zoo" would not survive.
The problem it addresses
Empirical asset pricing has a multiple-testing problem. Over several decades, researchers have tested a very large number of candidate predictors and published the ones that cleared t > 2. But a single test of a true null exceeds |t| = 2 about 4.6% of the time, so when enough tests are run, some will clear the bar purely by chance. The single-test threshold ignores the number of factors tried (including the unpublished, unreported ones), so the published cross-section is likely to contain many false positives: the "factor zoo."
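To make the arithmetic behind this concrete, here is a minimal sketch of how quickly chance "discoveries" accumulate. It assumes independent tests, true nulls throughout, and a normal approximation to the t-statistic; the function name is illustrative and not taken from the paper.

```python
from statistics import NormalDist


def prob_at_least_one_false_positive(n_tests: int, t_cutoff: float = 2.0) -> float:
    """P(at least one |t| > t_cutoff) across n_tests independent true nulls.

    Assumes each t-statistic is approximately standard normal under the null.
    """
    # Two-sided tail probability for a single test (about 4.6% at t = 2.0).
    p_single = 2.0 * (1.0 - NormalDist().cdf(t_cutoff))
    # Complement of "no test exceeds the cutoff".
    return 1.0 - (1.0 - p_single) ** n_tests


if __name__ == "__main__":
    for n in (1, 10, 50, 300):
        print(f"{n:>4} tests -> P(any false positive) = "
              f"{prob_at_least_one_false_positive(n):.3f}")
```

Real factor tests are correlated, so the exact numbers will differ, but the direction is the same: the more tests run, the less a lone t > 2 means.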
Main findings
The 2.0 hurdle is obsolete. Under standard multiple-testing corrections, a t-stat of 2.0 corresponds to an unacceptably high false-discovery rate given the number of factors tested.
New threshold ≈ 3.0. After multiple-testing adjustment, a newly discovered factor needs a t-statistic of around 3.0 to be credible, and the required hurdle increases over time as the count of tested factors grows (the authors' estimates are already above 3.0 and rise to roughly 3.4 in later years).
Most published factors fail. A large fraction of the 300+ documented factors would not clear the corrected bar; many are likely spurious.
The true test count is understated. Because failed tests go unpublished (the file-drawer problem), the real number of trials — and thus the proper hurdle — is even higher than the published record implies.
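The link between the number of tests and the required t-statistic can be sketched with the simplest correction, Bonferroni. This is only an illustration under a normal approximation, not the authors' estimation procedure; the `hidden_multiple` values (total tests tried per published factor) are hypothetical.

```python
from statistics import NormalDist


def bonferroni_t_hurdle(n_tests: int, alpha: float = 0.05) -> float:
    """Two-sided t-statistic hurdle when alpha is split evenly over n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    per_test_alpha = alpha / n_tests
    # Normal quantile that leaves per_test_alpha / 2 in each tail.
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


if __name__ == "__main__":
    n_published = 300  # rough count of published factors cited in the summary
    for hidden_multiple in (1, 2, 5):  # hypothetical file-drawer ratios
        n_total = n_published * hidden_multiple
        print(f"{n_total:>5} tests -> hurdle t = {bonferroni_t_hurdle(n_total):.2f}")
```

With a single test the hurdle is the familiar 1.96; every unreported test pushes it higher, which is why the file drawer matters.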
Methodology
Compile a history of 300+ published factors with their reported t-statistics and publication dates.
Apply three multiple-testing frameworks: Bonferroni and Holm (control the family-wise error rate) and Benjamini-Hochberg-Yekutieli (control the false-discovery rate).
Translate each into a time-varying t-statistic hurdle as a function of the cumulative number of factors tested, adjusting for the unobserved file-drawer of unpublished tests.
Re-evaluate the published factors against the corrected thresholds.
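The Holm and Benjamini-Hochberg-Yekutieli steps above can be sketched in a few lines. This is a textbook version of the two procedures applied to made-up t-statistics, with p-values from a normal approximation; it does not reproduce the paper's file-drawer adjustment, and all names and inputs are hypothetical.

```python
from statistics import NormalDist


def two_sided_p(t: float) -> float:
    """Two-sided p-value for a t-statistic, using a normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def holm_reject(pvals: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank is 0-based
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails
    return reject


def bhy_reject(pvals: list[float], alpha: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg-Yekutieli step-up: controls the false-discovery rate."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))  # penalty for arbitrary dependence
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (m * c_m):
            k_max = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    t_stats = [4.5, 3.6, 3.1, 2.6, 2.2, 2.0]  # hypothetical reported t-statistics
    pvals = [two_sided_p(t) for t in t_stats]
    print("single test:", [p < 0.05 for p in pvals])
    print("Holm:       ", holm_reject(pvals))
    print("BHY:        ", bhy_reject(pvals))
```

In this toy example all six factors pass the single-test 5% cutoff, while both corrections drop the two with t-statistics of 2.2 and 2.0.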
Implications for factor investing
Demand t ≈ 3.0+, not 2.0, for any newly claimed factor — and treat marginal (t between 2 and 3) "discoveries" as probably noise.
Out-of-sample validation is non-negotiable. Multiple-testing math says in-sample significance is cheap; only genuine out-of-sample performance (on data not used to find the signal) is persuasive — the principle ConvexPi's hidden evaluation period operationalizes.
Account for your own search. If you scan many signals, your personal hurdle must rise accordingly; report how many you tried, not just the winner.
Be skeptical of the zoo. Pair this with McLean-Pontiff (2016): even "real" factors decay post-publication, and many published ones were never real to begin with.
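The "account for your own search" point can be checked by simulation. The sketch below scans pure-noise signals and records how often the best one still clears |t| > 2; the signal count, sample length, and trial count are arbitrary assumptions, and the returns are i.i.d. standard normal draws rather than real data.

```python
import random
from statistics import mean, stdev


def t_stat(returns: list[float]) -> float:
    """t-statistic of the sample mean against zero."""
    n = len(returns)
    return mean(returns) / (stdev(returns) / n ** 0.5)


def best_abs_t_from_noise(n_signals: int, n_months: int, rng: random.Random) -> float:
    """Largest |t| among n_signals strategies whose returns are pure noise."""
    return max(
        abs(t_stat([rng.gauss(0.0, 1.0) for _ in range(n_months)]))
        for _ in range(n_signals)
    )


def share_of_searches_with_spurious_winner(
    n_signals: int,
    n_months: int = 60,
    n_trials: int = 100,
    t_cutoff: float = 2.0,
    seed: int = 0,
) -> float:
    """Fraction of simulated searches whose best noise signal beats t_cutoff."""
    rng = random.Random(seed)
    hits = sum(
        best_abs_t_from_noise(n_signals, n_months, rng) > t_cutoff
        for _ in range(n_trials)
    )
    return hits / n_trials


if __name__ == "__main__":
    for n_signals in (1, 10, 50):
        share = share_of_searches_with_spurious_winner(n_signals)
        print(f"{n_signals:>3} signals scanned -> spurious 'winner' in {share:.0%} of searches")
```

Reporting only the winner hides `n_signals`, which is exactly the number a reader needs to judge the result.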
Key references
Harvey, C., Liu, Y. & Zhu, H. (2016) — …and the Cross-Section of Expected Returns — Review of Financial Studies — DOI: 10.1093/rfs/hhv059
Harvey, C. & Liu, Y. (2020) — False (and Missed) Discoveries in Financial Economics — Journal of Finance
McLean, R. D. & Pontiff, J. (2016) — Does Academic Research Destroy Stock Return Predictability? — Journal of Finance
Hou, K., Xue, C. & Zhang, L. (2020) — Replicating Anomalies — Review of Financial Studies
Chen, A. & Zimmermann, T. (2022) — Open Source Cross-Sectional Asset Pricing — Critical Finance Review