…and the Cross-Section of Expected Returns

Source: Harvey, Liu & Zhu (2016) · Review of Financial Studies · DOI: 10.1093/rfs/hhv059

TL;DR

With 300+ published factors, the conventional significance hurdle (t-statistic > 2.0) is far too lenient: it all but guarantees a flood of false discoveries. Accounting for multiple testing, the authors argue that a newly proposed factor should clear a t-statistic of roughly 3.0 (a two-sided p-value of about 0.27%), and that the bar should keep rising as more factors are tested. Most of the "factor zoo" would not survive.

The problem it addresses

Empirical asset pricing has a multiple-testing crisis. Over several decades, researchers have tested thousands of candidate predictors and published the ones that cleared t > 2. But a t > 2 cutoff corresponds to roughly a 5% two-sided significance level, so about one in twenty tests of a worthless predictor will clear it purely by chance. Run enough tests and chance winners are certain to appear. The single-test threshold ignores the vast number of factors tried (including the unpublished, unreported ones), so the published cross-section is riddled with false positives: the "factor zoo."
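The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: it assumes the tests are independent, that every tested predictor is truly worthless, and that the t-statistic is approximately standard normal. The function names are made up for this example.

```python
from statistics import NormalDist


def two_sided_p(t_stat: float) -> float:
    """Two-sided p-value of a t-statistic under a standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def prob_any_false_positive(n_tests: int, t_cutoff: float = 2.0) -> float:
    """Chance that at least one of n independent null tests clears the cutoff."""
    p = two_sided_p(t_cutoff)
    return 1.0 - (1.0 - p) ** n_tests


def expected_false_positives(n_tests: int, t_cutoff: float = 2.0) -> float:
    """Expected number of null tests that clear the cutoff by luck alone."""
    return n_tests * two_sided_p(t_cutoff)


for n in (1, 10, 100, 300):
    print(n, round(prob_any_false_positive(n), 3), round(expected_false_positives(n), 1))
```

Under these assumptions, a false positive somewhere in the batch becomes close to certain long before the test count reaches the hundreds.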

Main findings

The 2.0 hurdle is obsolete. Under standard multiple-testing corrections, a t-stat of 2.0 corresponds to an unacceptably high false-discovery rate given the number of factors tested.

New threshold ≈ 3.0. After multiple-testing adjustments, a newly discovered factor needs a t-statistic of around 3.0 to be credible, and the required hurdle increases over time as the count of tested factors grows. The authors' estimates put it at roughly 3.0 or above already, rising past roughly 3.4 in later years.

Most published factors fail. A large fraction of the 300+ documented factors would not clear the corrected bar; many are likely spurious.

The true test count is understated. Because failed tests go unpublished (the file-drawer problem), the real number of trials — and thus the proper hurdle — is even higher than the published record implies.
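To see how a hurdle can grow with the number of tests, here is a rough sketch using only the Bonferroni rule, the simplest and most conservative of the corrections discussed below. It is not the authors' full estimation: it uses a normal approximation to the t-distribution, and `inflate_for_file_drawer` with its `unpublished_share` parameter is a hypothetical stand-in for the file-drawer adjustment, not a figure from the paper.

```python
import math
from statistics import NormalDist


def bonferroni_t_hurdle(n_tests: int, alpha: float = 0.05) -> float:
    """Two-sided critical value when alpha is split evenly across n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


def inflate_for_file_drawer(n_published: int, unpublished_share: float) -> int:
    """Hypothetical adjustment: total tests if a given share was never published."""
    if not 0.0 <= unpublished_share < 1.0:
        raise ValueError("unpublished_share must be in [0, 1)")
    return math.ceil(n_published / (1.0 - unpublished_share))


# Assumed scenario: 300 published tests, half of all tests never published.
n_total = inflate_for_file_drawer(300, unpublished_share=0.5)

for n in (1, 100, 300, n_total):
    print(n, round(bonferroni_t_hurdle(n), 2))
```

With a single test the function returns the familiar 1.96. Because Bonferroni is the strictest of the standard corrections, the values it gives for hundreds of tests sit above the roughly 3.0 headline figure, which reflects a blend of methods.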

Methodology

Compile a history of 300+ published factors with their reported t-statistics and publication dates.

Apply three multiple-testing frameworks: Bonferroni and Holm (control the family-wise error rate) and Benjamini-Hochberg-Yekutieli (control the false-discovery rate).

Translate each into a time-varying t-statistic hurdle, expressed as a function of the cumulative number of factors tested and adjusted for the file drawer of tests that were run but never published.

Re-evaluate the published factors against the corrected thresholds.
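The three corrections named above can be sketched in plain Python using their textbook definitions. This is an illustration under stated assumptions, not the authors' code: the p-values in `example_pvals` are invented for the example, and the functions operate on p-values rather than on the paper's factor history.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m. Controls the family-wise error rate."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down FWER control: test smallest p first, stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def bhy(pvals, alpha=0.05):
    """Benjamini-Hochberg-Yekutieli step-up procedure. Controls the FDR."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (m * c_m):
            cutoff_rank = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject


# Invented p-values, for illustration only.
example_pvals = [0.0001, 0.004, 0.012, 0.03, 0.045, 0.20]
for name, method in (("bonferroni", bonferroni), ("holm", holm), ("bhy", bhy)):
    print(name, method(example_pvals))
```

Holm never rejects fewer hypotheses than Bonferroni. BHY targets a different error measure (the share of false discoveries among rejections), so its ranking against the other two depends on the data.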

Implications for factor investing

Demand t ≈ 3.0+, not 2.0, for any newly claimed factor — and treat marginal (t between 2 and 3) "discoveries" as probably noise.

Out-of-sample validation is non-negotiable. Multiple-testing math says in-sample significance is cheap; only genuine out-of-sample performance (on data not used to find the signal) is persuasive — the principle ConvexPi's hidden evaluation period operationalizes.

Account for your own search. If you scan many signals, your personal hurdle must rise accordingly; report how many you tried, not just the winner.

Be skeptical of the zoo. Pair this with McLean-Pontiff (2016): even "real" factors decay post-publication, and many published ones were never real to begin with.
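The "account for your own search" point can be made concrete with a small simulation. The sketch below assumes every candidate signal is pure noise (i.i.d. standard normal returns with a true mean of zero) and reports the best absolute t-statistic found after scanning k of them. The setup, the 240-period sample length, and the function names are illustrative assumptions, not anything taken from the paper.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_stat(returns):
    """t-statistic of the mean return against zero."""
    return mean(returns) / (stdev(returns) / sqrt(len(returns)))


def best_of_k_noise(k, n_periods=240, seed=0):
    """Largest |t| among k simulated signals whose true mean return is zero."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(k):
        returns = [rng.gauss(0.0, 1.0) for _ in range(n_periods)]
        best = max(best, abs(t_stat(returns)))
    return best


for k in (1, 10, 100):
    print(k, round(best_of_k_noise(k), 2))
```

The winner's t-statistic tends to grow with the number of signals scanned even though none of them has any real edge, which is why the count of attempts belongs in the report alongside the winner.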

Key references

Harvey, C., Liu, Y. & Zhu, H. (2016) — …and the Cross-Section of Expected Returns — Review of Financial Studies — DOI: 10.1093/rfs/hhv059

Harvey, C. & Liu, Y. (2020) — False (and Missed) Discoveries in Financial Economics — Journal of Finance

McLean, R. D. & Pontiff, J. (2016) — Does Academic Research Destroy Stock Return Predictability? — Journal of Finance

Hou, K., Xue, C. & Zhang, L. (2020) — Replicating Anomalies — Review of Financial Studies

Chen, A. & Zimmermann, T. (2022) — Open Source Cross-Sectional Asset Pricing — Critical Finance Review