Empirical Asset Pricing via Machine Learning

Source: Gu, S., Kelly, B. & Xiu, D. (2020) · Review of Financial Studies 33(5), 2223–2273 · doi:10.1093/rfs/hhaa009

TL;DR

A disciplined horse race of machine-learning methods for measuring equity risk premiums out of sample: penalized linear models (elastic net), dimension reduction (PCR, PLS), generalized linear models with splines, random forests, gradient-boosted trees, and neural networks. Trees and neural networks win, and the authors trace their edge to nonlinear predictor interactions that other methods miss. Neural-network performance peaks at three hidden layers and declines with further depth. Stock-level monthly out-of-sample R² is small in absolute terms (~0.40% for the best model) but economically large: a value-weighted long-short decile spread on NN forecasts earns an annualized OOS Sharpe ratio of 1.35, more than double that of a leading regression-based benchmark. All methods agree that the dominant signals are momentum, liquidity, and volatility.

Problem it solves

The asset-pricing predictor space is high-dimensional and likely nonlinear: many firm characteristics, their interactions, and conditioning on macroeconomic state. Linear factor models cannot exploit interactions or nonlinearity, and naive high-dimensional regression overfits because return data have a very low signal-to-noise ratio. ML can help, but only with regularization, dimension reduction, and strict out-of-sample discipline.

The method

Models compared: OLS (with Huber loss), elastic net (ENet), principal components regression (PCR), partial least squares (PLS), generalized linear model with splines (GLM), random forest (RF), gradient-boosted regression trees (GBR), and feed-forward neural networks NN1–NN5 (one to five hidden layers).

Common objective: predict each stock's next-month excess return; methods differ in how they regularize and how they introduce nonlinearity. Hyperparameters are tuned on a validation sample; ensembling and a robust (Huber) loss are used to stabilize the fits.
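The robust loss referred to above can be sketched as follows. This is a minimal illustration of one common parametrization of the Huber loss (squared error for small residuals, linear growth for large ones). The function names and the threshold value `xi` are assumptions for illustration; in practice `xi` would be a tuned hyperparameter.

```python
def huber(residual: float, xi: float) -> float:
    """Huber loss: squared error inside [-xi, xi], linear growth outside."""
    a = abs(residual)
    if a <= xi:
        return a * a
    # Linear branch, chosen so the loss is continuous at |residual| == xi.
    return 2.0 * xi * a - xi * xi


def mean_huber(y_true, y_pred, xi):
    """Average Huber loss over paired observations and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(huber(y - p, xi) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical monthly excess returns with one extreme observation.
realized = [0.01, -0.02, 0.50]
forecast = [0.00, 0.00, 0.00]
loss = mean_huber(realized, forecast, xi=0.05)
```

Compared with squared error, the extreme 0.50 observation contributes linearly instead of quadratically, so it has less leverage over the fit.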

Outputs assessed: out-of-sample predictive R²_oos, value-weighted long-short decile portfolio Sharpe ratios, and variable-importance rankings; Diebold–Mariano tests compare models.
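A minimal sketch of the out-of-sample R² computation. It assumes the convention of benchmarking against a forecast of zero, so the denominator is the sum of squared excess returns without demeaning; that detail is not stated in this summary and should be checked against the paper. The function name `r2_oos` is hypothetical.

```python
def r2_oos(realized, predicted):
    """Out-of-sample R^2 against a zero-forecast benchmark.

    R2 = 1 - sum((r - r_hat)^2) / sum(r^2), pooled over all
    stock-month observations in the test sample.
    """
    if len(realized) != len(predicted):
        raise ValueError("realized and predicted must have the same length")
    sse = sum((r - p) ** 2 for r, p in zip(realized, predicted))
    sst = sum(r ** 2 for r in realized)  # not demeaned: benchmark forecast is 0
    if sst == 0:
        raise ValueError("realized returns are all zero")
    return 1.0 - sse / sst


# Hypothetical excess returns and forecasts for three stock-months.
realized = [0.02, -0.01, 0.03]
predicted = [0.01, 0.00, 0.01]
score = r2_oos(realized, predicted)
```

Under this convention a model that always forecasts zero scores exactly 0, so a positive value means the model beats the zero forecast rather than the historical mean.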

Assumptions & inputs

Data: nearly 30,000 individual US stocks over 60 years, 1957–2016.

Predictors: 94 firm characteristics, their interactions with 8 aggregate time-series variables, and 74 industry dummies, totaling more than 900 baseline signals (94 × (8 + 1) + 74 = 920; some methods expand further via nonlinear transformations).
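The signal count can be reproduced with a short sketch. It assumes the interactions are formed by multiplying each characteristic by a macro vector that includes a constant, which is why the raw characteristics appear alongside their 8 interactions: 94 × (8 + 1) + 74 = 920. The function name and the placeholder input values are hypothetical.

```python
def build_signals(chars, macro, industry_dummies):
    """Build one stock-month feature vector.

    Each characteristic is multiplied by [1, x_1, ..., x_K], where the x_k are
    the aggregate time-series variables; industry dummies are appended.
    """
    macro_with_const = [1.0] + list(macro)
    interactions = [c * x for x in macro_with_const for c in chars]
    return interactions + list(industry_dummies)


# Placeholder values; only the dimensions matter here.
n_chars, n_macro, n_industries = 94, 8, 74
z = build_signals([0.5] * n_chars, [0.1] * n_macro, [0] * n_industries)
print(len(z))  # 920
```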

Evaluation: a recursive expanding-window scheme splitting data into training / validation / out-of-sample blocks, so all reported results are genuinely out of sample.
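The split scheme can be sketched as a generator of year ranges. The window lengths used here (18 years of initial training, 12 years of validation, annual refits) are assumptions not stated in this summary, as is the choice to keep the validation window at a fixed length while the training window grows. The function name `expanding_splits` is hypothetical.

```python
def expanding_splits(first_year, last_year, train_years, val_years):
    """Return a list of (train_range, validation_range, test_year) tuples.

    The training window always starts at first_year and grows by one year per
    refit; the validation window keeps a fixed length and rolls forward; the
    test block is the single year that follows validation.
    """
    splits = []
    test_year = first_year + train_years + val_years
    while test_year <= last_year:
        val_start = test_year - val_years
        train = (first_year, val_start - 1)
        validation = (val_start, test_year - 1)
        splits.append((train, validation, test_year))
        test_year += 1
    return splits


splits = expanding_splits(1957, 2016, train_years=18, val_years=12)
first_train, first_val, first_test = splits[0]
```

With these inputs the first refit trains on 1957–1974, validates on 1975–1986, and tests on 1987; no test year ever precedes the data used to fit or tune the model that forecasts it.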

How to use it

Prefer shallow neural nets (NN3 is best here) and ensemble trees over both linear models and very deep networks; the gains come from interactions plus honest OOS tuning, not model size.

Headline figures: NN R²_oos rises from 0.33% (NN1) to a peak of 0.40% (NN3); RF/GBR ~0.33–0.34%; ENet ~0.11%, PCR/PLS ~0.26–0.27%. NN long-short decile spread Sharpe 1.35; NN-timed S&P 500 Sharpe 0.77 vs 0.51 buy-and-hold.
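The long-short Sharpe figures come from a sort-and-hold exercise that can be sketched as below. This is a simplified illustration, not the paper's exact procedure: it assumes equal-count deciles formed on the forecast, value weights within each leg, and annualization of the monthly Sharpe ratio by √12. All function names and inputs are hypothetical.

```python
import math
import statistics


def value_weighted_return(returns, caps):
    """Market-cap-weighted average return of a group of stocks."""
    return sum(r * c for r, c in zip(returns, caps)) / sum(caps)


def long_short_decile_return(forecasts, caps, realized, n_bins=10):
    """One month's return: long the top forecast decile, short the bottom one."""
    order = sorted(range(len(forecasts)), key=lambda i: forecasts[i])
    k = len(order) // n_bins
    if k == 0:
        raise ValueError("need at least n_bins stocks")
    bottom, top = order[:k], order[-k:]
    long_leg = value_weighted_return([realized[i] for i in top], [caps[i] for i in top])
    short_leg = value_weighted_return([realized[i] for i in bottom], [caps[i] for i in bottom])
    return long_leg - short_leg


def annualized_sharpe(monthly_returns):
    """Annualized Sharpe ratio of a monthly zero-investment return series."""
    mu = statistics.mean(monthly_returns)
    sigma = statistics.stdev(monthly_returns)  # sample standard deviation
    return math.sqrt(12) * mu / sigma
```

Applying `long_short_decile_return` month by month over the test sample and passing the resulting series to `annualized_sharpe` gives a figure comparable in construction to the Sharpe ratios quoted above, before trading costs.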

Interpret models via variable importance: momentum, liquidity, and volatility variables dominate across all methods.
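One way to compute such a ranking is sketched below. It assumes importance is measured as the drop in R² when a single predictor is set to zero while the fitted model is held fixed, with the drops normalized to sum to one. This definition is an assumption of the sketch rather than something stated in this summary, and `predict` stands in for any fitted model's prediction function.

```python
def r2_zero_benchmark(realized, predicted):
    """R^2 against a zero forecast (denominator not demeaned)."""
    sse = sum((r - p) ** 2 for r, p in zip(realized, predicted))
    sst = sum(r ** 2 for r in realized)
    return 1.0 - sse / sst


def zeroing_importance(predict, X, y):
    """Normalized R^2 reduction from zeroing each predictor in turn.

    predict: callable mapping one feature row (list of floats) to a forecast.
    X: list of feature rows; y: list of realized returns.
    """
    base = r2_zero_benchmark(y, [predict(list(row)) for row in X])
    drops = []
    for j in range(len(X[0])):
        # Zero out column j only; the model itself is not refit.
        X_zero = [list(row[:j]) + [0.0] + list(row[j + 1:]) for row in X]
        drops.append(base - r2_zero_benchmark(y, [predict(row) for row in X_zero]))
    total = sum(drops)
    if total == 0:
        return [0.0] * len(drops)
    return [d / total for d in drops]


# Toy model in which only the first predictor matters.
toy_predict = lambda row: 2.0 * row[0]
importance = zeroing_importance(toy_predict, [[1.0, 5.0], [2.0, -3.0], [3.0, 1.0]], [2.0, 4.0, 6.0])
```

Because the model is not refit, this measures how much the fitted forecasts rely on a predictor, not whether a correlated predictor could substitute for it.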

Limitations & pitfalls

Absolute predictive R² is tiny because returns are inherently low-signal, which bounds the achievable gains; portfolio results are gross of trading costs.

Performance depends on careful regularization and the expanding-window protocol; naive deep nets or in-sample tuning overfit.

Variable importance shows which signals matter, not the economic mechanism behind them; structural/causal interpretation is limited.

Key references

Gu, S., Kelly, B. & Xiu, D. (2020) — Empirical Asset Pricing via Machine Learning — Review of Financial Studies

Kelly, B., Pruitt, S. & Su, Y. (2019) — Characteristics Are Covariances (IPCA) — Journal of Financial Economics

Kozak, S., Nagel, S. & Santosh, S. (2020) — Shrinking the Cross-Section — Journal of Financial Economics

Provenance: verified/generated from the paper's full text.