Does it actually time the market?

This report applies two classical market-timing regressions to the nightclaude walk-forward backtest: Treynor-Mazuy (1966) and Henriksson-Merton (1981). The strategy is benchmarked against both raw SPY and a leverage-matched SPY position that pays daily financing on the borrowed leg, so leverage and skill can be separated. Sub-period breakdowns isolate the COVID shock, the 2022 bear, and the 2023 to 2024 AI rally.

Sample

2,517 days

2016-05-18 → 2026-05-22

Avg target leverage

2.48×

CAPM β = 1.16 (realized daily market exposure is well below the 2.48× target)

Sharpe vs leverage-matched

1.84 → 0.68

Strategy delivers 2.69× the Sharpe of a static 2.48× position

CAPM α (annualized)

51.9%

p < 0.001 · HAC SE · invariant to k·SPY rescaling

Data window

The standard academic regime split (pre-2008 / 2009 to 2019 / 2020+) is not feasible with this dataset. The SPY history available to the backtest runs 2016-05-18 to 2026-05-22: no pre-2008 data, and the 2009 to 2019 window is only covered from 2016 onward. Sub-period analysis is therefore restricted to what the backtest can observe: full sample, 2016 to 2019 (pre-COVID), 2020+ (COVID onward), plus three finer regime cuts: 2020 Q1 to Q2 (the actual crash), 2022 rate-hike bear, and 2023 to 2024 AI rally. Pre-2008 conclusions would require data that does not exist in this cache.

Methodology

Three regressions are estimated for each sub-period, on daily excess returns (annual risk-free rate = 4%, matching the evaluation harness). Standard errors are Newey-West HAC with lag selection L = ⌊4·(n/100)^(2/9)⌋ to handle serial correlation in daily returns.
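As a minimal sketch of the lag rule above (the function name `newey_west_lag` is hypothetical and not taken from the repository), the bandwidth for any sample size can be computed directly:

```python
import math


def newey_west_lag(n_obs: int) -> int:
    """Lag rule L = floor(4 * (n / 100) ** (2 / 9)) for n observations."""
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    return math.floor(4 * (n_obs / 100) ** (2 / 9))


# Sample sizes taken from the tables in this report.
for label, n in [("full daily sample", 2517), ("monthly sample", 121)]:
    print(label, n, newey_west_lag(n))
```

Under this rule the lag length grows very slowly with n, so the daily and monthly regressions use bandwidths of the same order despite a twentyfold difference in sample size.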

CAPM (baseline)

r_p − r_f = α + β·(r_m − r_f) + ε

α is the unconditional excess return after controlling for market exposure.

Treynor-Mazuy (1966)

r_p − r_f = α + β·(r_m − r_f) + γ·(r_m − r_f)² + ε

γ > 0 means the payoff is convex in market return: exposure rises when the market is strong and falls when it is weak. This is the classical signature of positive market-timing skill.

Henriksson-Merton (1981)

r_p − r_f = α + β·(r_m − r_f) + γ·max(0, r_m − r_f) + ε

Equivalent to letting beta differ between up and down markets: β_down = β, β_up = β + γ. γ > 0 means higher beta in up markets.
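The two timing specifications differ only in the third regressor. The sketch below builds each design matrix and fits it by plain OLS in pure Python. It produces point estimates only (no Newey-West standard errors), the helper names are hypothetical, and the input series is synthetic rather than backtest output.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def treynor_mazuy(strategy_excess, market_excess):
    """Returns [alpha, beta, gamma] with a squared market term."""
    rows = [[1.0, x, x * x] for x in market_excess]
    return ols(rows, strategy_excess)


def henriksson_merton(strategy_excess, market_excess):
    """Returns [alpha, beta_down, gamma]; beta_up = beta_down + gamma."""
    rows = [[1.0, x, max(0.0, x)] for x in market_excess]
    return ols(rows, strategy_excess)


# Synthetic daily excess returns (illustrative values only).
market = [-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.015, -0.025]
convex = [0.001 + 1.2 * x + 3.0 * x * x for x in market]
alpha, beta, gamma = treynor_mazuy(convex, market)
print(alpha, beta, gamma)
```

Because the synthetic series is built with an exact quadratic term, the fit recovers the coefficients it was constructed from up to floating-point error. On real data the α estimate would still need annualizing and every coefficient would need a HAC standard error.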

Leverage-matched benchmark

Average target exposure to SPY is 2.48×. The corresponding passive comparator is r_lev = 2.48·r_SPY − (2.48 − 1)·r_f: a static 2.48× SPY position that pays daily financing on the 1.48 borrowed dollars. For descriptive statistics (Sharpe, total return, max DD, Calmar) this is the appropriate comparator and the skill-beyond-leverage premium is visible directly. For the regression α, this benchmark is mathematically equivalent to SPY: rescaling the regressor by k leaves α unchanged and divides β by k. The Sharpe gap, not the regression α, answers the question of whether skill exists beyond leverage.
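The benchmark construction and the Sharpe invariance it implies can be sketched as follows. The 252-day year, the simple division of the 4% annual rate into a daily rate, and the sample returns are assumptions for illustration; the function names are hypothetical.

```python
from statistics import mean, stdev


def leverage_matched(r_spy, rf_daily, k):
    """r_lev = k * r_spy - (k - 1) * rf, applied day by day."""
    return [k * r - (k - 1) * rf_daily for r in r_spy]


def sharpe(returns, rf_daily, periods_per_year=252):
    """Annualized Sharpe ratio of returns in excess of the daily risk-free rate."""
    excess = [r - rf_daily for r in returns]
    return mean(excess) / stdev(excess) * periods_per_year ** 0.5


rf = 0.04 / 252  # assumed daily rate from the flat 4% annual rate
spy = [0.012, -0.008, 0.004, 0.0, -0.015, 0.02, 0.003]  # illustrative returns
lev = leverage_matched(spy, rf, 2.48)

# Excess return of the levered position is exactly k times SPY excess return,
# so mean and standard deviation both scale by k and the Sharpe ratio is unchanged.
print(sharpe(spy, rf), sharpe(lev, rf))
```

This is why the leverage-matched Sharpe column in the performance table repeats the SPY Sharpe in every row, while return and drawdown scale up with leverage.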

Equity curve, drawdowns, exposure

Cumulative growth of $1 (log scale)

Strategy in terracotta, SPY (1×) navy dashed, leverage-matched (2.48× SPY financed) muted dashed. Log axis so multiplicative differences are visually proportional.

Drawdown comparison

Drawdown from rolling peak. Strategy drawdowns include slippage / spread / commission costs.

Target daily exposure (0× to 4× SPY)

Average target exposure over the full sample is 2.48×. The signal is clipped to max_position_pct = 4.0 by the position sizer. Days at the 4× cap: 1,200 (47.7% of days).

Performance summary

All strategy metrics are net of transaction costs (commission $0.005/share, spread 5 bps, slippage 3 bps, applied on every rebalance). The leverage-matched benchmark uses the period's actual average target leverage.

Period	Strategy ann ret (net of cost)	Strategy ann vol	Strategy Sharpe	Strategy max DD	Strategy Calmar	SPY (1×) ann ret	SPY (1×) Sharpe	SPY (1×) max DD	Avg lev	Leverage-matched (financed) ann ret	Leverage-matched Sharpe	Leverage-matched max DD
Full sample (2016-05 → 2026-05)	88.79%	35.91%	1.843	-29.28%	3.033	15.68%	0.684	-33.72%	2.48×	27.54%	0.684	-68.18%
2016 – 2019 (pre-COVID)	62.45%	32.77%	1.529	-29.28%	2.133	15.55%	0.906	-19.35%	2.79×	34.13%	0.906	-47.70%
2020 + (COVID onward)	105.35%	37.58%	2.000	-24.86%	4.238	15.72%	0.626	-33.72%	2.31×	24.85%	0.626	-64.99%
2020 Q1–Q2 (COVID crash & rebound)	111.27%	42.68%	1.869	-21.11%	5.272	-8.80%	-0.069	-33.72%	1.48×	-20.36%	-0.069	-46.86%
2022 (rate-hike bear market)	12.11%	24.98%	0.423	-16.39%	0.739	-18.24%	-0.871	-24.50%	0.56×	-8.55%	-0.871	-13.04%
2023 – 2024 (AI rally)	133.47%	37.62%	2.340	-24.86%	5.369	25.93%	1.557	-9.97%	2.88×	72.52%	1.557	-28.22%

Calmar = annualized return / |max drawdown|. "Insufficient data" appears when a sub-period has fewer than 30 trading days.
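For reference, the drawdown and Calmar definitions used in the table can be sketched as below. This is a simplified illustration with hypothetical function names, not the harness code.

```python
def max_drawdown(returns):
    """Most negative peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def calmar(annual_return, max_dd):
    """Calmar = annualized return / |max drawdown|."""
    if max_dd == 0:
        raise ValueError("Calmar is undefined when there is no drawdown")
    return annual_return / abs(max_dd)


# Toy return path: up 10%, down 20%, up 5%, down 10%.
print(max_drawdown([0.10, -0.20, 0.05, -0.10]))

# Full-sample inputs from the table (rounded, so the result is approximate).
print(calmar(0.8879, -0.2928))
```

Feeding in the rounded table values gives a Calmar of about 3.03 for the strategy, which matches the reported 3.033 up to rounding of the inputs.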

Treynor-Mazuy regressions

The γ column is the timing coefficient. *** = p < 0.01, ** = p < 0.05, * = p < 0.10. t-stats use Newey-West HAC.

Treynor-Mazuy against SPY excess return

Period	α (ann.)	t-stat	β	t-stat	γ (timing)	t-stat	R²	n
Full sample (2016-05 → 2026-05)	51.40%***	(4.58)	1.162***	(6.23)	0.163	(0.05)	33.6%	2,517
2016 – 2019 (pre-COVID)	40.81%***	(2.64)	1.830***	(7.51)	-7.294	(-0.71)	50.7%	910
2020 + (COVID onward)	60.25%***	(4.21)	1.014***	(5.43)	0.477	(0.15)	30.2%	1,606
2020 Q1–Q2 (COVID crash & rebound)	66.34%	(1.13)	0.352**	(2.40)	0.731	(0.58)	13.2%	124
2022 (rate-hike bear market)	17.59%	(0.83)	0.409***	(4.11)	0.277	(0.11)	15.8%	251
2023 – 2024 (AI rally)	4.86%	(0.29)	2.454***	(14.15)	20.653*	(1.70)	70.2%	501

Treynor-Mazuy against leverage-matched benchmark

Period	α (ann.)	t-stat	β	t-stat	γ (timing)	t-stat	R²	n
Full sample (2016-05 → 2026-05)	51.40%***	(4.58)	0.468***	(6.23)	0.027	(0.05)	33.6%	2,517
2016 – 2019 (pre-COVID)	40.81%***	(2.64)	0.656***	(7.51)	-0.936	(-0.71)	50.7%	910
2020 + (COVID onward)	60.25%***	(4.21)	0.440***	(5.43)	0.090	(0.15)	30.2%	1,606
2020 Q1–Q2 (COVID crash & rebound)	66.34%	(1.13)	0.238**	(2.40)	0.334	(0.58)	13.2%	124
2022 (rate-hike bear market)	17.59%	(0.83)	0.725***	(4.11)	0.868	(0.11)	15.8%	251
2023 – 2024 (AI rally)	4.86%	(0.29)	0.852***	(14.15)	2.491*	(1.70)	70.2%	501

Henriksson-Merton regressions

β_down applies when SPY excess return is negative; β_up = β_down + γ applies when it's positive. γ > 0 means the strategy leans into up markets harder than down markets.

Henriksson-Merton against SPY excess return

Period	α (ann.)	t-stat	β_down	β_up	γ (β diff)	t-stat	R²	n
Full sample (2016-05 → 2026-05)	36.77%*	(1.86)	1.083	1.248	0.165	(0.64)	33.7%	2,517
2016 – 2019 (pre-COVID)	22.80%	(0.93)	1.818	1.912	0.094	(0.20)	50.4%	910
2020 + (COVID onward)	41.29%*	(1.68)	0.918	1.115	0.197	(0.73)	30.3%	1,606
2020 Q1–Q2 (COVID crash & rebound)	54.00%	(1.02)	0.292	0.407	0.115	(0.50)	13.2%	124
2022 (rate-hike bear market)	13.43%	(0.69)	0.390	0.428	0.038	(0.18)	15.8%	251
2023 – 2024 (AI rally)	-16.88%	(-0.78)	2.075	2.800	0.725*	(1.92)	70.0%	501

Henriksson-Merton against leverage-matched benchmark

Period	α (ann.)	t-stat	β_down	β_up	γ (β diff)	t-stat	R²	n
Full sample (2016-05 → 2026-05)	36.77%*	(1.86)	0.437	0.503	0.066	(0.64)	33.7%	2,517
2016 – 2019 (pre-COVID)	22.80%	(0.93)	0.651	0.685	0.034	(0.20)	50.4%	910
2020 + (COVID onward)	41.29%*	(1.68)	0.398	0.484	0.085	(0.73)	30.3%	1,606
2020 Q1–Q2 (COVID crash & rebound)	54.00%	(1.02)	0.197	0.275	0.078	(0.50)	13.2%	124
2022 (rate-hike bear market)	13.43%	(0.69)	0.691	0.759	0.068	(0.18)	15.8%	251
2023 – 2024 (AI rally)	-16.88%	(-0.78)	0.721	0.972	0.252*	(1.92)	70.0%	501

CAPM baselines (for reference)

CAPM (linear) against SPY excess return

Period	α (ann.)	t-stat	β	t-stat	R²	n
Full sample (2016-05 → 2026-05)	51.93%***	(5.54)	1.162***	(6.14)	33.6%	2,517
2016 – 2019 (pre-COVID)	29.04%**	(2.38)	1.860***	(7.44)	50.4%	910
2020 + (COVID onward)	62.24%***	(5.01)	1.013***	(5.36)	30.2%	1,606
2020 Q1–Q2 (COVID crash & rebound)	80.84%	(1.33)	0.344**	(2.38)	13.0%	124
2022 (rate-hike bear market)	19.20%	(0.75)	0.409***	(4.14)	15.8%	251
2023 – 2024 (AI rally)	39.27%***	(2.91)	2.443***	(13.38)	69.3%	501

Why α is identical against SPY and against k·SPY

Algebraically: if r_lev − r_f = k·(r_m − r_f), then regressing strategy excess return on this rescaled regressor leaves α unchanged and divides β by k. The same logic gives γ → γ/k² for Treynor-Mazuy and γ → γ/k for Henriksson-Merton, with all t-statistics invariant. The leverage-matched regression tables in this report are included for completeness. Their α column and γ t-statistics duplicate the SPY tables; only the β and γ point estimates differ in scale.
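The invariance can be checked numerically with a closed-form single-regressor fit. The return series below are made-up illustrative values and `capm_fit` is a hypothetical helper, not part of the report's pipeline.

```python
from statistics import mean


def capm_fit(y, x):
    """Single-regressor OLS; returns (alpha, beta)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta


k = 2.48
market_excess = [0.010, -0.007, 0.003, -0.012, 0.018, 0.001, -0.004]
strategy_excess = [0.015, -0.004, 0.006, -0.009, 0.020, 0.004, 0.001]

alpha_spy, beta_spy = capm_fit(strategy_excess, market_excess)
alpha_lev, beta_lev = capm_fit(strategy_excess, [k * x for x in market_excess])

# alpha is unchanged; beta against the rescaled regressor equals beta / k.
print(alpha_spy, alpha_lev)
print(beta_spy / k, beta_lev)
```

The same scaling applied to the full-sample table gives 1.162 / 2.48 ≈ 0.468, the β reported against the leverage-matched benchmark.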

CAPM against leverage-matched benchmark (2.48× SPY, financed)

Period	α (ann.)	t-stat	β	t-stat	R²	n
Full sample (2016-05 → 2026-05)	51.93%***	(5.54)	0.468***	(6.14)	33.6%	2,517
2016 – 2019 (pre-COVID)	29.04%**	(2.38)	0.666***	(7.44)	50.4%	910
2020 + (COVID onward)	62.24%***	(5.01)	0.439***	(5.36)	30.2%	1,606
2020 Q1–Q2 (COVID crash & rebound)	80.84%	(1.33)	0.232**	(2.38)	13.0%	124
2022 (rate-hike bear market)	19.20%	(0.75)	0.725***	(4.14)	15.8%	251
2023 – 2024 (AI rally)	39.27%***	(2.91)	0.849***	(13.38)	69.3%	501

Classical monthly-frequency tests

The Treynor-Mazuy and Henriksson-Merton tests have traditionally been applied to lower-frequency (typically monthly) mutual-fund returns. We resample our daily backtest to month-end compounded returns (121 observations) and re-run the regressions. The monthly tests have lower power than the daily ones, but monthly is historically the standard frequency for these tests and keeps the results comparable with the published mutual-fund literature.

Model	α (ann.)	t-stat	β	γ	t-stat γ	R²	n months
CAPM	50.81%***	(4.89)	1.330	·	·	33.1%	121
Treynor-Mazuy	38.74%***	(3.52)	1.332	4.989*	(1.75)	35.3%	121
Henriksson-Merton	26.04%*	(1.89)	β_d=0.728 β_u=1.908	1.180**	(2.04)	35.8%	121
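The month-end resampling behind the table above can be sketched as compounding daily returns within each calendar month. The helper below is a hypothetical illustration; how the pipeline forms monthly excess returns (for example, whether the risk-free rate is subtracted before or after compounding) is not stated in this report and is left out.

```python
from datetime import date
from itertools import groupby
from math import prod


def monthly_compound(dates, daily_returns):
    """Compound daily returns within each calendar month.

    Inputs must be sorted by date. Returns a list of ((year, month), return).
    """
    pairs = zip(dates, daily_returns)
    out = []
    for key, grp in groupby(pairs, key=lambda p: (p[0].year, p[0].month)):
        out.append((key, prod(1.0 + r for _, r in grp) - 1.0))
    return out


# Illustrative values only.
dates = [date(2024, 1, 30), date(2024, 1, 31), date(2024, 2, 1)]
returns = [0.01, 0.02, -0.01]
for (year, month), r in monthly_compound(dates, returns):
    print(year, month, round(r, 6))
```

Compounding rather than summing matters for a strategy with daily volatility this high, since the two can diverge noticeably over a month.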

Monthly excess returns: strategy vs SPY

Each dot is one month. Curve = Treynor-Mazuy quadratic fit (γ = 4.989). Dashed line = the linear CAPM fit (γ = 0). Upward-curving = positive timing; flat = pure beta; downward-curving = anti-timing.

How to read this

The CAPM line says: "for every 1% the market moves up, the strategy moves β%." Treynor-Mazuy adds a curvature term. If γ is positive and meaningful, the strategy's actual response steepens as the market moves further from zero, exactly what active market-timing produces.

A levered always-on long position would show a straight line with slope = average leverage. A negative γ at the monthly frequency would mean a "short-volatility" payoff profile, earning small amounts most of the time and giving it all back in tail months.

Findings

Sharpe gap = +1.16 (strategy 1.84 vs leverage-matched 0.68). Because Sharpe of k·SPY − (k−1)·rf mathematically equals Sharpe of SPY, this gap is the skill-beyond-leverage premium. A pure levered long would deliver 0.68; the strategy delivers 1.84.
Annualized return gap = 61.25 percentage points (88.79% strategy vs 27.54% static 2.48× SPY financed). Even net of financing cost, a passive 2.48× position would return 27.54% per year. The strategy outperforms it by 61.25 percentage points per year.
Drawdown comparison: strategy max DD = -29.28%; static 2.48× SPY max DD = -68.18%. Strategy Calmar = 3.033, static 2.48× Calmar = 0.404. Risk-adjusted by drawdown, the strategy is 7.5× more efficient.
CAPM α = 51.93% annualized (p < 0.001, HAC SE). This is the daily-frequency α controlling for SPY beta. It is numerically identical whether benchmarked against raw SPY or k·SPY financed (α is invariant to riskless rescaling of the regressor). What does change is β: 1.16 vs SPY, much lower than the 2.48× average target leverage. Rescaled against the k·SPY benchmark, β = 0.47, well below 1. This is the vol-targeting fingerprint: the strategy systematically reduces exposure on volatile days, so its daily co-movement with the market is roughly half what static 2.48× SPY would produce.
Daily T-M γ = 0.163 (t = 0.05, p = 0.962), no detectable daily timing convexity. At daily frequency, the strategy's payoff is approximately linear in the market return (after controlling for β). The skill shows up in α, not in γ.
Monthly T-M γ = 4.989 (t = 1.75, p = 0.083), marginal evidence (10% level) of a convex payoff at the monthly horizon. This is weak evidence, but the sign is in the direction that timing skill predicts.
Monthly Henriksson-Merton γ = 1.180 (t = 2.04, p = 0.043), significant at 5%. β_down = 0.73, β_up = 1.91. The strategy carries 1.18 more units of market exposure in up months than in down months. This is the classical signature of positive market-timing skill, obtained with the Henriksson-Merton test at its traditional monthly frequency.
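The headline gaps in the findings above follow from the performance and monthly tables by simple arithmetic. A short sketch, using only the rounded figures printed in this report:

```python
# Full-sample figures from the performance summary table (rounded as printed).
strategy = {"ann_ret": 88.79, "sharpe": 1.843, "max_dd": -29.28, "calmar": 3.033}
lev_matched = {"ann_ret": 27.54, "sharpe": 0.684, "max_dd": -68.18}

sharpe_gap = strategy["sharpe"] - lev_matched["sharpe"]
sharpe_ratio = strategy["sharpe"] / lev_matched["sharpe"]
return_gap_pp = strategy["ann_ret"] - lev_matched["ann_ret"]
calmar_lev = lev_matched["ann_ret"] / abs(lev_matched["max_dd"])
calmar_multiple = strategy["calmar"] / calmar_lev

# Monthly Henriksson-Merton: beta_up = beta_down + gamma.
beta_up = 0.728 + 1.180

print(round(sharpe_gap, 2), round(sharpe_ratio, 2))
print(round(return_gap_pp, 2))
print(round(calmar_lev, 3), round(calmar_multiple, 1))
print(round(beta_up, 3))
```

Because the inputs are already rounded, the last digit of a derived figure can differ slightly from a value computed on unrounded backtest output.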

How to read these comparisons

The leverage-matched comparison is the correct one for descriptive statistics (Sharpe, total return, max DD, Calmar). The relevant gap is in the performance table above: strategy Sharpe 1.84 versus static 2.48× SPY Sharpe 0.68.

For regression-based tests, the choice of SPY versus k·SPY does not affect α or the t-statistic on γ. The leverage-matched benchmark is a scaled version of SPY excess return, so OLS rescales β and γ but α and t-statistics are invariant. The question of whether skill exists beyond leverage is therefore answered by the Sharpe gap, not by regressing against k·SPY.

Sub-period stability is the second important check. A strategy with positive γ in calm regimes (2016 to 2019, 2023 to 2024) and negative γ in stress (2020 Q1, 2022) is a short-vol trade in disguise. A strategy whose γ does not collapse through 2020 Q1 and 2022 is exhibiting genuine market timing.

Limitations

10-year sample only. The dataset does not include the 2000 dot-com bust, the 2008 financial crisis, or the 1970s stagflation. Strategy parameters were tuned over many iterations on overlapping data, so a Deflated Sharpe penalty applies and live performance should be expected to be lower than the in-sample numbers shown here.
Single-asset backtest. The walk-forward backtest uses SPY only. Live execution maps the same signal across SGOV / SPY / SSO / UPRO via a piecewise blend, which introduces ETF-specific tracking error and intra-day rebalance slippage that this analysis does not capture.
Risk-free rate is a flat 4%. Actual short-term Treasury rates were well below 4% from 2016 through early 2022 and above 5% for much of 2023 and 2024. A time-varying rate (such as the 3-month T-bill from FRED) would shift the leverage-matched benchmark slightly, though not materially, because financing cost on the (k − 1) borrowed dollars is small at the average exposure.
Asymptotic inference. Newey-West HAC controls for serial correlation but assumes the autocorrelation structure is well-approximated by the chosen lag length. With ~2,500 daily observations the asymptotics are reliable; with ~120 monthly observations they are weaker.
Multiple-testing concern. The strategy has been iterated hundreds of times on this exact dataset. The walk-forward harness mitigates but does not eliminate this. The Deflated Sharpe Ratio in the evaluation harness applies the formal correction.

Replication

All numbers in this report are produced by analysis/timing_tests.py in the nightclaude repository. The full pipeline runs in under a second: load SPY data, run the walk-forward backtest, compute the regressions, and render the HTML. The strategy code is whatever strategy.py contains at the time of the run; the backtest configuration uses StrategyConfig() defaults.

Treynor & Mazuy (1966), "Can Mutual Funds Outguess the Market?", Harvard Business Review 44, 131-136.
Henriksson & Merton (1981), "On Market Timing and Investment Performance. II. Statistical Procedures for Evaluating Forecasting Skills", Journal of Business 54, 513-533.

Report generated 2026-05-23 22:48. Backtest window: 2016-05-18 to 2026-05-22.
