nightclaude.

Does it actually time the market?

This report applies two classical market-timing regressions to the nightclaude walk-forward backtest: Treynor-Mazuy (1966) and Henriksson-Merton (1981). The strategy is benchmarked against both raw SPY and a leverage-matched SPY position that pays daily financing on the borrowed leg, so leverage and skill can be separated. Sub-period breakdowns isolate the COVID shock, the 2022 bear, and the 2023 to 2024 AI rally.

Sample
2,517 days
2016-05-18 → 2026-05-22
Avg target leverage
2.48×
CAPM β = 1.16 (realized daily exposure well below the 2.48× target)
Sharpe vs leverage-matched
1.84 → 0.68
Strategy delivers 2.69× the Sharpe of a static 2.48× position
CAPM α (annualized)
51.9%
p < 0.001 · HAC SE · invariant to k·SPY rescaling

Data window

The standard academic regime split (pre-2008 / 2009 to 2019 / 2020+) is not feasible with this dataset. The SPY history available to the backtest runs 2016-05-18 to 2026-05-22: no pre-2008 data, and the 2009 to 2019 window is only covered from 2016 onward. Sub-period analysis is therefore restricted to what the backtest can observe: full sample, 2016 to 2019 (pre-COVID), 2020+ (COVID onward), plus three finer regime cuts: 2020 Q1 to Q2 (the actual crash), 2022 rate-hike bear, and 2023 to 2024 AI rally. Pre-2008 conclusions would require data that does not exist in this cache.

Methodology

Three regressions are estimated for each sub-period on daily excess returns (annual risk-free rate = 4%, matching the evaluation harness). Standard errors are Newey-West HAC with lag selection L = ⌊4·(n/100)^(2/9)⌋ to handle serial correlation in daily returns.
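The lag rule above is easy to evaluate for the sample sizes that appear in this report. The sketch below is illustrative only: the function name `newey_west_lags` is an assumption, not a name taken from analysis/timing_tests.py.

```python
import math

def newey_west_lags(n: int) -> int:
    """Rule-of-thumb HAC lag length: L = floor(4 * (n / 100) ** (2 / 9))."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return math.floor(4 * (n / 100) ** (2 / 9))

# Sample sizes taken from the tables in this report.
samples = [
    ("full sample, daily", 2517),
    ("2016 to 2019, daily", 910),
    ("2020 Q1 to Q2, daily", 124),
    ("full sample, monthly", 121),
]
for label, n in samples:
    print(f"{label}: n = {n}, L = {newey_west_lags(n)}")
```

Because the exponent is 2/9, the lag length grows slowly: a twenty-fold increase in observations roughly doubles L.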

CAPM (baseline)

rp − rf = α + β·(rm − rf) + ε

α is the unconditional excess return after controlling for market exposure.

Treynor-Mazuy (1966)

rp − rf = α + β·(rm − rf) + γ·(rm − rf)² + ε

γ > 0 means the payoff is convex in the market return: exposure rises when the market is strong and falls when it is weak. This is the classical signature of positive market-timing skill.

Henriksson-Merton (1981)

rp − rf = α + β·(rm − rf) + γ·max(0, rm − rf) + ε

Equivalent to letting beta differ between up and down markets: βdown = β, βup = β + γ. γ > 0 means higher beta in up markets.
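The three specifications differ only in the third regressor, so they can share one estimation routine. The following is a minimal sketch under stated assumptions, not the code in analysis/timing_tests.py: the function names `ols_hac` and `timing_design` are hypothetical, the data are synthetic, and the HAC estimator is a plain Bartlett-kernel Newey-West with no small-sample correction.

```python
import numpy as np

def ols_hac(y, X, lags):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    Xu = X * resid[:, None]            # per-observation scores x_t * e_t
    S = Xu.T @ Xu                      # lag-0 term
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1)       # Bartlett weight for lag j
        G = Xu[j:].T @ Xu[:-j]         # sum of u_t u_{t-j}'
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    cov = XtX_inv @ S @ XtX_inv        # sandwich covariance
    return coef, np.sqrt(np.diag(cov))

def timing_design(mkt_excess, model):
    """Design matrix [1, x] plus the model-specific timing regressor."""
    x = np.asarray(mkt_excess, dtype=float)
    const = np.ones_like(x)
    if model == "capm":
        return np.column_stack([const, x])
    if model == "tm":                  # Treynor-Mazuy: squared market excess return
        return np.column_stack([const, x, x ** 2])
    if model == "hm":                  # Henriksson-Merton: max(0, market excess return)
        return np.column_stack([const, x, np.maximum(0.0, x)])
    raise ValueError(f"unknown model: {model}")

# Synthetic data with a known convex payoff (all values are made up).
rng = np.random.default_rng(0)
mkt = rng.normal(0.0004, 0.011, size=2517)
noise = rng.normal(0.0, 0.005, size=mkt.size)
strat_tm = 0.001 + 1.2 * mkt + 3.0 * mkt ** 2 + noise

coef, se = ols_hac(strat_tm, timing_design(mkt, "tm"), lags=8)
for name, c, s in zip(("alpha", "beta", "gamma"), coef, se):
    print(f"{name}: {c:.4f} (t = {c / s:.2f})")
```

With statsmodels available, `sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": L})` gives the same kind of estimate; the hand-rolled version is shown only to make the sandwich formula explicit.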

Leverage-matched benchmark

Average target exposure to SPY is 2.48×. The corresponding passive comparator is rlev = 2.48·rSPY − (2.48 − 1)·rf: a static 2.48× SPY position that pays daily financing on the 1.48 borrowed dollars. For descriptive statistics (Sharpe, total return, max DD, Calmar) this is the appropriate comparator and the skill-beyond-leverage premium is visible directly. For the regression α, this benchmark is mathematically equivalent to SPY: rescaling the regressor by k leaves α unchanged and divides β by k. The Sharpe gap, not the regression α, answers the question of whether skill exists beyond leverage.
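The claim that the financed k× position has the same Sharpe ratio as SPY follows from rlev − rf = k·(rSPY − rf): both the mean and the volatility of excess returns scale by k. A minimal sketch on synthetic returns, assuming simple daily accrual of the 4% rate over 252 days (the harness may convert the rate differently) and using the hypothetical helper names `leverage_matched` and `sharpe`:

```python
import math
import random

TRADING_DAYS = 252
RF_DAILY = 0.04 / TRADING_DAYS   # assumption: flat 4% accrued simply per trading day

def leverage_matched(spy_returns, k, rf_daily=RF_DAILY):
    """Static k-times SPY position paying financing on the (k - 1) borrowed dollars."""
    return [k * r - (k - 1.0) * rf_daily for r in spy_returns]

def sharpe(returns, rf_daily=RF_DAILY):
    """Annualized Sharpe ratio of daily returns over the daily risk-free rate."""
    excess = [r - rf_daily for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(TRADING_DAYS)

random.seed(0)
spy = [random.gauss(0.0006, 0.011) for _ in range(2517)]   # synthetic, not real SPY data
lev = leverage_matched(spy, k=2.48)

print(f"Sharpe SPY:          {sharpe(spy):.6f}")
print(f"Sharpe 2.48x financed: {sharpe(lev):.6f}")
```

The two printed values agree to floating-point precision, which is why the SPY and leverage-matched Sharpe columns in the performance table are identical.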

Equity curve, drawdowns, exposure

Cumulative growth of $1 (log scale)

[Figure: cumulative growth of $1 on a log axis (10× and 100× gridlines), 2016 to 2026. Series: strategy (net of cost), SPY (1×), 2.48× SPY financed.]
Strategy in terracotta, SPY (1×) navy dashed, leverage-matched (2.48× SPY financed) muted dashed. Log axis so multiplicative differences are visually proportional.

Drawdown comparison

[Figure: drawdown from rolling peak, y-axis 0% to -60%, 2016 to 2026.]
Drawdown from rolling peak. Strategy drawdowns include slippage / spread / commission costs.

Target daily exposure (0× to 4× SPY)

[Figure: target daily exposure, 2016 to 2026, with the 2.48× average marked.]
Average target exposure over the full sample is 2.48×. The signal is clipped to max_position_pct = 4.0 by the position sizer. Days at the 4× cap: 1,200 (47.7% of days).
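The clipping step and the days-at-cap count can be sketched as follows. Only the cap value `max_position_pct = 4.0` comes from the report; the function name `clip_exposure`, the 0× floor (read off the chart title), and the sample signal values are illustrative assumptions.

```python
MAX_POSITION_PCT = 4.0   # cap stated in the report

def clip_exposure(raw_signal, lo=0.0, hi=MAX_POSITION_PCT):
    """Clamp a raw exposure signal to the [lo, hi] range."""
    return max(lo, min(hi, raw_signal))

# Made-up raw signals, some outside the allowed range.
raw = [-0.5, 1.2, 3.9, 4.0, 5.7, 2.48]
target = [clip_exposure(s) for s in raw]

days_at_cap = sum(1 for t in target if t >= MAX_POSITION_PCT)
share_at_cap = days_at_cap / len(target)
print(target, days_at_cap, f"{share_at_cap:.1%}")
```

Applied to the report's own counts, 1,200 capped days out of 2,517 gives the 47.7% quoted above.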

Performance summary

All strategy metrics are net of transaction costs (commission $0.005/share, spread 5 bps, slippage 3 bps, applied on every rebalance). The leverage-matched benchmark uses the period's actual average target leverage.

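| Period | Strategy ann ret | Strategy ann vol | Strategy Sharpe | Strategy max DD | Strategy Calmar | SPY (1×) ann ret | SPY (1×) Sharpe | SPY (1×) max DD | Avg lev | Lev-matched ann ret | Lev-matched Sharpe | Lev-matched max DD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 88.79% | 35.91% | 1.843 | -29.28% | 3.033 | 15.68% | 0.684 | -33.72% | 2.48× | 27.54% | 0.684 | -68.18% |
| 2016 – 2019 (pre-COVID) | 62.45% | 32.77% | 1.529 | -29.28% | 2.133 | 15.55% | 0.906 | -19.35% | 2.79× | 34.13% | 0.906 | -47.70% |
| 2020 + (COVID onward) | 105.35% | 37.58% | 2.000 | -24.86% | 4.238 | 15.72% | 0.626 | -33.72% | 2.31× | 24.85% | 0.626 | -64.99% |
| 2020 Q1–Q2 (COVID crash & rebound) | 111.27% | 42.68% | 1.869 | -21.11% | 5.272 | -8.80% | -0.069 | -33.72% | 1.48× | -20.36% | -0.069 | -46.86% |
| 2022 (rate-hike bear market) | 12.11% | 24.98% | 0.423 | -16.39% | 0.739 | -18.24% | -0.871 | -24.50% | 0.56× | -8.55% | -0.871 | -13.04% |
| 2023 – 2024 (AI rally) | 133.47% | 37.62% | 2.340 | -24.86% | 5.369 | 25.93% | 1.557 | -9.97% | 2.88× | 72.52% | 1.557 | -28.22% |

Strategy columns are net of cost; leverage-matched columns are for the financed position at the period's average leverage.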

Calmar = annualized return / |max drawdown|. "Insufficient data" appears when a sub-period has fewer than 30 trading days.
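Maximum drawdown and Calmar can be computed from a daily return series as in the sketch below. The helper names `max_drawdown` and `calmar` are hypothetical and the five-day return series is made up. Because the table shows rounded inputs, recomputing Calmar from the printed figures matches only to about the third decimal.

```python
def max_drawdown(returns):
    """Worst peak-to-trough decline of the compounded equity curve (a negative number)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

def calmar(ann_return, max_dd):
    """Calmar = annualized return / |max drawdown|."""
    if max_dd == 0:
        return float("inf")
    return ann_return / abs(max_dd)

# Made-up return path: peak at 1.10, trough at 0.8316, so max DD = -24.4%.
demo = [0.10, -0.20, 0.05, -0.10, 0.30]
print(f"max drawdown: {max_drawdown(demo):.3%}")

# Full-sample figures from the table above (rounded as printed).
print(f"strategy Calmar:      {calmar(0.8879, -0.2928):.3f}")
print(f"static 2.48x Calmar:  {calmar(0.2754, -0.6818):.3f}")
```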

Treynor-Mazuy regressions

The γ column is the timing coefficient. *** = p < 0.01, ** = p < 0.05, * = p < 0.10. t-stats use Newey-West HAC.

Treynor-Mazuy against SPY excess return

| Period | α (ann.) | t-stat | β | t-stat | γ (timing) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.40%*** | (4.58) | 1.162*** | (6.23) | 0.163 | (0.05) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 40.81%*** | (2.64) | 1.830*** | (7.51) | -7.294 | (-0.71) | 50.7% | 910 |
| 2020 + (COVID onward) | 60.25%*** | (4.21) | 1.014*** | (5.43) | 0.477 | (0.15) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 66.34% | (1.13) | 0.352** | (2.40) | 0.731 | (0.58) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 17.59% | (0.83) | 0.409*** | (4.11) | 0.277 | (0.11) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 4.86% | (0.29) | 2.454*** | (14.15) | 20.653* | (1.70) | 70.2% | 501 |

Treynor-Mazuy against leverage-matched benchmark

| Period | α (ann.) | t-stat | β | t-stat | γ (timing) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.40%*** | (4.58) | 0.468*** | (6.23) | 0.027 | (0.05) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 40.81%*** | (2.64) | 0.656*** | (7.51) | -0.936 | (-0.71) | 50.7% | 910 |
| 2020 + (COVID onward) | 60.25%*** | (4.21) | 0.440*** | (5.43) | 0.090 | (0.15) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 66.34% | (1.13) | 0.238** | (2.40) | 0.334 | (0.58) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 17.59% | (0.83) | 0.725*** | (4.11) | 0.868 | (0.11) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 4.86% | (0.29) | 0.852*** | (14.15) | 2.491* | (1.70) | 70.2% | 501 |

Henriksson-Merton regressions

βdown applies when SPY excess return is negative; βup = βdown + γ applies when it's positive. γ > 0 means the strategy leans into up markets harder than down markets.

Henriksson-Merton against SPY excess return

| Period | α (ann.) | t-stat | βdown | βup | γ (β diff) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 36.77%* | (1.86) | 1.083 | 1.248 | 0.165 | (0.64) | 33.7% | 2,517 |
| 2016 – 2019 (pre-COVID) | 22.80% | (0.93) | 1.818 | 1.912 | 0.094 | (0.20) | 50.4% | 910 |
| 2020 + (COVID onward) | 41.29%* | (1.68) | 0.918 | 1.115 | 0.197 | (0.73) | 30.3% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 54.00% | (1.02) | 0.292 | 0.407 | 0.115 | (0.50) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 13.43% | (0.69) | 0.390 | 0.428 | 0.038 | (0.18) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | -16.88% | (-0.78) | 2.075 | 2.800 | 0.725* | (1.92) | 70.0% | 501 |

Henriksson-Merton against leverage-matched benchmark

| Period | α (ann.) | t-stat | βdown | βup | γ (β diff) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 36.77%* | (1.86) | 0.437 | 0.503 | 0.066 | (0.64) | 33.7% | 2,517 |
| 2016 – 2019 (pre-COVID) | 22.80% | (0.93) | 0.651 | 0.685 | 0.034 | (0.20) | 50.4% | 910 |
| 2020 + (COVID onward) | 41.29%* | (1.68) | 0.398 | 0.484 | 0.085 | (0.73) | 30.3% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 54.00% | (1.02) | 0.197 | 0.275 | 0.078 | (0.50) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 13.43% | (0.69) | 0.691 | 0.759 | 0.068 | (0.18) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | -16.88% | (-0.78) | 0.721 | 0.972 | 0.252* | (1.92) | 70.0% | 501 |

CAPM baselines (for reference)

CAPM (linear) against SPY excess return

| Period | α (ann.) | t-stat | β | t-stat | R² | n |
|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.93%*** | (5.54) | 1.162*** | (6.14) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 29.04%** | (2.38) | 1.860*** | (7.44) | 50.4% | 910 |
| 2020 + (COVID onward) | 62.24%*** | (5.01) | 1.013*** | (5.36) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 80.84% | (1.33) | 0.344** | (2.38) | 13.0% | 124 |
| 2022 (rate-hike bear market) | 19.20% | (0.75) | 0.409*** | (4.14) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 39.27%*** | (2.91) | 2.443*** | (13.38) | 69.3% | 501 |

Why α is identical against SPY and against k·SPY

Algebraically: if rlev − rf = k·(rm − rf), then regressing strategy excess return on this rescaled regressor leaves α unchanged and divides β by k. The same logic gives γ → γ/k² for Treynor-Mazuy and γ → γ/k for Henriksson-Merton, with all t-statistics invariant. The leverage-matched regression tables in this report are included for completeness. Their α columns and t-statistics duplicate the SPY tables; only the β and γ point estimates differ in scale.
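The invariance can be checked numerically for the CAPM case with a few lines of plain Python. This is a sketch on synthetic data; `simple_ols` is a hypothetical helper implementing the closed-form one-regressor OLS, and the return parameters are made up.

```python
import random

def simple_ols(x, y):
    """Closed-form OLS of y on a constant and x; returns (alpha, beta)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

random.seed(1)
k = 2.48
mkt = [random.gauss(0.0004, 0.011) for _ in range(2517)]          # market excess return
strat = [0.002 + 1.16 * m + random.gauss(0.0, 0.02) for m in mkt]  # strategy excess return

alpha_spy, beta_spy = simple_ols(mkt, strat)
alpha_lev, beta_lev = simple_ols([k * m for m in mkt], strat)

print(f"alpha vs SPY:   {alpha_spy:.6f}   beta: {beta_spy:.4f}")
print(f"alpha vs k*SPY: {alpha_lev:.6f}   beta: {beta_lev:.4f}")
print(f"beta_spy / k:   {beta_spy / k:.4f}")
```

The two α estimates coincide and the second β equals the first divided by k, mirroring the 1.162 versus 0.468 pair in the full-sample CAPM rows.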

CAPM against leverage-matched benchmark (2.48× SPY, financed)

| Period | α (ann.) | t-stat | β | t-stat | R² | n |
|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.93%*** | (5.54) | 0.468*** | (6.14) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 29.04%** | (2.38) | 0.666*** | (7.44) | 50.4% | 910 |
| 2020 + (COVID onward) | 62.24%*** | (5.01) | 0.439*** | (5.36) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 80.84% | (1.33) | 0.232** | (2.38) | 13.0% | 124 |
| 2022 (rate-hike bear market) | 19.20% | (0.75) | 0.725*** | (4.14) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 39.27%*** | (2.91) | 0.849*** | (13.38) | 69.3% | 501 |

Classical monthly-frequency tests

The Treynor-Mazuy and Henriksson-Merton tests have traditionally been applied to monthly mutual-fund returns. We resample our daily backtest to month-end compounded returns (121 observations) and re-run the regressions. This has lower power than the daily tests, but monthly is historically the standard frequency for these tests and keeps the results comparable with the published mutual-fund literature.
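Month-end compounding can be sketched as below. The helper names `to_monthly` and `months_inclusive` are hypothetical, the three daily returns are made up, and the sketch compounds raw returns; whether the pipeline subtracts the risk-free rate before or after compounding is not stated in this report.

```python
from datetime import date

def to_monthly(dates, daily_returns):
    """Compound daily simple returns into one return per calendar month."""
    growth = {}
    for d, r in zip(dates, daily_returns):
        key = (d.year, d.month)
        growth[key] = growth.get(key, 1.0) * (1.0 + r)
    return {key: g - 1.0 for key, g in sorted(growth.items())}

def months_inclusive(y0, m0, y1, m1):
    """Number of calendar months from (y0, m0) to (y1, m1), both included."""
    return (y1 - y0) * 12 + (m1 - m0) + 1

# Made-up example: two January days and one February day.
demo_dates = [date(2020, 1, 30), date(2020, 1, 31), date(2020, 2, 3)]
demo_rets = [0.01, 0.02, -0.01]
monthly = to_monthly(demo_dates, demo_rets)
print(monthly)   # January compounds to 1.01 * 1.02 - 1

# The backtest window 2016-05 to 2026-05 spans this many monthly observations.
print(months_inclusive(2016, 5, 2026, 5))
```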

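| Model | α (ann.) | t-stat | β | γ | t-stat γ | R² | n months |
|---|---|---|---|---|---|---|---|
| CAPM | 50.81%*** | (4.89) | 1.330 | · | · | 33.1% | 121 |
| Treynor-Mazuy | 38.74%*** | (3.52) | 1.332 | 4.989* | (1.75) | 35.3% | 121 |
| Henriksson-Merton | 26.04%* | (1.89) | βd = 0.728, βu = 1.908 | 1.180** | (2.04) | 35.8% | 121 |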

Monthly excess returns: strategy vs SPY

[Figure: scatter of strategy monthly excess return (y-axis, -19.2% to 41.1%) against SPY monthly excess return (x-axis, -13.5% to 13.0%), with the T-M quadratic fit and the linear CAPM fit (γ = 0).]
Each dot is one month. Curve = Treynor-Mazuy quadratic fit (γ = 4.989). Dashed line = the linear CAPM fit (γ = 0). Upward-curving = positive timing; flat = pure beta; downward-curving = anti-timing.

How to read this

The CAPM line says: "for every 1% the market moves up, the strategy moves β%." Treynor-Mazuy adds a curvature term. If γ is positive and meaningful, the strategy's response curves upward: it gains more than β alone predicts in large up moves and loses less than β alone predicts in large down moves, which is what active market timing produces.

A levered always-on long position would show a straight line with slope = average leverage. A negative γ at the monthly frequency would mean a "short-volatility" payoff profile, earning small amounts most of the time and giving it all back in tail months.

Findings

  • Sharpe gap = +1.16 (strategy 1.84 vs leverage-matched 0.68). Because Sharpe of k·SPY − (k−1)·rf mathematically equals Sharpe of SPY, this gap is the skill-beyond-leverage premium. A pure levered long would deliver 0.68; the strategy delivers 1.84.
  • Annualized return gap = 61.25 percentage points (88.79% strategy vs 27.54% static 2.48× SPY financed). Net of financing cost, a passive 2.48× position would return 27.54% per year; the strategy outperforms it by 61.25 percentage points per year.
  • Drawdown comparison: strategy max DD = -29.28%; static 2.48× SPY max DD = -68.18%. Strategy Calmar = 3.033, static 2.48× Calmar = 0.404. Risk-adjusted by drawdown, the strategy is 7.5× more efficient.
  • CAPM α = 51.93% annualized (p < 0.001, HAC SE). This is the daily-frequency α controlling for SPY beta. It is numerically identical whether benchmarked against raw SPY or k·SPY financed, because α is invariant to rescaling of the regressor. What does change is β: 1.16 vs SPY, much lower than the 2.48× average target leverage. Rescaled against the k·SPY benchmark, β = 0.47, well below 1. This is the vol-targeting fingerprint: the strategy systematically reduces exposure on volatile days, so its daily co-movement with the market is roughly half what a static 2.48× SPY position would produce.
  • Daily T-M γ = 0.163 (t = 0.05, p = 0.962): no detectable daily timing convexity. At daily frequency, the strategy's payoff is approximately linear in the market return (after controlling for β). The skill shows up in α, not in γ.
  • Monthly T-M γ = 4.989 (t = 1.75, p = 0.083): marginal evidence (10% level) of a convex payoff at the monthly horizon. The evidence is weak, but the sign is in the right direction.
  • Monthly Henriksson-Merton γ = 1.180 (t = 2.04, p = 0.043): significant at 5%. βdown = 0.73, βup = 1.91. The strategy carries 1.18 more units of market exposure in up months than in down months. This is the classical signature of positive market-timing skill in the Henriksson-Merton test at its traditional monthly frequency.
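The headline figures in this list follow from the table values by simple arithmetic. The short check below recomputes them from the rounded numbers printed in this report; the variable names are illustrative.

```python
# Rounded inputs as printed in the performance and monthly tables.
strategy_sharpe, matched_sharpe = 1.843, 0.684
strategy_ret, matched_ret = 88.79, 27.54          # annualized, in percent
strategy_calmar, matched_calmar = 3.033, 0.404
hm_beta_down, hm_gamma = 0.728, 1.180             # monthly Henriksson-Merton

sharpe_gap = round(strategy_sharpe - matched_sharpe, 2)       # Sharpe gap
sharpe_multiple = round(strategy_sharpe / matched_sharpe, 2)  # Sharpe multiple
return_gap_pp = round(strategy_ret - matched_ret, 2)          # percentage points per year
calmar_multiple = round(strategy_calmar / matched_calmar, 1)  # drawdown efficiency multiple
hm_beta_up = round(hm_beta_down + hm_gamma, 3)                # beta_up = beta_down + gamma

print(sharpe_gap, sharpe_multiple, return_gap_pp, calmar_multiple, hm_beta_up)
```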

How to read these comparisons

The leverage-matched comparison is the correct one for descriptive statistics (Sharpe, total return, max DD, Calmar). The relevant gap is in the performance table above: strategy Sharpe 1.84 versus static 2.48× SPY Sharpe 0.68.

For regression-based tests, the choice of SPY versus k·SPY does not affect α or the t-statistic on γ. The leverage-matched benchmark is a scaled version of SPY excess return, so OLS rescales β and γ but α and t-statistics are invariant. The question of whether skill exists beyond leverage is therefore answered by the Sharpe gap, not by regressing against k·SPY.

Sub-period stability is the second important check. A strategy with positive γ in calm regimes (2016 to 2019, 2023 to 2024) and negative γ in stress (2020 Q1, 2022) is a short-vol trade in disguise. A strategy whose γ does not collapse through 2020 Q1 and 2022 is exhibiting genuine market timing.

Limitations

  • 10-year sample only. The dataset does not include the 2000 dot-com bust, the 2008 financial crisis, or the 1970s stagflation. Strategy parameters were tuned over many iterations on overlapping data, so a Deflated Sharpe penalty applies and live performance should be expected to be lower than the in-sample numbers shown here.
  • Single-asset backtest. The walk-forward backtest uses SPY only. Live execution maps the same signal across SGOV / SPY / SSO / UPRO via a piecewise blend, which introduces ETF-specific tracking error and intra-day rebalance slippage that this analysis does not capture.
  • Risk-free rate is a flat 4%. Actual Treasury bill yields were well below 4% from 2016 through early 2022 (near zero in 2020 and 2021) and above 5% for much of 2023 and 2024. A time-varying rate (such as the 3-month T-bill from FRED) would shift the leverage-matched benchmark slightly, though not materially, because financing cost on the (k − 1) borrowed dollars is small at the average exposure.
  • Asymptotic inference. Newey-West HAC controls for serial correlation but assumes the autocorrelation structure is well-approximated by the chosen lag length. With ~2,500 daily observations the asymptotics are reliable; with ~120 monthly observations they are weaker.
  • Multiple-testing concern. The strategy has been iterated hundreds of times on this exact dataset. The walk-forward harness mitigates but does not eliminate this. The Deflated Sharpe Ratio in the evaluation harness applies the formal correction.

Replication

All numbers in this report are produced by analysis/timing_tests.py in the nightclaude repository. The full pipeline runs in under a second: load SPY data, run the walk-forward backtest, compute the regressions, and render the HTML. The strategy code is whatever strategy.py contains at the time of the run; the backtest configuration uses StrategyConfig() defaults.

Treynor & Mazuy (1966), "Can Mutual Funds Outguess the Market?", Harvard Business Review 44, 131-136.
Henriksson & Merton (1981), "On Market Timing and Investment Performance. II. Statistical Procedures for Evaluating Forecasting Skills", Journal of Business 54, 513-533.

Report generated 2026-05-23 22:48. Backtest window: 2016-05-18 to 2026-05-22.