Does it actually time the market?
This report applies two classical market-timing regressions to the nightclaude walk-forward backtest: Treynor-Mazuy (1966) and Henriksson-Merton (1981). The strategy is benchmarked against both raw SPY and a leverage-matched SPY position that pays daily financing on the borrowed leg, so leverage and skill can be separated. Sub-period breakdowns isolate the COVID shock, the 2022 bear, and the 2023 to 2024 AI rally.
Data window
The standard academic regime split (pre-2008 / 2009 to 2019 / 2020+) is not feasible with this dataset. The SPY history available to the backtest runs 2016-05-18 to 2026-05-22: no pre-2008 data, and the 2009 to 2019 window is only covered from 2016 onward. Sub-period analysis is therefore restricted to what the backtest can observe: full sample, 2016 to 2019 (pre-COVID), 2020+ (COVID onward), plus three finer regime cuts: 2020 Q1 to Q2 (the actual crash), 2022 rate-hike bear, and 2023 to 2024 AI rally. Pre-2008 conclusions would require data that does not exist in this cache.
Methodology
Three regressions are estimated for each sub-period, on daily excess returns (annual risk-free rate = 4%, matching the evaluation harness). Standard errors are Newey-West HAC with lag selection L = ⌊4·(n/100)^(2/9)⌋ to handle serial correlation in daily returns.
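As a rough illustration of the inference step, the lag rule and a Bartlett-kernel HAC covariance can be sketched in NumPy. The function names `newey_west_lag` and `ols_hac` are hypothetical, and the sketch applies no small-sample correction, so its standard errors may differ slightly from those in the tables.

```python
import math
import numpy as np


def newey_west_lag(n: int) -> int:
    """Truncation lag from the rule L = floor(4 * (n / 100) ** (2 / 9))."""
    return math.floor(4 * (n / 100) ** (2 / 9))


def ols_hac(y, X, lags):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors.

    Assumption: no small-sample degrees-of-freedom correction is applied.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    u = X * resid[:, None]                 # per-observation scores x_t * e_t
    S = u.T @ u                            # lag-0 term
    for lag in range(1, lags + 1):
        w = 1.0 - lag / (lags + 1.0)       # Bartlett weight
        G = u[lag:].T @ u[:-lag]           # lag-`lag` autocovariance of scores
        S += w * (G + G.T)
    bread = np.linalg.inv(X.T @ X)
    cov = bread @ S @ bread                # sandwich estimator
    return beta, np.sqrt(np.diag(cov))
```

For the full daily sample (n = 2,517) the rule gives L = 8; for the 121 monthly observations it gives L = 4.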
CAPM (baseline)
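r_p − r_f = α + β·(r_m − r_f) + ε, where r_p is the strategy return, r_m the SPY return, and r_f the risk-free rate. α is the unconditional excess return after controlling for market exposure.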
Treynor-Mazuy (1966)
r_p − r_f = α + β·(r_m − r_f) + γ·(r_m − r_f)² + ε. γ > 0 means the payoff is convex in the market return: exposure rises when the market is strong and falls when it is weak. This is the classical signature of positive market-timing skill.
Henriksson-Merton (1981)
r_p − r_f = α + β·(r_m − r_f) + γ·max(0, r_m − r_f) + ε. Equivalent to letting beta differ between up and down markets: β_down = β, β_up = β + γ. γ > 0 means higher beta in up markets.
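The three specifications differ only in the third regressor. A minimal sketch of the design matrices follows; the names `timing_design` and `fit_timing` are hypothetical, and the fit is plain OLS point estimates with no HAC inference.

```python
import numpy as np


def timing_design(mkt_excess, model):
    """Design matrix for 'capm', 'tm' (Treynor-Mazuy) or 'hm' (Henriksson-Merton)."""
    x = np.asarray(mkt_excess, dtype=float)
    cols = [np.ones_like(x), x]            # intercept (alpha) and market beta
    if model == "tm":
        cols.append(x ** 2)                # squared market excess return
    elif model == "hm":
        cols.append(np.maximum(x, 0.0))    # up-market excess return only
    elif model != "capm":
        raise ValueError(f"unknown model: {model}")
    return np.column_stack(cols)


def fit_timing(strat_excess, mkt_excess, model):
    """Return OLS coefficients [alpha, beta] or [alpha, beta, gamma]."""
    X = timing_design(mkt_excess, model)
    y = np.asarray(strat_excess, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Synthetic check: a strategy with beta 0.5 in down markets and 1.5 in up markets.
rng = np.random.default_rng(1)
mkt = rng.normal(0.0, 0.01, 1000)
strat = 0.5 * mkt + 1.0 * np.maximum(mkt, 0.0)
alpha, beta_down, gamma = fit_timing(strat, mkt, "hm")
beta_up = beta_down + gamma
```

On this noise-free synthetic series the Henriksson-Merton fit recovers β_down = 0.5 and γ = 1.0, so β_up = 1.5. The daily α would still need annualizing (for example by multiplying by 252, an assumed convention) to compare with the tables.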
Leverage-matched benchmark
Average target exposure to SPY is 2.48×. The corresponding passive
comparator is
r_lev = 2.48·r_SPY − (2.48 − 1)·r_f:
a static 2.48× SPY position that pays daily financing on the 1.48
borrowed dollars. For descriptive statistics (Sharpe, total return, max DD,
Calmar) this is the appropriate comparator and the skill-beyond-leverage premium is
visible directly. For the regression α, this benchmark is mathematically
equivalent to SPY: rescaling the regressor by k leaves α unchanged and divides β by k.
The Sharpe gap, not the regression α, answers the question of whether skill exists
beyond leverage.
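A short sketch of the comparator shows why its Sharpe ratio matches SPY's. It assumes 252 trading days per year, simple daily accrual of the 4% rate, and a Sharpe ratio computed from arithmetic daily excess returns; the helper names are hypothetical.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def leverage_matched(spy_returns, k, rf_annual=0.04):
    """Daily returns of a static k-times SPY position financed at rf."""
    rf_daily = rf_annual / TRADING_DAYS
    return [k * r - (k - 1.0) * rf_daily for r in spy_returns]


def sharpe(returns, rf_annual=0.04):
    """Annualized Sharpe ratio from daily returns (arithmetic excess returns)."""
    rf_daily = rf_annual / TRADING_DAYS
    excess = [r - rf_daily for r in returns]
    return math.sqrt(TRADING_DAYS) * statistics.fmean(excess) / statistics.stdev(excess)


# Toy SPY returns, purely illustrative.
spy = [0.012, -0.008, 0.004, 0.0, -0.015, 0.02, 0.003, -0.001]
lev = leverage_matched(spy, k=2.48)
```

Because r_lev − r_f = k·(r_SPY − r_f), both the mean and the standard deviation of the excess return scale by k, so `sharpe(lev)` equals `sharpe(spy)` up to floating-point error.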
Equity curve, drawdowns, exposure
Figure panels: cumulative growth of $1 (log scale); drawdown comparison; target daily exposure (0× to 4× SPY).
Target exposure is capped at max_position_pct = 4.0 by the position sizer. Days at the 4× cap: 1,200 (47.7% of days).
Performance summary
All strategy metrics are net of transaction costs (commission $0.005/share, spread 5 bps, slippage 3 bps, applied on every rebalance). The leverage-matched benchmark uses the period's actual average target leverage.
| Period | Strategy ann ret (net of cost) | Strategy ann vol | Strategy Sharpe | Strategy max DD | Strategy Calmar | SPY (1×) ann ret | SPY Sharpe | SPY max DD | avg lev | Leverage-matched (financed) ann ret | Leverage-matched Sharpe | Leverage-matched max DD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 88.79% | 35.91% | 1.843 | -29.28% | 3.033 | 15.68% | 0.684 | -33.72% | 2.48× | 27.54% | 0.684 | -68.18% |
| 2016 – 2019 (pre-COVID) | 62.45% | 32.77% | 1.529 | -29.28% | 2.133 | 15.55% | 0.906 | -19.35% | 2.79× | 34.13% | 0.906 | -47.70% |
| 2020 + (COVID onward) | 105.35% | 37.58% | 2.000 | -24.86% | 4.238 | 15.72% | 0.626 | -33.72% | 2.31× | 24.85% | 0.626 | -64.99% |
| 2020 Q1–Q2 (COVID crash & rebound) | 111.27% | 42.68% | 1.869 | -21.11% | 5.272 | -8.80% | -0.069 | -33.72% | 1.48× | -20.36% | -0.069 | -46.86% |
| 2022 (rate-hike bear market) | 12.11% | 24.98% | 0.423 | -16.39% | 0.739 | -18.24% | -0.871 | -24.50% | 0.56× | -8.55% | -0.871 | -13.04% |
| 2023 – 2024 (AI rally) | 133.47% | 37.62% | 2.340 | -24.86% | 5.369 | 25.93% | 1.557 | -9.97% | 2.88× | 72.52% | 1.557 | -28.22% |
Calmar = annualized return / |max drawdown|. "Insufficient data" appears when a sub-period has fewer than 30 trading days.
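The drawdown and Calmar definitions can be sketched as follows. The function names are hypothetical, and the drawdown is measured on a compounded equity curve starting at 1, which is an assumption about how the harness computes it.

```python
def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (negative number)."""
    equity = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def calmar(ann_return, max_dd):
    """Calmar ratio: annualized return divided by |max drawdown|."""
    return ann_return / abs(max_dd)


# Toy path: +10%, then -50%, then +20% gives a 50% drawdown from the peak.
dd = max_drawdown([0.10, -0.50, 0.20])

# Full-sample strategy row, using the rounded table values.
full_sample_calmar = calmar(0.8879, -0.2928)
```

With the rounded table inputs the full-sample Calmar comes out near 3.03, in line with the 3.033 reported from unrounded values.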
Treynor-Mazuy regressions
The γ column is the timing coefficient. *** = p < 0.01, ** = p < 0.05, * = p < 0.10. t-stats use Newey-West HAC.
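The star convention can be expressed as a small helper. This sketch converts a t-statistic to a two-sided p-value with a normal approximation, which is an assumption; the report's own p-values may come from a t-distribution and differ slightly.

```python
import math


def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


def stars(p_value):
    """Significance markers: *** p < 0.01, ** p < 0.05, * p < 0.10."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""


# t-statistics taken from the full-sample and 2023-2024 Treynor-Mazuy rows.
labels = {t: stars(two_sided_p(t)) for t in (4.58, 1.70, 0.05)}
```

Here t = 4.58 maps to three stars, t = 1.70 to one star, and t = 0.05 to none, matching the markers in the table below.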
Treynor-Mazuy against SPY excess return
| Period | α (ann.) | t-stat | β | t-stat | γ (timing) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.40%*** | (4.58) | 1.162*** | (6.23) | 0.163 | (0.05) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 40.81%*** | (2.64) | 1.830*** | (7.51) | -7.294 | (-0.71) | 50.7% | 910 |
| 2020 + (COVID onward) | 60.25%*** | (4.21) | 1.014*** | (5.43) | 0.477 | (0.15) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 66.34% | (1.13) | 0.352** | (2.40) | 0.731 | (0.58) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 17.59% | (0.83) | 0.409*** | (4.11) | 0.277 | (0.11) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 4.86% | (0.29) | 2.454*** | (14.15) | 20.653* | (1.70) | 70.2% | 501 |
Treynor-Mazuy against leverage-matched benchmark
| Period | α (ann.) | t-stat | β | t-stat | γ (timing) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.40%*** | (4.58) | 0.468*** | (6.23) | 0.027 | (0.05) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 40.81%*** | (2.64) | 0.656*** | (7.51) | -0.936 | (-0.71) | 50.7% | 910 |
| 2020 + (COVID onward) | 60.25%*** | (4.21) | 0.440*** | (5.43) | 0.090 | (0.15) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 66.34% | (1.13) | 0.238** | (2.40) | 0.334 | (0.58) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 17.59% | (0.83) | 0.725*** | (4.11) | 0.868 | (0.11) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 4.86% | (0.29) | 0.852*** | (14.15) | 2.491* | (1.70) | 70.2% | 501 |
Henriksson-Merton regressions
β_down applies when SPY excess return is negative; β_up = β_down + γ applies when it is positive. γ > 0 means the strategy leans into up markets harder than down markets.
Henriksson-Merton against SPY excess return
| Period | α (ann.) | t-stat | β_down | β_up | γ (β diff) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 36.77%* | (1.86) | 1.083 | 1.248 | 0.165 | (0.64) | 33.7% | 2,517 |
| 2016 – 2019 (pre-COVID) | 22.80% | (0.93) | 1.818 | 1.912 | 0.094 | (0.20) | 50.4% | 910 |
| 2020 + (COVID onward) | 41.29%* | (1.68) | 0.918 | 1.115 | 0.197 | (0.73) | 30.3% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 54.00% | (1.02) | 0.292 | 0.407 | 0.115 | (0.50) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 13.43% | (0.69) | 0.390 | 0.428 | 0.038 | (0.18) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | -16.88% | (-0.78) | 2.075 | 2.800 | 0.725* | (1.92) | 70.0% | 501 |
Henriksson-Merton against leverage-matched benchmark
| Period | α (ann.) | t-stat | β_down | β_up | γ (β diff) | t-stat | R² | n |
|---|---|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 36.77%* | (1.86) | 0.437 | 0.503 | 0.066 | (0.64) | 33.7% | 2,517 |
| 2016 – 2019 (pre-COVID) | 22.80% | (0.93) | 0.651 | 0.685 | 0.034 | (0.20) | 50.4% | 910 |
| 2020 + (COVID onward) | 41.29%* | (1.68) | 0.398 | 0.484 | 0.085 | (0.73) | 30.3% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 54.00% | (1.02) | 0.197 | 0.275 | 0.078 | (0.50) | 13.2% | 124 |
| 2022 (rate-hike bear market) | 13.43% | (0.69) | 0.691 | 0.759 | 0.068 | (0.18) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | -16.88% | (-0.78) | 0.721 | 0.972 | 0.252* | (1.92) | 70.0% | 501 |
CAPM baselines (for reference)
CAPM (linear) against SPY excess return
| Period | α (ann.) | t-stat | β | t-stat | R² | n |
|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.93%*** | (5.54) | 1.162*** | (6.14) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 29.04%** | (2.38) | 1.860*** | (7.44) | 50.4% | 910 |
| 2020 + (COVID onward) | 62.24%*** | (5.01) | 1.013*** | (5.36) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 80.84% | (1.33) | 0.344** | (2.38) | 13.0% | 124 |
| 2022 (rate-hike bear market) | 19.20% | (0.75) | 0.409*** | (4.14) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 39.27%*** | (2.91) | 2.443*** | (13.38) | 69.3% | 501 |
Why α is identical against SPY and against k·SPY
Algebraically: if r_lev − r_f = k·(r_m − r_f), then regressing strategy excess return on this rescaled regressor leaves α unchanged and divides β by k. The same logic gives γ → γ/k² for Treynor-Mazuy and γ → γ/k for Henriksson-Merton, with all t-statistics invariant. The leverage-matched regression tables in this report are included for completeness. Their α column and γ t-statistics duplicate the SPY tables; only the β and γ point estimates differ in scale.
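The rescaling argument can be checked numerically on synthetic data. The sketch below uses plain OLS point estimates and made-up return series; `fit_timing` and `rescaling_check` are hypothetical names.

```python
import numpy as np


def fit_timing(y, x, model):
    """OLS fit returning (alpha, beta, gamma); model is 'tm' or 'hm'."""
    third = x ** 2 if model == "tm" else np.maximum(x, 0.0)
    X = np.column_stack([np.ones_like(x), x, third])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def rescaling_check(y, x, k, model):
    """Compare coefficients against x with coefficients against k * x."""
    power = 2 if model == "tm" else 1      # gamma scales by k**2 (TM) or k (HM)
    a, b, g = fit_timing(y, x, model)
    a_k, b_k, g_k = fit_timing(y, k * x, model)
    return {
        "alpha_same": bool(np.isclose(a, a_k, rtol=1e-6, atol=1e-9)),
        "beta_div_k": bool(np.isclose(b / k, b_k, rtol=1e-6)),
        "gamma_scaled": bool(np.isclose(g / k ** power, g_k, rtol=1e-6)),
    }


rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.01, 2000)            # synthetic market excess returns
y = 0.0005 + 1.2 * x + 3.0 * x ** 2 + rng.normal(0.0, 0.005, 2000)
results = {m: rescaling_check(y, x, 2.48, m) for m in ("tm", "hm")}
```

All three checks hold for both models: α is unchanged, β is divided by k, and γ is divided by k² (Treynor-Mazuy) or k (Henriksson-Merton).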
CAPM against leverage-matched benchmark (2.48× SPY, financed)
| Period | α (ann.) | t-stat | β | t-stat | R² | n |
|---|---|---|---|---|---|---|
| Full sample (2016-05 → 2026-05) | 51.93%*** | (5.54) | 0.468*** | (6.14) | 33.6% | 2,517 |
| 2016 – 2019 (pre-COVID) | 29.04%** | (2.38) | 0.666*** | (7.44) | 50.4% | 910 |
| 2020 + (COVID onward) | 62.24%*** | (5.01) | 0.439*** | (5.36) | 30.2% | 1,606 |
| 2020 Q1–Q2 (COVID crash & rebound) | 80.84% | (1.33) | 0.232** | (2.38) | 13.0% | 124 |
| 2022 (rate-hike bear market) | 19.20% | (0.75) | 0.725*** | (4.14) | 15.8% | 251 |
| 2023 – 2024 (AI rally) | 39.27%*** | (2.91) | 0.849*** | (13.38) | 69.3% | 501 |
Classical monthly-frequency tests
Applications of Treynor-Mazuy (1966) and Henriksson-Merton (1981) in the mutual-fund literature typically use monthly returns. We resample our daily backtest to month-end compounded returns (121 observations) and re-run the regressions. The monthly tests have lower power than the daily ones, but they keep the results comparable with the published mutual-fund literature.
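The resampling step amounts to compounding daily returns within each calendar month. A minimal standard-library sketch is shown below; `to_monthly` is a hypothetical name, and the conversion of the 4% annual rate to a monthly risk-free rate is left out because the report does not state the convention.

```python
from datetime import date


def to_monthly(dates, daily_returns):
    """Compound daily returns into one return per (year, month)."""
    growth = {}
    for d, r in zip(dates, daily_returns):
        key = (d.year, d.month)
        growth[key] = growth.get(key, 1.0) * (1.0 + r)
    return {key: g - 1.0 for key, g in growth.items()}


# Toy example: two January days and one February day.
dates = [date(2020, 1, 30), date(2020, 1, 31), date(2020, 2, 3)]
monthly = to_monthly(dates, [0.01, 0.02, -0.01])
```

In the toy example January compounds to (1.01 × 1.02) − 1 = 3.02% and February is −1%. The backtest window, 2016-05 through 2026-05, spans 121 calendar months, which matches the observation count in the table.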
| Model | α (ann.) | t-stat | β | γ | t-stat γ | R² | n months |
|---|---|---|---|---|---|---|---|
| CAPM | 50.81%*** | (4.89) | 1.330 | · | · | 33.1% | 121 |
| Treynor-Mazuy | 38.74%*** | (3.52) | 1.332 | 4.989* | (1.75) | 35.3% | 121 |
| Henriksson-Merton | 26.04%* | (1.89) | β_down=0.728, β_up=1.908 | 1.180** | (2.04) | 35.8% | 121 |
Monthly excess returns: strategy vs SPY
How to read this
The CAPM line says: "for every 1% the market moves up, the strategy moves β%." Treynor-Mazuy adds a curvature term. If γ is positive and meaningful, the response curve bends upward: the slope rises with the market return, so the strategy gains more in strong markets and loses less in weak ones. This is exactly what active market timing produces.
A levered always-on long position would show a straight line with slope = average leverage. A negative γ at the monthly frequency would mean a "short-volatility" payoff profile, earning small amounts most of the time and giving it all back in tail months.
Findings
- Sharpe gap = +1.16 (strategy 1.84 vs leverage-matched 0.68). Because the Sharpe ratio of k·SPY − (k−1)·r_f mathematically equals the Sharpe ratio of SPY, this gap is the skill-beyond-leverage premium. A pure levered long would deliver 0.68; the strategy delivers 1.84.
- Annualized return gap = 61.25 percentage points (88.79% strategy vs 27.54% static 2.48× SPY financed). Even net of financing cost, a passive 2.48× position would return 27.54% per year; the strategy outperforms it by 61.25 percentage points per year.
- Drawdown comparison: strategy max DD = -29.28%; static 2.48× SPY max DD = -68.18%. Strategy Calmar = 3.033, static 2.48× Calmar = 0.404. Risk-adjusted by drawdown, the strategy is 7.5× more efficient.
- CAPM α = 51.93% annualized (p < 0.001, HAC SE). This is the daily-frequency α controlling for SPY beta. It is numerically identical whether benchmarked against raw SPY or financed k·SPY, because α is invariant to rescaling the regressor. What does change is β: 1.16 against SPY, far below the 2.48× average target leverage. Rescaled against the k·SPY benchmark, β = 0.47, well below 1. This is the vol-targeting fingerprint: the strategy systematically reduces exposure on volatile days, so its daily co-movement with the market is roughly half what static 2.48× SPY would produce.
- Daily T-M γ = 0.163 (t = 0.05, p = 0.962): no detectable daily timing convexity. At daily frequency, the strategy's payoff is approximately linear in the market return (after controlling for β). The skill shows up in α, not in γ.
- Monthly T-M γ = 4.989 (t = 1.75, p = 0.083): marginal evidence (10% level) of a convex payoff at the monthly horizon. The evidence is weak but points in the right direction.
- Monthly Henriksson-Merton γ = 1.180 (t = 2.04, p = 0.043), significant at 5%. β_down = 0.73, β_up = 1.91. The strategy carries 1.18 more units of market exposure in up months than in down months. This is the classical signature of positive market-timing skill under the Henriksson-Merton test at monthly frequency.
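The headline gaps above follow directly from the full-sample rows of the performance and CAPM tables. The arithmetic can be reproduced from the rounded table values; the variable names are illustrative only.

```python
# Full-sample values copied from the tables (rounded as published).
strategy = {"sharpe": 1.843, "ann_ret": 88.79, "calmar": 3.033}
lev_matched = {"sharpe": 0.684, "ann_ret": 27.54, "max_dd": -68.18}
capm_beta_vs_spy = 1.162
avg_leverage = 2.48

sharpe_gap = strategy["sharpe"] - lev_matched["sharpe"]
return_gap_pp = strategy["ann_ret"] - lev_matched["ann_ret"]        # percentage points
lev_calmar = lev_matched["ann_ret"] / abs(lev_matched["max_dd"])
calmar_ratio = strategy["calmar"] / round(lev_calmar, 3)
beta_vs_lev = capm_beta_vs_spy / avg_leverage

summary = {
    "sharpe_gap": round(sharpe_gap, 2),
    "return_gap_pp": round(return_gap_pp, 2),
    "lev_calmar": round(lev_calmar, 3),
    "calmar_ratio": round(calmar_ratio, 1),
    "beta_vs_lev": round(beta_vs_lev, 2),
}
```

The rounded results reproduce the figures quoted in the findings: a Sharpe gap of 1.16, a return gap of 61.25 points, a leverage-matched Calmar of 0.404, a Calmar ratio of 7.5×, and a rescaled β of 0.47.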
How to read these comparisons
The leverage-matched comparison is the correct one for descriptive statistics (Sharpe, total return, max DD, Calmar). The relevant gap is in the performance table above: strategy Sharpe 1.84 versus static 2.48× SPY Sharpe 0.68.
For regression-based tests, the choice of SPY versus k·SPY does not affect α or the t-statistic on γ. The leverage-matched benchmark is a scaled version of SPY excess return, so OLS rescales β and γ but α and t-statistics are invariant. The question of whether skill exists beyond leverage is therefore answered by the Sharpe gap, not by regressing against k·SPY.
Sub-period stability is the second important check. A strategy with positive γ in calm regimes (2016 to 2019, 2023 to 2024) and negative γ in stress (2020 Q1 to Q2, 2022) is a short-vol trade in disguise. A strategy whose γ does not collapse through 2020 Q1 to Q2 and 2022 is exhibiting genuine market timing.
Limitations
- 10-year sample only. The dataset does not include the 2000 dot-com bust, the 2008 financial crisis, or the 1970s stagflation. Strategy parameters were tuned over many iterations on overlapping data, so a Deflated Sharpe penalty applies and live performance should be expected to be lower than the in-sample numbers shown here.
- Single-asset backtest. The walk-forward backtest uses SPY only. Live execution maps the same signal across SGOV / SPY / SSO / UPRO via a piecewise blend, which introduces ETF-specific tracking error and intra-day rebalance slippage that this analysis does not capture.
- Risk-free rate is a flat 4%. Actual 3-month Treasury bill yields were below 2.5% from 2016 through early 2022 (near zero in 2020 to 2021) and around 5% from mid-2023 through mid-2024. A time-varying rate (such as the 3-month T-bill from FRED) would shift the leverage-matched benchmark slightly, though not materially, because financing cost on the (k − 1) borrowed dollars is small at the average exposure.
- Asymptotic inference. Newey-West HAC controls for serial correlation but assumes the autocorrelation structure is well-approximated by the chosen lag length. With ~2,500 daily observations the asymptotics are reliable; with ~120 monthly observations they are weaker.
- Multiple-testing concern. The strategy has been iterated hundreds of times on this exact dataset. The walk-forward harness mitigates but does not eliminate this. The Deflated Sharpe Ratio in the evaluation harness applies the formal correction.
Replication
All numbers in this report are produced by analysis/timing_tests.py in the
nightclaude repository. The full pipeline runs in under a second: load SPY data, run
the walk-forward backtest, compute the regressions, and render the HTML. The strategy
code is whatever strategy.py contains at the time of the run; the backtest
configuration uses StrategyConfig() defaults.
Treynor & Mazuy (1966), "Can Mutual Funds Outguess the Market?", Harvard Business Review 44, 131-136.
Henriksson & Merton (1981), "On Market Timing and Investment Performance. II. Statistical Procedures for Evaluating Forecasting Skills", Journal of Business 54, 513-533.
Report generated 2026-05-23 22:48. Backtest window: 2016-05-18 to 2026-05-22.