Return is observed. Robustness must be investigated.
A historical result is conditional on one market path, one model specification, one implementation model, and one sequence of research decisions. Alter those conditions and the conclusion may strengthen, weaken, or disappear. The purpose of validation is not to protect a favorable result from such perturbations. It is to expose the result to them.
Atamus Capital does not publish proprietary strategy rules, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, investment universes, portfolio construction methods, execution processes, or position-level information. This note examines the mathematics of validation through closed-form calculations, deterministic numerical experiments, and controlled Monte Carlo experiments under fully specified probability laws. No internal threshold, sequence, weighting rule, holding period, implementation assumption, or model-development workflow is disclosed. The analytical dimensions used below are a public taxonomy for discussing validation, not a description of Atamus Capital's internal research architecture.
Historical return is a realization produced by a particular sample, specification, research path, and implementation model. Robustness concerns whether the inference supported by that realization survives admissible changes to those conditions. We study systematic strategy evaluation as a problem of selection-adjusted inference, dependence, finite-sample uncertainty, specification geometry, temporal instability, implementation error, and path-dependent loss. Analytical calculations show how model selection can create apparently strong performance under a zero-skill null. A 100,000-replication Monte Carlo experiment estimates that, when moderate serial dependence is ignored, a nominal 95 percent IID interval covers the true mean only about 84.9 percent of the time. A second 100,000-path Monte Carlo experiment shows why one historical maximum drawdown should not be treated as a stable property of a strategy. We conclude that robustness is not a scalar score and not a synonym for profitability. It is a vector of conditional claims, each of which must be stated, tested, and governed separately.
1. Return is conditional
Let a systematic strategy be denoted by \(S\). A reported performance statistic is not generated by \(S\) alone. It is produced by a complete evaluation state:
where
The same strategy evaluated under a perturbed state \(\delta\) produces
Robustness is therefore not the value of \(\widehat{\Psi}_0\). It concerns the probability law induced by admissible perturbations. Define \(g(\delta)=\widehat{\Psi}_{\delta}\). For every Borel set \(B\subseteq\mathbb R\),
Here, \(Q\) describes a defensible probability distribution over admissible perturbations. Two summaries of the induced distribution are useful:
and
The first is a survival probability relative to an evidentiary threshold \(\tau\), with \(\Psi\) oriented so that larger values are preferred. A loss functional can be handled by reversing the inequality or changing its sign. For a small value of \(\alpha\), the second is a lower quantile of performance across perturbations. Neither quantity is meaningful until the perturbation law \(Q\), the performance functional \(\Psi\), and the threshold \(\tau\) are justified.
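The two summaries described in the preceding paragraph can be written compactly. The following is one formalization consistent with the verbal definitions; the labels \(\pi_{\tau}\) and \(q_{\alpha}\) are assumptions introduced here for illustration and are not claimed to be the note's own notation:

\[
\pi_{\tau}=Q\bigl(\{\delta:\widehat{\Psi}_{\delta}\geq\tau\}\bigr),
\qquad
q_{\alpha}=\inf\bigl\{x\in\mathbb R:\;Q\bigl(\{\delta:\widehat{\Psi}_{\delta}\leq x\}\bigr)\geq\alpha\bigr\}.
\]

Under this reading, \(\pi_{\tau}\) is the \(Q\)-mass of perturbations under which the statistic still clears the threshold, and \(q_{\alpha}\) is the level that the statistic falls to or below with probability at least \(\alpha\) across perturbations. Both change whenever \(Q\) changes, which is why the perturbation law has to be justified before either number is reported.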
This point is fundamental. A researcher can manufacture apparent robustness by choosing perturbations that are too small, too convenient, or too similar to the selected specification. Robustness is not created by running many tests. It is created by exposing a claim to alternatives that could reasonably have changed the conclusion.
For the purposes of this note, robustness can be represented as a multidimensional object:
The components refer to robustness against research selection, statistical dependence, specification changes, temporal instability, implementation uncertainty, and path-dependent loss. A favorable result in one component does not establish robustness in another. This taxonomy is conceptual and is not a map of Atamus Capital's internal validation or model-development workflow.
2. Selection changes the distribution of the result
Suppose a researcher examines \(m\) candidate models. For the baseline calculation, let
The reported candidate is often the one with the largest estimate:
Under the complete null,
we still have
The act of searching changes the probability law of the reported statistic. This is not a philosophical objection to model selection. It is a mathematical consequence of conditioning on an extreme order statistic.
Proposition 1. Selection inflation under equicorrelated Gaussian noise
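Let
\[
Z_j=\sqrt{\rho}\,F+\sqrt{1-\rho}\,\varepsilon_j,\qquad j=1,\ldots,m,
\]
where \(F,\varepsilon_1,\ldots,\varepsilon_m\) are independent standard normal variables and \(0\leq\rho<1\). Then
\[
\mathbb E\Bigl[\max_{1\leq j\leq m}Z_j\Bigr]=\sqrt{1-\rho}\;\mathbb E\Bigl[\max_{1\leq j\leq m}\varepsilon_j\Bigr].
\]
Proof. The common factor is identical across all candidates, so
\[
\max_{1\leq j\leq m}Z_j=\sqrt{\rho}\,F+\sqrt{1-\rho}\,\max_{1\leq j\leq m}\varepsilon_j.
\]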
Taking expectations and using \(\mathbb E[F]=0\) gives the result.
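For independent standard normal variables,
\[
\mathbb E\Bigl[\max_{1\leq j\leq m}\varepsilon_j\Bigr]=m\int_{-\infty}^{\infty}x\,\phi(x)\,\Phi(x)^{m-1}\,dx,
\]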
where \(\phi\) and \(\Phi\) denote the standard normal density and distribution functions.
To translate this into a familiar performance statistic, consider five years of daily observations, \(n=1{,}260\), and an annualization constant \(A=252\). In a Gaussian zero-skill surrogate with known unit daily volatility, let \(\bar X_j\) be the sample mean and write \(Z_j=\sqrt n\,\bar X_j\). The known-volatility annualized Sharpe estimator is then
\[
\widehat{\mathrm{SR}}_j=\sqrt{A}\,\bar X_j=\sqrt{\frac{A}{n}}\,Z_j .
\]
The construction is deliberately narrow. It isolates selection from variance estimation and every other source of error. A full Sharpe-ratio inference problem also depends on serial correlation, variance estimation, and higher moments.[5]
For independent candidates, the expected selected annualized Sharpe is 0.835 after 20 trials, 1.121 after 100 trials, and 1.358 after 500 trials. Every candidate has true Sharpe zero. With candidate correlation \(\rho=0.50\), the corresponding values are lower, but still material: 0.591, 0.793, and 0.960.
A related calculation concerns false discoveries. If each of \(m\) independent null hypotheses is tested at one-sided level \(\alpha\), then
\[
\Pr(\text{at least one false rejection})=1-(1-\alpha)^m .
\]
At \(\alpha=0.05\), the probability of at least one false rejection is 64.15 percent after 20 independent trials and 99.41 percent after 100 trials. Positive equicorrelation reduces the effective breadth of the search relative to independence, but it does not remove the problem. Under the equicorrelated model, the exact probability is
\[
\Pr(\text{at least one false rejection})=1-\int_{-\infty}^{\infty}\Phi\!\left(\frac{c_{\alpha}-\sqrt{\rho}\,f}{\sqrt{1-\rho}}\right)^{m}\phi(f)\,df,
\]
where \(c_{\alpha}=\Phi^{-1}(1-\alpha)\). With \(\rho=0.50\), the familywise false-rejection probabilities for 20, 100, and 500 trials are 33.95 percent, 56.38 percent, and 74.68 percent.
| Candidate models | Expected selected Sharpe, \(\rho=0\) | Familywise error rate (FWER), \(\rho=0\) | Expected selected Sharpe, \(\rho=0.50\) | FWER, \(\rho=0.50\) |
|---|---|---|---|---|
| 20 | 0.835 | 64.15% | 0.591 | 33.95% |
| 100 | 1.121 | 99.41% | 0.793 | 56.38% |
| 500 | 1.358 | 100.00% when rounded | 0.960 | 74.68% |
Underlying data for all candidate counts and correlations:
| Candidate count | Correlation | Expected selected Sharpe | Familywise probability |
|---|---|---|---|
| 1 | 0.00 | 0.000000 | 5.0000% |
| 2 | 0.00 | 0.252313 | 9.7500% |
| 5 | 0.00 | 0.520094 | 22.6219% |
| 10 | 0.00 | 0.688151 | 40.1263% |
| 20 | 0.00 | 0.835160 | 64.1514% |
| 50 | 0.00 | 1.005816 | 92.3055% |
| 100 | 0.00 | 1.121430 | 99.4079% |
| 200 | 0.00 | 1.228068 | 99.9965% |
| 500 | 0.00 | 1.358053 | 100.0000% |
| 1 | 0.25 | 0.000000 | 5.0000% |
| 2 | 0.25 | 0.218510 | 9.3857% |
| 5 | 0.25 | 0.450414 | 19.9426% |
| 10 | 0.25 | 0.595956 | 32.2546% |
| 20 | 0.25 | 0.723270 | 47.3737% |
| 50 | 0.25 | 0.871062 | 67.6278% |
| 100 | 0.25 | 0.971187 | 80.0832% |
| 200 | 0.25 | 1.063538 | 88.8723% |
| 500 | 0.25 | 1.176109 | 95.4893% |
| 1 | 0.50 | 0.000000 | 5.0000% |
| 2 | 0.50 | 0.178412 | 8.7811% |
| 5 | 0.50 | 0.367762 | 16.6327% |
| 10 | 0.50 | 0.486596 | 24.6621% |
| 20 | 0.50 | 0.590547 | 33.9489% |
| 50 | 0.50 | 0.711220 | 46.9199% |
| 100 | 0.50 | 0.792971 | 56.3827% |
| 200 | 0.50 | 0.868375 | 65.0025% |
| 500 | 0.50 | 0.960289 | 74.6751% |
| 1 | 0.75 | 0.000000 | 5.0000% |
| 2 | 0.75 | 0.126157 | 7.7990% |
| 5 | 0.75 | 0.260047 | 12.6116% |
| 10 | 0.75 | 0.344076 | 16.9049% |
| 20 | 0.75 | 0.417580 | 21.5689% |
| 50 | 0.75 | 0.502908 | 28.0411% |
| 100 | 0.75 | 0.560715 | 33.0105% |
| 200 | 0.75 | 0.614034 | 37.9390% |
| 500 | 0.75 | 0.679027 | 44.2755% |
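The tabulated values can be checked with a few lines of numerical quadrature. The following is a minimal sketch under the stated Gaussian zero-skill surrogate, not the accompanying reproducibility package: the function names are hypothetical, and the finite integration limits of \(\pm 10\) are an assumption of this sketch (the integrands are negligible beyond them).

```python
import numpy as np
from scipy import integrate, stats


def expected_max_std_normal(m: int) -> float:
    """E[max of m independent N(0,1)] = m * int x phi(x) Phi(x)^(m-1) dx."""
    def integrand(x):
        return m * x * stats.norm.pdf(x) * stats.norm.cdf(x) ** (m - 1)
    # Assumption: truncating to [-10, 10] loses a negligible amount of mass.
    value, _ = integrate.quad(integrand, -10.0, 10.0, limit=200)
    return value


def expected_selected_sharpe(m: int, rho: float, n: int = 1260, A: int = 252) -> float:
    """Expected annualized Sharpe of the best of m zero-skill candidates."""
    return np.sqrt(A / n) * np.sqrt(1.0 - rho) * expected_max_std_normal(m)


def familywise_probability(m: int, rho: float, alpha: float = 0.05) -> float:
    """P(at least one one-sided rejection) under equicorrelated Gaussian nulls."""
    c = stats.norm.ppf(1.0 - alpha)
    if rho == 0.0:
        return 1.0 - (1.0 - alpha) ** m
    def integrand(f):
        # Conditional on the common factor f, the candidates are independent.
        z = (c - np.sqrt(rho) * f) / np.sqrt(1.0 - rho)
        return stats.norm.cdf(z) ** m * stats.norm.pdf(f)
    no_rejection, _ = integrate.quad(integrand, -10.0, 10.0, limit=200)
    return 1.0 - no_rejection


if __name__ == "__main__":
    for m in (20, 100, 500):
        for rho in (0.0, 0.5):
            print(m, rho,
                  round(expected_selected_sharpe(m, rho), 3),
                  round(100 * familywise_probability(m, rho), 2))
```

Conditioning on the common factor is what makes the equicorrelated case tractable: given \(F\), the candidates are independent, so a one-dimensional integral replaces an \(m\)-dimensional one.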
White's Reality Check, Hansen's test for superior predictive ability, false-discovery controls, and selection-adjusted performance statistics address different versions of this problem.[1][2][3][4] No single procedure is universally correct. The appropriate method depends on the estimand, the benchmark, the dependence among candidates, and what information about the research search has been recorded.
The operational lesson is narrower and more demanding:
A performance statistic cannot be interpreted without the process that selected it.
A research record that preserves only the winner discards information required to adjust inference for the search that produced it.
3. Calendar length is not information length
The number of rows in a dataset is not the number of independent observations it contains.
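Let \(X_t\) be covariance-stationary with variance \(\gamma_0\) and autocorrelation sequence \(\rho_k=\gamma_k/\gamma_0\). Then
\[
\operatorname{Var}(\bar X_n)=\frac{\gamma_0}{n}\left[1+2\sum_{k=1}^{n-1}\left(1-\frac{k}{n}\right)\rho_k\right].
\]
For inference on the sample mean, this motivates the variance-equivalent effective sample size
\[
n_{\mathrm{eff}}=\frac{n}{1+2\sum_{k=1}^{n-1}\left(1-\frac{k}{n}\right)\rho_k}.
\]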
This is not a universal information count. It is the number of independent observations with marginal variance \(\gamma_0\) that would give the same variance for \(\bar X_n\). A different estimand can have a different effective sample size.
For an AR(1) process with \(\rho_k=\varphi^k\), the large-sample approximation is
\[
n_{\mathrm{eff}}\approx n\,\frac{1-\varphi}{1+\varphi}.
\]
Consider a nominal five-year daily sample with \(n=1{,}260\). The exact finite-sample calculation gives:
| \(\varphi\) | \(n_{\mathrm{eff}}\) for the mean | Effective years for the mean | Standard-error inflation |
|---|---|---|---|
| 0.10 | 1,031.1 | 4.09 | 1.11 |
| 0.25 | 756.3 | 3.00 | 1.29 |
| 0.50 | 420.4 | 1.67 | 1.73 |
| 0.75 | 180.5 | 0.72 | 2.64 |
At \(\varphi=0.50\), inference on the mean from five calendar years has the same variance as inference from approximately 1.67 independent years with the same marginal variance. The IID variance formula understates the standard error of the sample mean by a factor of approximately 1.73 under this model.
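The exact finite-sample figures follow directly from the variance of the sample mean under AR(1) autocorrelations \(\rho_k=\varphi^k\). A minimal sketch, with a hypothetical function name and 252 observations per nominal year as in the note:

```python
import numpy as np


def effective_sample_size_ar1(n: int, phi: float) -> float:
    """Variance-equivalent effective sample size for the mean of an AR(1) series."""
    k = np.arange(1, n)
    # Finite-sample inflation factor: 1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * phi^k
    inflation = 1.0 + 2.0 * np.sum((1.0 - k / n) * phi ** k)
    return n / inflation


if __name__ == "__main__":
    n = 1260
    for phi in (0.10, 0.25, 0.50, 0.75):
        n_eff = effective_sample_size_ar1(n, phi)
        print(phi, round(n_eff, 1), round(n_eff / 252, 2), round(np.sqrt(n / n_eff), 2))
```

The standard-error inflation is \(\sqrt{n/n_{\mathrm{eff}}}\), which is why a one-third information-retention ratio corresponds to an inflation of roughly 1.73.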
Figure 2 (interactive). Selected view: 5 nominal years, \(\varphi=0.50\).
| Quantity | Value |
|---|---|
| Nominal years | 5 |
| Nominal observations | 1260 |
| AR(1) coefficient phi | 0.50 |
| Large-sample information-retention ratio | 33.33% |
| Exact finite-sample effective sample size | 420.4449 |
| Exact finite-sample information-retention ratio | 33.3686% |
| Exact effective years | 1.6684 |
| Standard-error inflation | 1.7311x |
Assumptions: AR(1) dependence; 252 observations per nominal year; estimand is the sample mean; calculations are analytical and model-based; no market data is used; no Atamus Capital strategy data is used.
The AR(1) calculation is not offered as a model of every return series. It is an instrument for showing how quickly nominal information can collapse under dependence. In applied research, dependence can arise from overlapping labels, persistent exposures, repeated observations of the same underlying state, cross-sectional commonality, stale prices, smoothing, or portfolio construction. Each source requires its own treatment.
The same principle applies to candidate strategies. One hundred highly correlated trials are not one hundred independent trials. Yet replacing the literal trial count with a vaguely defined effective count can create a different form of false precision. Dependence must be estimated, uncertainty around that estimate must be acknowledged, and the research archive must preserve enough information to make the calculation possible.
4. Nominal confidence is conditional confidence
A confidence interval is not a property of the observed statistic alone. Its coverage depends on the data-generating process and the estimator used for uncertainty.
We tested this directly through 100,000 Monte Carlo replications for each of four stylized processes. Every process has true mean zero and unit unconditional variance. Each replication contains \(T=504\) observations.
The four processes are:
- independent Gaussian observations;
- independent standardized Student \(t_5\) observations;
- AR(1) observations with Gaussian innovations and \(\varphi=0.30\);
- AR(1) observations with standardized Student \(t_5\) innovations and \(\varphi=0.30\).
Let \(\varepsilon_t\) be independent, mean-zero, unit-variance innovations, Gaussian or standardized Student \(t_5\) as specified above. The simulated sample is indexed \(X_1,\ldots,X_T\). For the AR(1) cases, the experiment sets \(X_1=\varepsilon_1\) and, for \(t=2,\ldots,T\),
\[
X_t=\varphi X_{t-1}+\sqrt{1-\varphi^2}\,\varepsilon_t .
\]
This construction has \(\mathbb E[X_t]=0\), \(\operatorname{Var}(X_t)=1\), and \(\operatorname{Cov}(X_t,X_{t-k})=\varphi^k\) for every admissible \(t\) and \(k\). It is covariance-stationary even though, with non-Gaussian innovations, strict stationarity of the initial marginal law is not asserted.
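The naive interval is
\[
\bar X_T\pm z_{0.975}\,\frac{s_T}{\sqrt T},
\]
where
\[
\bar X_T=\frac1T\sum_{t=1}^{T}X_t,\qquad s_T^2=\frac{1}{T-1}\sum_{t=1}^{T}\bigl(X_t-\bar X_T\bigr)^2,\qquad z_{0.975}=\Phi^{-1}(0.975).
\]
The dependence-aware interval uses a Bartlett-kernel Newey-West estimator of long-run variance with lag \(L=10\). Define
\[
\hat\gamma_k=\frac1T\sum_{t=k+1}^{T}\bigl(X_t-\bar X_T\bigr)\bigl(X_{t-k}-\bar X_T\bigr),\qquad k=0,1,\ldots,L,
\]
and
\[
\hat\omega^2=\hat\gamma_0+2\sum_{k=1}^{L}\left(1-\frac{k}{L+1}\right)\hat\gamma_k .
\]
The resulting interval is
\[
\bar X_T\pm z_{0.975}\,\frac{\hat\omega}{\sqrt T}.
\]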
An oracle standard-error interval uses the exact finite-sample variance of the sample mean together with the same Gaussian critical value. It is a diagnostic benchmark, not an exactly calibrated finite-sample interval for every non-Gaussian process, and it is unavailable in real research.
| Data-generating process | Naive IID interval | HAC interval | Oracle interval |
|---|---|---|---|
| IID Gaussian | 94.95% | 94.34% | 95.01% |
| IID Student \(t_5\) | 95.01% | 94.39% | 95.06% |
| AR(1), Gaussian innovations, \(\varphi=0.30\) | 84.92% | 93.59% | 95.08% |
| AR(1), Student \(t_5\) innovations, \(\varphi=0.30\) | 84.83% | 93.53% | 95.01% |
The largest Monte Carlo standard error in the table is approximately 0.113 percentage points.
Underlying data by process and method:
| Data-generating process | Method | Coverage | Monte Carlo SE | Mean half-width |
|---|---|---|---|---|
| IID Gaussian | IID | 94.950% | 0.000692 | 0.087266 |
| IID Gaussian | HAC | 94.343% | 0.000731 | 0.086057 |
| IID Gaussian | ORACLE | 95.010% | 0.000689 | 0.087304 |
| IID Student t(5) | IID | 95.009% | 0.000689 | 0.087157 |
| IID Student t(5) | HAC | 94.386% | 0.000728 | 0.085944 |
| IID Student t(5) | ORACLE | 95.055% | 0.000686 | 0.087304 |
| AR(1), Gaussian innovations | IID | 84.920% | 0.001132 | 0.087193 |
| AR(1), Gaussian innovations | HAC | 93.586% | 0.000775 | 0.113550 |
| AR(1), Gaussian innovations | ORACLE | 95.075% | 0.000684 | 0.118897 |
| AR(1), Student t(5) innovations | IID | 84.828% | 0.001134 | 0.087083 |
| AR(1), Student t(5) innovations | HAC | 93.532% | 0.000778 | 0.113350 |
| AR(1), Student t(5) innovations | ORACLE | 95.011% | 0.000688 | 0.118897 |
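The Gaussian AR(1) row of this experiment can be approximated with a short simulation. The sketch below is a reduced illustration, not the published code: it uses 4,000 replications rather than 100,000, the function names are hypothetical, and the naive standard error is assumed to use the unbiased sample variance. Its output will therefore differ from the table by Monte Carlo error.

```python
import numpy as np

Z975 = 1.959963984540054  # Gaussian critical value Phi^{-1}(0.975)


def simulate_ar1(rng, reps, T, phi):
    """Unit-variance AR(1) paths: X_1 = eps_1, X_t = phi X_{t-1} + sqrt(1-phi^2) eps_t."""
    eps = rng.standard_normal((reps, T))
    x = np.empty_like(eps)
    x[:, 0] = eps[:, 0]
    scale = np.sqrt(1.0 - phi ** 2)
    for t in range(1, T):
        x[:, t] = phi * x[:, t - 1] + scale * eps[:, t]
    return x


def newey_west_lrv(x, lag):
    """Bartlett-kernel long-run variance per row, autocovariances normalized by 1/T."""
    T = x.shape[1]
    d = x - x.mean(axis=1, keepdims=True)
    lrv = np.sum(d * d, axis=1) / T
    for k in range(1, lag + 1):
        gamma_k = np.sum(d[:, k:] * d[:, :-k], axis=1) / T
        lrv += 2.0 * (1.0 - k / (lag + 1)) * gamma_k
    return lrv


def coverage(x, lag=10):
    """Share of replications whose interval contains the true mean of zero."""
    T = x.shape[1]
    mean = x.mean(axis=1)
    se_iid = x.std(axis=1, ddof=1) / np.sqrt(T)
    se_hac = np.sqrt(newey_west_lrv(x, lag) / T)
    cover_iid = float(np.mean(np.abs(mean) <= Z975 * se_iid))
    cover_hac = float(np.mean(np.abs(mean) <= Z975 * se_hac))
    return cover_iid, cover_hac


if __name__ == "__main__":
    rng = np.random.default_rng(20260622)
    paths = simulate_ar1(rng, reps=4000, T=504, phi=0.30)
    print(coverage(paths))
```

Because the Bartlett weights keep the estimated long-run variance non-negative, the square root is always defined. Raising `reps` tightens the Monte Carlo error at the cost of memory, since all paths are held at once in this sketch.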
The result is not that one estimator should always replace another. Under the independent processes, the fixed-lag HAC intervals are slightly narrower on average in this finite sample and show modest undercoverage. Under serial dependence, the HAC interval materially improves coverage but does not completely recover the nominal rate. Bandwidth selection, persistence, tail behavior, and sample length remain consequential. Newey and West established a positive semi-definite covariance construction that is consistent when its assumptions and bandwidth conditions are satisfied, not a guarantee of exact finite-sample coverage.[6] The fixed choice \(L=10\) is an explicit finite-sample design choice for this experiment. Because the AR(1) process has nonzero autocovariances beyond lag 10, this fixed-lag estimator is not presented as an asymptotically consistent long-run variance estimator for that process.
This is the distinction between a robustness method and a robustness result. A method can be theoretically valid under stated conditions while a finite sample still contains too little information for the requested precision.
5. Specification robustness is geometric
A model that works only at one narrow parameter combination is structurally different from a model whose conclusion persists across a defensible neighborhood.
Let \(J(\theta)\) be a validation objective at specification \(\theta\). Define local fragility within radius \(\epsilon\) as
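A complementary measure is the stability volume
\[
\mathcal V_{\epsilon,\tau}(\theta)=\frac{\nu\bigl(\{h\in B_{\epsilon}:J(\theta+h)\geq\tau\}\bigr)}{\nu(B_{\epsilon})},
\]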
where \(B_{\epsilon}\) is the perturbation ball and \(\nu\) is its volume measure.
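Peak height and stability volume answer different questions. The following controlled construction makes the distinction exact:
\[
J_{\sigma}(\theta)=0.80\exp\!\left(-\frac{\lVert\theta\rVert^2}{2\sigma^2}\right)-0.05,\qquad\theta\in\mathbb R^2 .
\]
Both surfaces have the same maximum:
\[
\max_{\theta}J_{\sigma}(\theta)=J_{\sigma}(0)=0.75 .
\]
At the common optimum, the Hessian is
\[
\nabla^2J_{\sigma}(0)=-\frac{0.80}{\sigma^2}\,I_2 .
\]
The peak-curvature magnitude reported in the interactive figure is therefore
\[
\kappa_{\sigma}=\frac{0.80}{\sigma^2},
\]
the absolute value of either Hessian eigenvalue at the optimum. One surface is broad, with \(\sigma=0.45\). The other is narrow, with \(\sigma=0.12\). Set the evidence threshold to \(\tau=0.20\) and the admissible perturbation radius to \(\epsilon=0.50\). The threshold radius is
\[
r_{\tau}=\sigma\sqrt{2\ln\frac{0.80}{\tau+0.05}} .
\]
Because the surface is radially symmetric and the perturbation is centered at the common optimum \(\theta=0\), the stability volume is
\[
\mathcal V=\min\!\left(1,\frac{r_{\tau}}{\epsilon}\right)^{2}.
\]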
For the broad surface, \(r_{\tau}=0.686\), so the entire perturbation ball remains above the threshold and \(\mathcal V=1.000\). For the narrow surface, \(r_{\tau}=0.183\), so only 13.40 percent of the same perturbation ball survives. The peak value is identical.
Formula: \(J_{\sigma}(\theta)=0.80\exp(-\lVert\theta\rVert^2/(2\sigma^2))-0.05\). Default threshold: 0.20. Default perturbation radius: 0.50.
| Surface | Sigma | Peak value | Threshold radius | Stability-volume fraction |
|---|---|---|---|---|
| Broad | 0.450 | 0.750 | 0.68635 | 100.00% |
| Narrow | 0.120 | 0.750 | 0.18303 | 13.40% |
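The two rows follow from the closed form of the surface. A minimal sketch, assuming a two-dimensional parameter space (consistent with the 13.40 percent figure and the 81 by 81 grid) and using hypothetical function names:

```python
import math


def threshold_radius(sigma: float, tau: float, peak: float = 0.80, offset: float = 0.05) -> float:
    """Radius at which J_sigma(theta) = peak * exp(-r^2 / (2 sigma^2)) - offset equals tau."""
    return sigma * math.sqrt(2.0 * math.log(peak / (tau + offset)))


def stability_fraction(sigma: float, tau: float, eps: float, dim: int = 2) -> float:
    """Fraction of the eps-ball around the optimum where the surface stays at or above tau."""
    r = threshold_radius(sigma, tau)
    # Ratio of ball volumes scales as (radius ratio)^dim, capped at the full ball.
    return min(1.0, r / eps) ** dim


if __name__ == "__main__":
    for name, sigma in (("Broad", 0.45), ("Narrow", 0.12)):
        print(name, round(threshold_radius(sigma, 0.20), 5),
              round(100 * stability_fraction(sigma, 0.20, 0.50), 2))
```

The `dim` argument makes the dimensional dependence explicit: in higher dimensions the same radius ratio leaves a much smaller surviving fraction.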
A flat surface is not proof that a model is correct. A misspecified model can be smoothly wrong. Nor is every sharp optimum invalid. Some physical and economic mechanisms are genuinely localized. The point is that sensitivity geometry is evidence that must be interpreted, not a cosmetic chart placed around the selected parameter.
The perturbation set should also be established without consulting the desired result. If the radius is chosen after the surface is observed, the robustness test becomes another optimization variable.
6. Stability within a regime is not stability across regimes
Parameter uncertainty and distributional change are distinct problems.
Suppose returns satisfy
where \(Z_t\) denotes a latent state and \(\gamma_z\) denotes any additional shape or tail parameter required by the chosen distributional family. A model may estimate \(\mu_z\) precisely conditional on a state and still fail because the state distribution, transition mechanism, or mapping from state to returns changes.
A simple Markov representation is
This formalizes conditional environments. It does not establish that the states are economically real, correctly identified, or persistent. Structural-break procedures can estimate and test changes in parameters under stated conditions, but they do not reveal future break dates.[8]
A distributionally robust expression makes the logical problem explicit:
where \(P_0\) is an estimated distribution, \(\mathcal U_{\rho}(P_0)\) is a defensible uncertainty set around it, and \(u\) is a stated utility or evaluation function. The result depends on the geometry and radius of \(\mathcal U_{\rho}\). An uncertainty set broad enough to include everything is uninformative. One narrow enough to exclude relevant change is comforting but false.
For this reason, regime robustness should not be reduced to showing that a model worked in several historical subperiods. Subperiods are themselves selected, often correlated, and may repeat the same underlying economic state. A credible analysis asks which aspects of the model are expected to remain invariant, which are allowed to move, what evidence would contradict the invariance claim, and how quickly deterioration could be detected.
7. Implementation is a random variable
A modeled return is not the return available to capital.
Write
where \(C_t\) denotes direct transaction and financing costs, \(L_t\) denotes liquidity and market-impact effects, and \(E_t\) denotes timing, execution, operational, and measurement error.
Treating these terms as fixed deductions is often insufficient. Their distributions may depend on market state, strategy activity, and the same conditions that affect gross return. Even when the expectation decomposes linearly,
risk does not:
If friction rises when the modeled return is weakest, the covariance term is negative and net risk increases. This is one reason execution cannot be appended as a constant haircut after model selection. The classical optimal-execution literature formalizes the tradeoff between expected cost and uncertainty even in relatively simple impact models.[10]
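A small numerical illustration makes the sign of the covariance term concrete. The sketch below aggregates \(C_t+L_t+E_t\) into a single friction series and checks the identity \(\operatorname{Var}(\text{gross}-\text{friction})=\operatorname{Var}(\text{gross})+\operatorname{Var}(\text{friction})-2\operatorname{Cov}(\text{gross},\text{friction})\). Every number in it is a hypothetical assumption chosen only to produce friction that rises when gross return is weak; none describes any real cost model.

```python
import numpy as np


def net_variance_decomposition(gross: np.ndarray, friction: np.ndarray) -> dict:
    """Decompose Var(gross - friction) into variance and covariance terms."""
    var_gross = np.var(gross, ddof=1)
    var_friction = np.var(friction, ddof=1)
    cov = np.cov(gross, friction)[0, 1]  # np.cov uses ddof=1 by default
    return {
        "var_gross": var_gross,
        "var_friction": var_friction,
        "cov": cov,
        "var_net_identity": var_gross + var_friction - 2.0 * cov,
        "var_net_direct": np.var(gross - friction, ddof=1),
    }


def simulate_state_dependent_friction(n: int = 100_000, seed: int = 0):
    """Hypothetical example: friction grows with the size of negative gross returns."""
    rng = np.random.default_rng(seed)
    gross = rng.normal(0.0005, 0.01, n)
    friction = 0.0002 + 0.10 * np.maximum(-gross, 0.0) + 0.0001 * rng.standard_normal(n) ** 2
    return gross, friction


if __name__ == "__main__":
    gross, friction = simulate_state_dependent_friction()
    for key, value in net_variance_decomposition(gross, friction).items():
        print(f"{key}: {value:.3e}")
```

In this construction the covariance is negative, so net variance exceeds gross variance even though the mean friction is small. A constant haircut would have left the variance unchanged.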
A robust implementation analysis can be written as
or, when \(c\) is stochastic,
The uncertainty set \(\mathcal C\), the distribution \(Q_c\), and the threshold \(\tau\) are strategy-specific. Publishing them can reveal turnover, liquidity, capacity, holding period, or execution design. They are therefore intentionally absent from this note.
The public principle is sufficient:
Implementation robustness asks whether the inference survives uncertainty in the process that converts a model into realized positions and realized trades.
8. A historical drawdown is one path
Average return and volatility do not describe the sequence by which capital is gained or lost.
For a positive wealth process \(V_t\), define drawdown as
\[
D_t=1-\frac{V_t}{\max_{0\leq s\leq t}V_s}
\]
and maximum drawdown over horizon \(T\) as
\[
\mathrm{MDD}_T=\max_{0\leq t\leq T}D_t .
\]
Maximum drawdown is a path functional. It depends on the order of returns, not only their marginal distribution. Its behavior has been studied analytically even for Brownian motion, where the distribution is already nontrivial.[11]
To demonstrate the difference between one realized drawdown and a drawdown distribution, we generated 100,000 independent three-year paths from a fully disclosed stylized process. Here, \(r_t\) is a daily log return:
Because \(\operatorname{Var}(u_t)=5/3\), the scaled innovation \(z_t\) has unit variance. Since \(0.05+0.93<1\), the shock process has unconditional daily variance \(0.12^2/252\). The parameters therefore imply 6 percent annualized expected log return and 12 percent unconditional annualized volatility within this stipulated model.[9] They are illustrative assumptions, not estimates from market data, not calibrations to an Atamus strategy, and not Atamus return targets, volatility targets, or risk limits. Each path begins with conditional variance set to the unconditional variance, then uses 756 retained observations after a 500-observation burn-in. The random seed is 20260622.
The displayed path is replication 1. It was not selected by outcome. Its maximum drawdown is 15.99 percent. Across all paths, the median maximum drawdown is 15.39 percent, the 90th percentile is 26.16 percent, the 95th percentile is 30.27 percent, and the 99th percentile is 39.12 percent.
| Quantile | Maximum drawdown |
|---|---|
| 1% | 6.88% |
| 5% | 8.52% |
| 10% | 9.60% |
| 25% | 11.89% |
| 50% | 15.39% |
| 75% | 20.30% |
| 90% | 26.16% |
| 95% | 30.27% |
| 99% | 39.12% |
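A reduced version of the drawdown experiment can be sketched as follows. The specification used here is this sketch's reading of the stated parameters, not a confirmed copy of the published code: a GARCH(1,1) shock with ARCH coefficient 0.05 and GARCH coefficient 0.93, unit-variance Student \(t_5\) innovations, daily drift \(0.06/252\), and unconditional daily variance \(0.12^2/252\). It uses 2,000 paths and its own random-number stream, so its quantiles will be close to, but not identical with, the table.

```python
import numpy as np


def simulate_max_drawdowns(n_paths=2000, n_obs=756, burn_in=500, seed=20260622,
                           mu_annual=0.06, vol_annual=0.12,
                           alpha=0.05, beta=0.93, dof=5, periods=252):
    """Maximum drawdown per path under an assumed GARCH(1,1)-Student-t log-return model."""
    rng = np.random.default_rng(seed)
    mu = mu_annual / periods
    uncond_var = vol_annual ** 2 / periods
    omega = (1.0 - alpha - beta) * uncond_var   # implies the stated unconditional variance
    t_scale = np.sqrt(dof / (dof - 2))          # Var(t_5) = 5/3, so dividing gives unit variance

    h = np.full(n_paths, uncond_var)            # start at the unconditional variance
    log_wealth = np.zeros(n_paths)
    running_peak = np.zeros(n_paths)            # initial wealth counts as the first peak
    max_dd = np.zeros(n_paths)

    for t in range(burn_in + n_obs):
        z = rng.standard_t(dof, size=n_paths) / t_scale
        shock = np.sqrt(h) * z
        if t >= burn_in:                        # only retained observations move wealth
            log_wealth += mu + shock
            running_peak = np.maximum(running_peak, log_wealth)
            max_dd = np.maximum(max_dd, 1.0 - np.exp(log_wealth - running_peak))
        h = omega + alpha * shock ** 2 + beta * h
    return max_dd


if __name__ == "__main__":
    dd = simulate_max_drawdowns()
    for p in (0.50, 0.90, 0.95, 0.99):
        print(f"{p:.0%}: {np.quantile(dd, p):.2%}")
```

Tracking the running peak in log-wealth avoids storing full paths; the drawdown at each step is one minus the ratio of current wealth to its running maximum.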
The quantiles are Monte Carlo estimates and therefore have simulation uncertainty. Let \(F_D\) denote the population distribution of maximum drawdown under the stipulated process. If \(F_D\) is continuous at its population \(p\)-quantile \(q_p\), so that \(F_D(q_p)=p\), and the simulated paths are independent, then
This gives finite-sample confidence intervals from simulation ranks. The exact binomial-rank intervals have at least 95 percent coverage. Their endpoints, rounded for display, are approximately 15.34 to 15.44 percent for the median, 30.14 to 30.43 percent for the 95th percentile, and 38.74 to 39.52 percent for the 99th percentile. These intervals quantify Monte Carlo error only. They do not quantify uncertainty about whether the stipulated return process is an adequate model of any market or strategy.
Displayed path maximum drawdown: 15.99%. Terminal return: 12.32%. The path is replication 1 and was not selected by outcome.
| Quantile | Maximum drawdown | Monte Carlo rank interval |
|---|---|---|
| 1% | 6.88% | 6.83% to 6.93% |
| 5% | 8.52% | 8.49% to 8.55% |
| 10% | 9.60% | 9.57% to 9.64% |
| 25% | 11.89% | 11.85% to 11.93% |
| 50% | 15.39% | 15.34% to 15.44% |
| 75% | 20.30% | 20.23% to 20.37% |
| 90% | 26.16% | 26.05% to 26.28% |
| 95% | 30.27% | 30.14% to 30.43% |
| 99% | 39.12% | 38.74% to 39.52% |
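Rank intervals of this kind rest on one fact: if the distribution is continuous at \(q_p\), the number of independent simulated values at or below \(q_p\) is binomial with parameters \(N\) and \(p\). The sketch below shows one standard way to turn that into order-statistic endpoints with at least the requested coverage. The function name and the exact rank rule are assumptions of this sketch and may differ in detail from the convention used for the published intervals.

```python
import numpy as np
from scipy import stats


def quantile_rank_interval(sorted_sample: np.ndarray, p: float, level: float = 0.95):
    """Distribution-free interval for the p-quantile from order statistics.

    Returns (lower endpoint, upper endpoint, exact binomial coverage).
    """
    n = len(sorted_sample)
    tail = (1.0 - level) / 2.0
    lower_rank = int(stats.binom.ppf(tail, n, p))            # 1-based rank l
    upper_rank = int(stats.binom.ppf(1.0 - tail, n, p)) + 1  # 1-based rank u
    if lower_rank < 1 or upper_rank > n:
        raise ValueError("sample too small for this quantile and level")
    # The interval [X_(l), X_(u)] covers q_p exactly when l <= B <= u - 1,
    # where B ~ Binomial(n, p) counts sample values at or below q_p.
    coverage = stats.binom.cdf(upper_rank - 1, n, p) - stats.binom.cdf(lower_rank - 1, n, p)
    return sorted_sample[lower_rank - 1], sorted_sample[upper_rank - 1], float(coverage)


if __name__ == "__main__":
    rng = np.random.default_rng(20260622)
    sample = np.sort(rng.standard_normal(100_000))  # stand-in for simulated drawdowns
    print(quantile_rank_interval(sample, 0.50))
```

The interval widens toward the tails because fewer simulated values fall near an extreme quantile, which matches the pattern in the table above.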
If the first path were the only observed history, a researcher might report 15.99 percent as though it described the process. In this experiment, 5 percent of otherwise identical paths lose more than 30.27 percent from peak to trough. The historical maximum is evidence about what happened. It is not a deterministic limit on what could happen under the same stipulated process.
A simulated drawdown distribution is also conditional. It inherits the assumptions of the return model, dependence structure, horizon, drift, tail law, and parameter estimates. For stationary weakly dependent data, block methods such as the stationary bootstrap can retain local dependence information that IID resampling discards under their stated regularity conditions. They remain conditional on the observed history and the block-length rule.[7] There is no assumption-free drawdown forecast.
9. Robustness cannot be compressed without loss
The temptation is to combine every diagnostic into one score:
This can be useful as a generic decision aid, but it can also conceal failure. A high score in parameter stability can offset weak selection control if the weights permit it. Strong historical drawdown behavior can offset implementation uncertainty. The arithmetic may be valid while the decision logic is not.
For public analytical purposes, this note treats robustness as an evidence architecture rather than a ranking number. Each claim should identify at least five objects:
A robust result is not one that survives every imaginable attack. It is one that survives a set of relevant, documented, and sufficiently severe challenges without relying on hidden changes to the claim.
Several boundaries follow.
First, robustness does not prove profitability. It reduces the set of plausible explanations for an observed result. It cannot eliminate model risk or future change.
Second, more tests do not automatically create more evidence. Tests that reuse the same information, inspect the same failure mode, or were chosen after seeing the result may add little.
Third, a failed robustness test is information. It should not be silently converted into a revised model and forgotten. The failed specification remains part of the research path and therefore part of selection-adjusted inference.
Fourth, robustness is conditional on admissibility. A model should not be required to survive perturbations that contradict its economic or mathematical premise. The burden is to define that boundary before the result is known.
Finally, live observation is not a ceremonial final box. It is a new source of evidence under conditions that simulation cannot completely reproduce. Even then, live evidence remains finite, path-dependent, and subject to selection.
10. Conclusion
Return is an observation. Robustness is the behavior of an inference under admissible alternatives.
A systematic strategy should not be judged by the height of one historical performance estimate. It should be studied through the research search that selected it, the dependence that governs uncertainty for the estimand, the specification geometry around it, the possibility of structural change, the uncertainty of implementation, and the distribution of paths capital may experience.
The calculations in this note show why. In the stylized five-year Gaussian estimator model, selection alone produces an expected annualized Sharpe above 1.1 after 100 independent trials under a zero-skill null. In the 504-observation AR(1) experiment, ignoring dependence reduces the actual coverage of a nominal 95 percent interval to approximately 84.9 percent. In the disclosed drawdown experiment, the 95th-percentile maximum drawdown is approximately 1.89 times the displayed path maximum.
None of these facts establishes that a strategy is invalid. They establish that return is not the first question.
The first question is whether the conclusion survives the process required to believe it.
For Atamus Capital, that is the meaning of robustness before return.
Methodology and reproducibility
All numerical results in this note are generated by the accompanying reproducibility package. No market data, Atamus strategy data, or third-party return series is used. The numerical examples are analytical constructions or Monte Carlo estimates under the assumptions stated in the article.
- Random seed: 20260622
- Selection calculations: exact identities evaluated by adaptive numerical quadrature
- Confidence-interval experiment: 100,000 replications per process; standard Newey-West autocovariance normalization \(T^{-1}\) and Bartlett lag \(L=10\)
- Drawdown experiment: 100,000 paths, 756 post-burn observations per path, 500-observation burn-in; point estimates use the linear sample-quantile convention; all reported quantiles are finite Monte Carlo estimates conditional on the stated GARCH-Student model, with at-least-95-percent binomial-rank intervals for simulation error
- Parameter-surface experiment: exact analytical construction on an 81 by 81 grid
- Software environment used for the published data: Python 3.13.5, NumPy 2.3.5, SciPy 1.17.0, pandas 2.2.3, Matplotlib 3.10.8
The graph data file contains calculated analytical values, grid values, the exact sorted model-implied maximum-drawdown sample used by the empirical CDF, and the first simulated drawdown path. The complete experiments can be regenerated from source code. No proprietary Atamus Capital data is present.
References
[1] Halbert White, "A Reality Check for Data Snooping", Econometrica, 68(5), 1097-1126, 2000.
[2] Peter R. Hansen, "A Test for Superior Predictive Ability", Journal of Business & Economic Statistics, 23(4), 365-380, 2005.
[3] Campbell R. Harvey, Yan Liu, and Heqing Zhu, "... and the Cross-Section of Expected Returns", Review of Financial Studies, 29(1), 5-68, 2016.
[4] David H. Bailey and Marcos Lopez de Prado, "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality", Journal of Portfolio Management, 40(5), 94-107, 2014.
[5] Andrew W. Lo, "The Statistics of Sharpe Ratios", Financial Analysts Journal, 58(4), 36-52, 2002.
[6] Whitney K. Newey and Kenneth D. West, "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix", Econometrica, 55(3), 703-708, 1987.
[7] Dimitris N. Politis and Joseph P. Romano, "The Stationary Bootstrap", Journal of the American Statistical Association, 89(428), 1303-1313, 1994.
[8] Jushan Bai and Pierre Perron, "Computation and Analysis of Multiple Structural Change Models", Journal of Applied Econometrics, 18(1), 1-22, 2003.
[9] Tim Bollerslev, "Generalized Autoregressive Conditional Heteroskedasticity", Journal of Econometrics, 31(3), 307-327, 1986.
[10] Robert Almgren and Neil Chriss, "Optimal Execution of Portfolio Transactions", Journal of Risk, 3(2), 5-39, 2001.
[11] Malik Magdon-Ismail, Amir F. Atiya, Amrit Pratap, and Yaser S. Abu-Mostafa, "On the Maximum Drawdown of a Brownian Motion", Journal of Applied Probability, 41(1), 147-161, 2004.
Disclaimer
Research notes published by Atamus Capital are provided for general informational and research purposes only. They do not constitute investment advice, trading advice, a recommendation, an offer to sell, or a solicitation to buy any security, fund interest, account, or investment product.
This note does not disclose Atamus Capital's proprietary strategies, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, portfolio construction methods, execution processes, investment universe, research thresholds, model-development workflow, or investment decisions.