Robustness Before Return | Atamus Capital

Return is observed. Robustness must be investigated.

A historical result is conditional on one market path, one model specification, one implementation model, and one sequence of research decisions. Alter those conditions and the conclusion may strengthen, weaken, or disappear. The purpose of validation is not to protect a favorable result from such perturbations. It is to expose the result to them.

Scope of this note

Atamus Capital does not publish proprietary strategy rules, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, investment universes, portfolio construction methods, execution processes, or position-level information. This note examines the mathematics of validation through closed-form calculations, deterministic numerical experiments, and controlled Monte Carlo experiments under fully specified probability laws. No internal threshold, sequence, weighting rule, holding period, implementation assumption, or model-development workflow is disclosed. The analytical dimensions used below are a public taxonomy for discussing validation, not a description of Atamus Capital's internal research architecture.

Abstract

Historical return is a realization produced by a particular sample, specification, research path, and implementation model. Robustness concerns whether the inference supported by that realization survives admissible changes to those conditions. We study systematic strategy evaluation as a problem of selection-adjusted inference, dependence, finite-sample uncertainty, specification geometry, temporal instability, implementation error, and path-dependent loss. Analytical calculations show how model selection can create apparently strong performance under a zero-skill null. A 100,000-replication Monte Carlo experiment estimates that, when moderate serial dependence is ignored, a naive IID interval labelled as nominally 95 percent covers the true mean only about 84.9 percent of the time. A second 100,000-path Monte Carlo experiment shows why one historical maximum drawdown should not be treated as a stable property of a strategy. We conclude that robustness is not a scalar score and not a synonym for profitability. It is a vector of conditional claims, each of which must be stated, tested, and governed separately.

1. Return is conditional

Let a systematic strategy be denoted by \(S\). A reported performance statistic is not generated by \(S\) alone. It is produced by a complete evaluation state:

\widehat{\Psi}_0 = \Psi\!\left(S;\,\mathcal D,\theta,c,\mathcal A\right),

where

\begin{aligned} \mathcal D & = \text{the observed data and its construction},\\ \theta & = \text{the model specification and estimated parameters},\\ c & = \text{the implementation and friction model},\\ \mathcal A & = \text{the research path that selected the surviving model}. \end{aligned}

The same strategy evaluated under a perturbed state \(\delta\) produces

\widehat{\Psi}_{\delta} = \Psi\!\left( S;\mathcal D_{\delta},\theta_{\delta},c_{\delta},\mathcal A_{\delta} \right).

Robustness is therefore not the value of \(\widehat{\Psi}_0\). It concerns the probability law induced by admissible perturbations. Define \(g(\delta)=\widehat{\Psi}_{\delta}\). For every Borel set \(B\subseteq\mathbb R\),

\mathcal L_Q\!\left(\widehat{\Psi}_{\delta}\right)(B) = Q\!\left(g^{-1}(B)\right) = Q\left\{\delta:\widehat{\Psi}_{\delta}\in B\right\}.

Here, \(Q\) describes a defensible probability distribution over admissible perturbations. Two summaries of the induced distribution are useful:

p_{\tau}(S) = \mathbb P_{\delta\sim Q} \left(\widehat{\Psi}_{\delta}\geq \tau\right),

and

q_{\alpha}(S) = \inf\left\{x: \mathbb P_{\delta\sim Q} \left(\widehat{\Psi}_{\delta}\leq x\right) \geq \alpha \right\}, \qquad 0<\alpha<1.

The first is a survival probability relative to an evidentiary threshold \(\tau\), with \(\Psi\) oriented so that larger values are preferred. A loss functional can be handled by reversing the inequality or changing its sign. For a small value of \(\alpha\), the second is a lower quantile of performance across perturbations. Neither quantity is meaningful until the perturbation law \(Q\), the performance functional \(\Psi\), and the threshold \(\tau\) are justified.
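As a minimal sketch of these two summaries, suppose a researcher already holds a list of statistics \(\widehat{\Psi}_{\delta}\) evaluated at perturbations drawn from \(Q\). The helper names and the five sample values below are hypothetical illustrations, not taken from any strategy, and the empirical versions are only as meaningful as the perturbation law that produced the draws.

```python
from typing import Sequence


def survival_probability(psi_values: Sequence[float], tau: float) -> float:
    """Empirical p_tau: share of perturbed statistics at or above tau."""
    return sum(1 for value in psi_values if value >= tau) / len(psi_values)


def lower_quantile(psi_values: Sequence[float], alpha: float) -> float:
    """Empirical q_alpha = inf{x : F_n(x) >= alpha}, F_n the empirical CDF."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(psi_values)
    n = len(ordered)
    for rank, value in enumerate(ordered, start=1):
        # rank / n is the empirical CDF reached at this order statistic.
        if rank / n >= alpha:
            return value
    return ordered[-1]


# Hypothetical statistics from five perturbations drawn under Q.
psi_delta = [0.10, 0.20, 0.30, 0.40, 0.50]
p_tau = survival_probability(psi_delta, tau=0.30)   # three of five survive
q_alpha = lower_quantile(psi_delta, alpha=0.20)     # smallest order statistic
```

The quantile helper follows the infimum definition literally rather than interpolating, so it always returns one of the observed values.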

This point is fundamental. A researcher can manufacture apparent robustness by choosing perturbations that are too small, too convenient, or too similar to the selected specification. Robustness is not created by running many tests. It is created by exposing a claim to alternatives that could reasonably have changed the conclusion.

For the purposes of this note, robustness can be represented as a multidimensional object:

\mathfrak R(S) = \left( \mathfrak R_{\mathrm{sel}}, \mathfrak R_{\mathrm{dep}}, \mathfrak R_{\mathrm{spec}}, \mathfrak R_{\mathrm{time}}, \mathfrak R_{\mathrm{impl}}, \mathfrak R_{\mathrm{path}} \right).

The components refer to robustness against research selection, statistical dependence, specification changes, temporal instability, implementation uncertainty, and path-dependent loss. A favorable result in one component does not establish robustness in another. This taxonomy is conceptual and is not a map of Atamus Capital's internal validation or model-development workflow.

2. Selection changes the distribution of the result

Suppose a researcher examines \(m\) candidate models. For the baseline calculation, let

\widehat{\theta}_j = \theta_j+\sigma Z_j, \qquad Z_1,\ldots,Z_m\overset{\mathrm{iid}}{\sim}\mathcal N(0,1), \qquad \sigma>0.

The reported candidate is often the one with the largest estimate:

j^{\star} = \operatorname*{arg\,max}_{1\leq j\leq m} \widehat{\theta}_j.

Under the complete null,

H_0:\theta_1=\cdots=\theta_m=0,

we still have

\mathbb E\!\left[ \widehat{\theta}_{j^{\star}} \right] = \sigma\, \mathbb E\!\left[ \max_{1\leq j\leq m} Z_j \right] >0 \quad\text{for }m>1.

The act of searching changes the probability law of the reported statistic. This is not a philosophical objection to model selection. It is a mathematical consequence of conditioning on an extreme order statistic.

Proposition 1. Selection inflation under equicorrelated Gaussian noise

Let

Z_j = \sqrt{\rho}\,F + \sqrt{1-\rho}\,\varepsilon_j,

where \(F,\varepsilon_1,\ldots,\varepsilon_m\) are independent standard normal variables and \(0\leq\rho<1\). Then

\mathbb E\!\left[ \max_{1\leq j\leq m}Z_j \right] = \sqrt{1-\rho}\, \mathbb E\!\left[ \max_{1\leq j\leq m}\varepsilon_j \right].

Proof. The common factor is identical across all candidates, so

\max_j Z_j = \sqrt{\rho}\,F + \sqrt{1-\rho}\,\max_j\varepsilon_j.

Taking expectations and using \(\mathbb E[F]=0\) gives the result.

For independent standard normal variables,

\mathbb E[M_m] = m\int_{-\infty}^{\infty} z\,\phi(z)\,\Phi(z)^{m-1}\,dz, \qquad M_m=\max_{1\leq j\leq m}Z_j,

where \(\phi\) and \(\Phi\) denote the standard normal density and distribution functions.

To translate this into a familiar performance statistic, consider five years of daily observations, \(n=1{,}260\), and an annualization constant \(A=252\). In a Gaussian zero-skill surrogate with known unit daily volatility, let \(\bar X_j\) be the sample mean and write \(Z_j=\sqrt n\,\bar X_j\). The known-volatility annualized Sharpe estimator is then

\widehat{SR}_j = \sqrt A\,\bar X_j = \sqrt{\frac{A}{n}}\,Z_j.

The construction is deliberately narrow. It isolates selection from variance estimation and every other source of error. A full Sharpe-ratio inference problem also depends on serial correlation, variance estimation, and higher moments.[5]

For independent candidates, the expected selected annualized Sharpe is 0.835 after 20 trials, 1.121 after 100 trials, and 1.358 after 500 trials. Every candidate has true Sharpe zero. With candidate correlation \(\rho=0.50\), the corresponding values are lower, but still material: 0.591, 0.793, and 0.960.
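The figures above can be checked with a short standard-library sketch. It evaluates the order-statistic integral with a plain trapezoid rule on a truncated grid (an assumption of this sketch, not necessarily the quadrature used for the note's figures) and applies the \(\sqrt{1-\rho}\) scaling from Proposition 1. The function names, grid width, and step count are illustrative choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_max_std_normal(m: int, half_width: float = 10.0,
                            steps: int = 20_000) -> float:
    """E[max of m iid N(0,1)] = m * integral of z*phi(z)*Phi(z)^(m-1) dz."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end weights
        total += weight * z * STD_NORMAL.pdf(z) * STD_NORMAL.cdf(z) ** (m - 1)
    return m * h * total


def expected_selected_sharpe(m: int, rho: float = 0.0,
                             n_obs: int = 1260, annualization: int = 252) -> float:
    """Expected annualized Sharpe of the best of m zero-skill candidates."""
    scale = math.sqrt(annualization / n_obs)  # sqrt(A / n)
    return scale * math.sqrt(1.0 - rho) * expected_max_std_normal(m)
```

Calling `expected_selected_sharpe(20)`, `expected_selected_sharpe(100)`, and `expected_selected_sharpe(500, rho=0.50)` should agree with the values quoted above to the displayed precision, up to quadrature error.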

A related calculation concerns false discoveries. If each of \(m\) independent null hypotheses is tested at one-sided level \(\alpha\), then

\operatorname{FWER}(m,\alpha) = 1-(1-\alpha)^m.

At \(\alpha=0.05\), the probability of at least one false rejection is 64.15 percent after 20 independent trials and 99.41 percent after 100 trials. Positive equicorrelation reduces the effective breadth of the search relative to independence, but it does not remove the problem. Under the equicorrelated model, the exact probability is

1- \int_{-\infty}^{\infty} \phi(f) \left[ \Phi\!\left( \frac{c_{\alpha}-\sqrt{\rho}\,f}{\sqrt{1-\rho}} \right) \right]^m \,df,

where \(c_{\alpha}=\Phi^{-1}(1-\alpha)\). With \(\rho=0.50\), the familywise false-rejection probabilities for 20, 100, and 500 trials are 33.95 percent, 56.38 percent, and 74.68 percent.
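The familywise probability under equicorrelation is a one-dimensional integral over the common factor, so it can be sketched without simulation. The code below is a minimal standard-library illustration that again uses a truncated trapezoid rule, which is an assumption of this sketch. The function name and grid settings are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def familywise_error_rate(m: int, alpha: float = 0.05, rho: float = 0.0,
                          half_width: float = 10.0, steps: int = 20_000) -> float:
    """P(at least one of m equicorrelated null z-statistics exceeds c_alpha)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must satisfy 0 <= rho < 1")
    c_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # one-sided critical value
    h = 2.0 * half_width / steps
    no_rejection = 0.0
    for i in range(steps + 1):
        f = -half_width + i * h  # value of the common factor F
        weight = 0.5 if i in (0, steps) else 1.0
        # Conditional on F = f, the m candidates are independent.
        conditional = STD_NORMAL.cdf(
            (c_alpha - math.sqrt(rho) * f) / math.sqrt(1.0 - rho)
        )
        no_rejection += weight * STD_NORMAL.pdf(f) * conditional ** m
    return 1.0 - h * no_rejection
```

With `rho=0.0` the integrand no longer depends on the factor and the function reduces to \(1-(1-\alpha)^m\), which gives a convenient internal check.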

Candidate models	Expected selected Sharpe (rho = 0)	FWER (rho = 0)	Expected selected Sharpe (rho = 0.50)	FWER (rho = 0.50)
20	0.835	64.15%	0.591	33.95%
100	1.121	99.41%	0.793	56.38%
500	1.358	100.00% (rounded)	0.960	74.68%

Figure 1: Selection inflation (zero-skill Gaussian estimator model)

Figure 1. Selection inflation under a zero-skill null. The primary view plots the expected selected annualized Sharpe against the number of candidate models. A control switches to the familywise probability that at least one candidate clears a one-sided nominal 5 percent threshold. Candidate statistics follow an equicorrelated Gaussian model. Exact one-dimensional identities are evaluated by high-accuracy adaptive quadrature. Five years of daily observations are assumed only to set the scale of the Sharpe estimator. The figure contains no Atamus strategy data.

Candidate count	Correlation	Expected selected Sharpe	Familywise probability
1	0.00	0.000000	5.0000%
2	0.00	0.252313	9.7500%
5	0.00	0.520094	22.6219%
10	0.00	0.688151	40.1263%
20	0.00	0.835160	64.1514%
50	0.00	1.005816	92.3055%
100	0.00	1.121430	99.4079%
200	0.00	1.228068	99.9965%
500	0.00	1.358053	100.0000%
1	0.25	0.000000	5.0000%
2	0.25	0.218510	9.3857%
5	0.25	0.450414	19.9426%
10	0.25	0.595956	32.2546%
20	0.25	0.723270	47.3737%
50	0.25	0.871062	67.6278%
100	0.25	0.971187	80.0832%
200	0.25	1.063538	88.8723%
500	0.25	1.176109	95.4893%
1	0.50	0.000000	5.0000%
2	0.50	0.178412	8.7811%
5	0.50	0.367762	16.6327%
10	0.50	0.486596	24.6621%
20	0.50	0.590547	33.9489%
50	0.50	0.711220	46.9199%
100	0.50	0.792971	56.3827%
200	0.50	0.868375	65.0025%
500	0.50	0.960289	74.6751%
1	0.75	0.000000	5.0000%
2	0.75	0.126157	7.7990%
5	0.75	0.260047	12.6116%
10	0.75	0.344076	16.9049%
20	0.75	0.417580	21.5689%
50	0.75	0.502908	28.0411%
100	0.75	0.560715	33.0105%
200	0.75	0.614034	37.9390%
500	0.75	0.679027	44.2755%

White's Reality Check, Hansen's test for superior predictive ability, false-discovery controls, and selection-adjusted performance statistics address different versions of this problem.[1][2][3][4] No single procedure is universally correct. The appropriate method depends on the estimand, the benchmark, the dependence among candidates, and what information about the research search has been recorded.

The operational lesson is narrower and more demanding:

A performance statistic cannot be interpreted without the process that selected it.

A research record that preserves only the winner discards information required to adjust inference for the search that produced it.

3. Calendar length is not information length

The number of rows in a dataset is not the number of independent observations it contains.

Let \(X_t\) be covariance-stationary with variance \(\gamma_0\) and autocorrelation sequence \(\rho_k=\gamma_k/\gamma_0\). Then

\operatorname{Var}(\bar X_n) = \frac{\gamma_0}{n} \left[ 1+2\sum_{k=1}^{n-1} \left(1-\frac{k}{n}\right)\rho_k \right].

For inference on the sample mean, this motivates the variance-equivalent effective sample size

n_{\mathrm{eff}}^{(\bar X)} = \frac{n}{ 1+2\sum_{k=1}^{n-1} \left(1-\frac{k}{n}\right)\rho_k }.

This is not a universal information count. It is the number of independent observations with marginal variance \(\gamma_0\) that would give the same variance for \(\bar X_n\). A different estimand can have a different effective sample size.

For an AR(1) process with \(\rho_k=\varphi^k\), the large-sample approximation is

n_{\mathrm{eff}}^{(\bar X)} \approx n\frac{1-\varphi}{1+\varphi}.

Consider a nominal five-year daily sample with \(n=1{,}260\). The exact finite-sample calculation gives:

φ	n_eff for the mean	Effective years for the mean	Standard-error inflation
0.10	1,031.1	4.09	1.11
0.25	756.3	3.00	1.29
0.50	420.4	1.67	1.73
0.75	180.5	0.72	2.64

At \(\varphi=0.50\), inference on the mean from five calendar years has the same variance as inference from approximately 1.67 independent years with the same marginal variance. The IID variance formula understates the standard error of the sample mean by a factor of approximately 1.73 under this model.
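The exact finite-sample quantities in the table follow directly from the variance formula for \(\bar X_n\) with \(\rho_k=\varphi^k\). A minimal sketch, with illustrative function names:

```python
import math


def ar1_effective_sample_size(n: int, phi: float) -> float:
    """Variance-equivalent effective sample size for the mean of an AR(1)."""
    # Finite-sample weighted autocorrelation sum with rho_k = phi**k.
    weighted_sum = sum((1.0 - k / n) * phi ** k for k in range(1, n))
    return n / (1.0 + 2.0 * weighted_sum)


def standard_error_inflation(n: int, phi: float) -> float:
    """Ratio of the true standard error of the mean to the IID formula."""
    return math.sqrt(n / ar1_effective_sample_size(n, phi))


n_eff = ar1_effective_sample_size(1260, 0.50)
effective_years = n_eff / 252
inflation = standard_error_inflation(1260, 0.50)
large_sample = 1260 * (1 - 0.50) / (1 + 0.50)  # approximation n(1-phi)/(1+phi)
```

For \(\varphi=0.50\) the exact value sits slightly above the large-sample approximation of 420, because the finite-sample weights \(1-k/n\) shrink each autocorrelation term.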

Figure 2: Variance-equivalent information retained (autocorrelation changes the fraction of information retained)

Figure 2. Variance-equivalent information retained for the sample mean. The fixed curve shows the large-sample AR(1) information-retention ratio, \(n_{\mathrm{eff}}/n\) approximately equal to \((1-\varphi)/(1+\varphi)\). The horizon control updates exact finite-sample effective years and standard-error inflation for \(n = 252Y\) without rescaling the chart. Values above 100 percent can occur under negative autocorrelation. No market or strategy data is used.

Nominal years	5
Nominal observations	1260
AR(1) coefficient phi	0.50
Large-sample information-retention ratio	33.33%
Exact finite-sample effective sample size	420.4449
Exact finite-sample information-retention ratio	33.3686%
Exact effective years	1.6684
Standard-error inflation	1.7311x

Assumptions: AR(1) dependence; 252 observations per nominal year; estimand is the sample mean; calculations are analytical and model-based; no market data is used; no Atamus Capital strategy data is used.

The AR(1) calculation is not offered as a model of every return series. It is an instrument for showing how quickly nominal information can collapse under dependence. In applied research, dependence can arise from overlapping labels, persistent exposures, repeated observations of the same underlying state, cross-sectional commonality, stale prices, smoothing, or portfolio construction. Each source requires its own treatment.

The same principle applies to candidate strategies. One hundred highly correlated trials are not one hundred independent trials. Yet replacing the literal trial count with a vaguely defined effective count can create a different form of false precision. Dependence must be estimated, uncertainty around that estimate must be acknowledged, and the research archive must preserve enough information to make the calculation possible.

4. Nominal confidence is conditional confidence

A confidence interval is not a property of the observed statistic alone. Its coverage depends on the data-generating process and the estimator used for uncertainty.

We tested this directly through 100,000 Monte Carlo replications for each of four stylized processes. Every process has true mean zero and unit unconditional variance. Each replication contains \(T=504\) observations.

The four processes are:

independent Gaussian observations;
independent standardized Student \(t_5\) observations;
AR(1) observations with Gaussian innovations and \(\varphi=0.30\);
AR(1) observations with standardized Student \(t_5\) innovations and \(\varphi=0.30\).

Let \(\varepsilon_t\) be independent, mean-zero, unit-variance innovations, Gaussian or standardized Student \(t_5\) as specified above. The simulated sample is indexed \(X_1,\ldots,X_T\). For the AR(1) cases, the experiment sets \(X_1=\varepsilon_1\) and, for \(t=2,\ldots,T\),

X_t = \varphi X_{t-1} + \sqrt{1-\varphi^2}\,\varepsilon_t.

This construction has \(\mathbb E[X_t]=0\), \(\operatorname{Var}(X_t)=1\), and \(\operatorname{Cov}(X_t,X_{t-k})=\varphi^k\) for every admissible \(t\) and \(k\). The process is therefore covariance-stationary. With non-Gaussian innovations, strict stationarity is not asserted, because the law of \(X_1\) need not coincide with the marginal law of later observations.

The naive interval is

\bar X \pm z_{0.975}\frac{s}{\sqrt T},

where

s^2 = \frac{1}{T-1} \sum_{t=1}^{T} \left(X_t-\bar X\right)^2.

The dependence-aware interval uses a Bartlett-kernel Newey-West estimator of long-run variance with lag \(L=10\). Define

\widehat{\gamma}_k = \frac{1}{T} \sum_{t=k+1}^{T} \left(X_t-\bar X\right) \left(X_{t-k}-\bar X\right), \qquad k=0,1,\ldots,L,

and

\widehat{\Omega}_L = \widehat{\gamma}_0 + 2\sum_{k=1}^{L} \left(1-\frac{k}{L+1}\right) \widehat{\gamma}_k.

The resulting interval is

\bar X \pm z_{0.975} \sqrt{\frac{\widehat{\Omega}_L}{T}}.

An oracle standard-error interval uses the exact finite-sample variance of the sample mean together with the same Gaussian critical value. It is a diagnostic benchmark, not an exactly calibrated finite-sample interval for every non-Gaussian process, and it is unavailable in real research.
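The two feasible intervals defined above can be sketched with the standard library alone. The code covers only the Gaussian AR(1) case, uses a much smaller replication count than the note's experiment, and relies on a different random number stream, so it illustrates the mechanism rather than reproducing the reported figures. Function names and defaults are assumptions of this sketch.

```python
import math
import random
from statistics import NormalDist

Z_975 = NormalDist().inv_cdf(0.975)


def simulate_ar1(T: int, phi: float, rng: random.Random) -> list:
    """X_1 = eps_1, then X_t = phi*X_{t-1} + sqrt(1-phi^2)*eps_t (Gaussian)."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - phi * phi)
    for _ in range(T - 1):
        x.append(phi * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


def naive_half_width(x: list) -> float:
    """IID half-width: z * s / sqrt(T) with the (T-1)-denominator variance."""
    T = len(x)
    mean = sum(x) / T
    s2 = sum((v - mean) ** 2 for v in x) / (T - 1)
    return Z_975 * math.sqrt(s2 / T)


def hac_half_width(x: list, lags: int = 10) -> float:
    """Bartlett-kernel Newey-West half-width with a fixed lag count."""
    T = len(x)
    mean = sum(x) / T
    d = [v - mean for v in x]
    omega = sum(v * v for v in d) / T  # gamma_0
    for k in range(1, lags + 1):
        gamma_k = sum(d[t] * d[t - k] for t in range(k, T)) / T
        omega += 2.0 * (1.0 - k / (lags + 1)) * gamma_k
    return Z_975 * math.sqrt(max(omega, 0.0) / T)


def coverage(phi: float, T: int = 504, reps: int = 500, seed: int = 0):
    """Share of replications whose interval contains the true mean of zero."""
    rng = random.Random(seed)
    naive_hits = hac_hits = 0
    for _ in range(reps):
        x = simulate_ar1(T, phi, rng)
        mean = sum(x) / T
        naive_hits += abs(mean) <= naive_half_width(x)
        hac_hits += abs(mean) <= hac_half_width(x)
    return naive_hits / reps, hac_hits / reps
```

A call such as `coverage(0.30, reps=500)` returns the naive and HAC coverage estimates as a pair. With a few hundred replications the Monte Carlo standard error is on the order of one to two percentage points, so only the gap between the two methods, not the second decimal, should be read from it.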

Data-generating process	Naive IID interval	HAC interval	Oracle interval
IID Gaussian	94.95%	94.34%	95.01%
IID Student \(t_5\)	95.01%	94.39%	95.06%
AR(1), Gaussian innovations, \(\varphi=0.30\)	84.92%	93.59%	95.08%
AR(1), Student \(t_5\) innovations, \(\varphi=0.30\)	84.83%	93.53%	95.01%

The largest Monte Carlo standard error in the table is approximately 0.113 percentage points.

Figure 3: Nominal and empirical coverage (stylized 100,000-replication experiment)

Figure 3. Empirical coverage of nominal 95 percent intervals. Grouped bars compare the IID, HAC, and oracle intervals across the four processes. A horizontal reference marks 95 percent. Tooltips display coverage, Monte Carlo standard error, and mean interval half-width. The oracle bar is visually de-emphasized and labelled as an unavailable benchmark.

Data-generating process	Method	Coverage	Monte Carlo SE	Mean half-width
IID Gaussian	IID	94.950%	0.000692	0.087266
IID Gaussian	HAC	94.343%	0.000731	0.086057
IID Gaussian	ORACLE	95.010%	0.000689	0.087304
IID Student t(5)	IID	95.009%	0.000689	0.087157
IID Student t(5)	HAC	94.386%	0.000728	0.085944
IID Student t(5)	ORACLE	95.055%	0.000686	0.087304
AR(1), Gaussian innovations	IID	84.920%	0.001132	0.087193
AR(1), Gaussian innovations	HAC	93.586%	0.000775	0.113550
AR(1), Gaussian innovations	ORACLE	95.075%	0.000684	0.118897
AR(1), Student t(5) innovations	IID	84.828%	0.001134	0.087083
AR(1), Student t(5) innovations	HAC	93.532%	0.000778	0.113350
AR(1), Student t(5) innovations	ORACLE	95.011%	0.000688	0.118897

The result is not that one estimator should always replace another. Under the independent processes, the fixed-lag HAC intervals are slightly narrower on average in this finite sample and show modest undercoverage. Under serial dependence, the HAC interval materially improves coverage but does not completely recover the nominal rate. Bandwidth selection, persistence, tail behavior, and sample length remain consequential. Newey and West established a positive semi-definite covariance construction that is consistent when its assumptions and bandwidth conditions are satisfied, not a guarantee of exact finite-sample coverage.[6] The fixed choice \(L=10\) is an explicit finite-sample design choice for this experiment. Because the AR(1) process has nonzero autocovariances beyond lag 10, this fixed-lag estimator is not presented as an asymptotically consistent long-run variance estimator for that process.

This is the distinction between a robustness method and a robustness result. A method can be theoretically valid under stated conditions while a finite sample still contains too little information for the requested precision.

5. Specification robustness is geometric

A model that works only at one narrow parameter combination is structurally different from a model whose conclusion persists across a defensible neighborhood.

Let \(J(\theta)\) be a validation objective at specification \(\theta\). Define local fragility within radius \(\epsilon\) as

\mathcal F_{\epsilon}(\theta) = \sup_{\|\delta\|\leq\epsilon} \left| J(\theta+\delta)-J(\theta) \right|.

A complementary measure is the stability volume

\mathcal V_{\epsilon,\tau}(\theta) = \frac{ \nu\left( \left\{\delta\in B_{\epsilon}: J(\theta+\delta)\geq\tau \right\} \right) }{ \nu(B_{\epsilon}) },

where \(B_{\epsilon}\) is the perturbation ball and \(\nu\) is its volume measure.

Peak height and stability volume answer different questions. The following controlled construction makes the distinction exact:

J_{\sigma}(\theta) = 0.80\exp\!\left( -\frac{\|\theta\|^2}{2\sigma^2} \right) -0.05, \qquad \theta\in[-1,1]^2.

Two surfaces from this family, one broad and one narrow, are compared below. For any \(\sigma>0\) the maximum is the same:

J_{\sigma}(0)=0.75.

At the common optimum, the Hessian is

\nabla^2 J_{\sigma}(0) = -\frac{0.80}{\sigma^2}I_2.

The peak-curvature magnitude reported in the interactive figure is therefore

\kappa_{\sigma} = \frac{0.80}{\sigma^2},

the absolute value of either Hessian eigenvalue at the optimum. One surface is broad, with \(\sigma=0.45\). The other is narrow, with \(\sigma=0.12\). Set the evidence threshold to \(\tau=0.20\) and the admissible perturbation radius to \(\epsilon=0.50\). The threshold radius is

r_{\tau} = \sigma \sqrt{ -2\log\!\left( \frac{\tau+0.05}{0.80} \right) }.

Because the surface is radially symmetric and the perturbation is centered at the common optimum \(\theta=0\), the stability volume is

\mathcal V_{\epsilon,\tau}(0) = \min\left \{1,\frac{r_{\tau}^2}{\epsilon^2}\right\}.

For the broad surface, \(r_{\tau}=0.686\), so the entire perturbation ball remains above the threshold and \(\mathcal V=1.000\). For the narrow surface, \(r_{\tau}=0.183\), so only 13.40 percent of the same perturbation ball survives. The peak value is identical.
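The closed-form threshold radius and stability volume can be computed directly, and a uniform Monte Carlo draw over the perturbation disc gives an independent check. This is a minimal sketch. The function names, draw count, and seed are illustrative assumptions, and the radius formula is valid only for thresholds with \(-0.05<\tau<0.75\).

```python
import math
import random


def objective(theta1: float, theta2: float, sigma: float) -> float:
    """J_sigma(theta) = 0.80 * exp(-||theta||^2 / (2 sigma^2)) - 0.05."""
    return 0.80 * math.exp(-(theta1 ** 2 + theta2 ** 2) / (2.0 * sigma ** 2)) - 0.05


def threshold_radius(sigma: float, tau: float) -> float:
    """Radius at which J_sigma falls to tau (requires -0.05 < tau < 0.75)."""
    return sigma * math.sqrt(-2.0 * math.log((tau + 0.05) / 0.80))


def stability_volume(sigma: float, tau: float, eps: float) -> float:
    """Closed-form share of the eps-disc around the optimum with J >= tau."""
    r_tau = threshold_radius(sigma, tau)
    return min(1.0, r_tau ** 2 / eps ** 2)


def stability_volume_mc(sigma: float, tau: float, eps: float,
                        draws: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo check using points drawn uniformly from the disc."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        radius = eps * math.sqrt(rng.random())  # sqrt gives uniform area
        angle = 2.0 * math.pi * rng.random()
        theta1, theta2 = radius * math.cos(angle), radius * math.sin(angle)
        hits += objective(theta1, theta2, sigma) >= tau
    return hits / draws


broad = stability_volume(0.45, tau=0.20, eps=0.50)
narrow = stability_volume(0.12, tau=0.20, eps=0.50)
```

The Monte Carlo version is unnecessary for a radially symmetric surface, but it is the form that generalizes when no closed-form threshold contour exists.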

Figure 4: Specification fragility surface (analytical objective surface; controls: tau, epsilon)

Figure 4. Equal peaks, unequal stability. An interactive contour field switches between the broad and narrow surfaces. The user can change the perturbation radius \(\epsilon\) and threshold \(\tau\). The admissible ball, threshold contour, peak-curvature magnitude, and stability-volume fraction update in real time. The construction is analytical, not fitted to a strategy.

Formula: \(J_{\sigma}(\theta)=0.80\exp(-\lVert\theta\rVert^2/(2\sigma^2))-0.05\). Default threshold: 0.20. Default perturbation radius: 0.50.

Surface	Sigma	Peak value	Threshold radius	Stability-volume fraction
Broad	0.450	0.750	0.68635	100.00%
Narrow	0.120	0.750	0.18303	13.40%

A flat surface is not proof that a model is correct. A misspecified model can be smoothly wrong. Nor is every sharp optimum invalid. Some physical and economic mechanisms are genuinely localized. The point is that sensitivity geometry is evidence that must be interpreted, not a cosmetic chart placed around the selected parameter.

The perturbation set should also be established without consulting the desired result. If the radius is chosen after the surface is observed, the robustness test becomes another optimization variable.

6. Stability within a regime is not stability across regimes

Parameter uncertainty and distributional change are distinct problems.

Suppose returns satisfy

R_t\mid Z_t=z \sim \mathcal D(\mu_z,\sigma_z,\gamma_z),

where \(Z_t\) denotes a latent state and \(\gamma_z\) denotes any additional shape or tail parameter required by the chosen distributional family. A model may estimate \(\mu_z\) precisely conditional on a state and still fail because the state distribution, transition mechanism, or mapping from state to returns changes.

A simple Markov representation is

\mathbb P(Z_t=j\mid Z_{t-1}=i)=p_{ij}.

This formalizes conditional environments. It does not establish that the states are economically real, correctly identified, or persistent. Structural-break procedures can estimate and test changes in parameters under stated conditions, but they do not reveal future break dates.[8]

A distributionally robust expression makes the logical problem explicit:

\mathcal W_{\rho}(S) = \inf_{Q\in\mathcal U_{\rho}(P_0)} \mathbb E_Q\left[u(R^S)\right],

where \(P_0\) is an estimated distribution, \(\mathcal U_{\rho}(P_0)\) is a defensible uncertainty set around it, and \(u\) is a stated utility or evaluation function. The result depends on the geometry and radius of \(\mathcal U_{\rho}\). An uncertainty set broad enough to include everything is uninformative. One narrow enough to exclude relevant change is comforting but false.

For this reason, regime robustness should not be reduced to showing that a model worked in several historical subperiods. Subperiods are themselves selected, often correlated, and may repeat the same underlying economic state. A credible analysis asks which aspects of the model are expected to remain invariant, which are allowed to move, what evidence would contradict the invariance claim, and how quickly deterioration could be detected.

7. Implementation is a random variable

A modeled return is not the return available to capital.

Write

R_t^{\mathrm{net}} = R_t^{\mathrm{model}} -C_t-L_t-E_t,

where \(C_t\) denotes direct transaction and financing costs, \(L_t\) denotes liquidity and market-impact effects, and \(E_t\) denotes timing, execution, operational, and measurement error.

Treating these terms as fixed deductions is often insufficient. Their distributions may depend on market state, strategy activity, and the same conditions that affect gross return. Even when the expectation decomposes linearly,

\mathbb E[R_t^{\mathrm{net}}] = \mathbb E[R_t^{\mathrm{model}}] - \mathbb E[C_t+L_t+E_t],

risk does not:

\begin{aligned} \operatorname{Var}(R_t^{\mathrm{net}}) ={}& \operatorname{Var}(R_t^{\mathrm{model}}) + \operatorname{Var}(C_t+L_t+E_t)\\ &- 2\operatorname{Cov} \left( R_t^{\mathrm{model}}, C_t+L_t+E_t \right). \end{aligned}

If friction rises when the modeled return is weakest, the covariance is negative, the term \(-2\operatorname{Cov}(\cdot,\cdot)\) is positive, and net variance rises above \(\operatorname{Var}(R_t^{\mathrm{model}})+\operatorname{Var}(C_t+L_t+E_t)\). This is one reason execution cannot be appended as a constant haircut after model selection. The classical optimal-execution literature formalizes the tradeoff between expected cost and uncertainty even in relatively simple impact models.[10]
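A small numerical illustration of the variance decomposition follows. The return and friction series are entirely hypothetical: modeled returns are Gaussian draws, and friction is a fixed charge plus a component that grows when the modeled return is negative. None of the numbers describes a real cost model.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance (denominator n), so the identity holds exactly."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


rng = random.Random(7)
n = 10_000

# Hypothetical modeled daily returns.
model = [rng.gauss(0.0005, 0.01) for _ in range(n)]
# Hypothetical friction: fixed charge plus a state-dependent part that
# rises when the modeled return is negative.
friction = [0.0002 + 0.05 * max(0.0, -r) for r in model]
net = [r - f for r, f in zip(model, friction)]

var_model = covariance(model, model)
var_friction = covariance(friction, friction)
cov_model_friction = covariance(model, friction)
var_net = covariance(net, net)

# Var(net) = Var(model) + Var(friction) - 2 Cov(model, friction)
decomposed = var_model + var_friction - 2.0 * cov_model_friction
```

Because friction here is largest when the modeled return is lowest, `cov_model_friction` is negative and `var_net` exceeds `var_model`, even though the average friction is small.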

A robust implementation analysis can be written as

\Psi_{\mathrm{impl}}^{\mathrm{worst}}(S) = \inf_{c\in\mathcal C} \Psi(S;c),

or, when \(c\) is stochastic,

\mathbb P_{c\sim Q_c} \left[ \Psi(S;c)\geq\tau \right] \geq 1-\alpha.

The uncertainty set \(\mathcal C\), the distribution \(Q_c\), and the threshold \(\tau\) are strategy-specific. Publishing them can reveal turnover, liquidity, capacity, holding period, or execution design. They are therefore intentionally absent from this note.

The public principle is sufficient:

Implementation robustness asks whether the inference survives uncertainty in the process that converts a model into realized positions and realized trades.

8. A historical drawdown is one path

Average return and volatility do not describe the sequence by which capital is gained or lost.

For a positive wealth process \(V_t\), define drawdown as

D_t = 1- \frac{V_t}{\max_{0\leq s\leq t}V_s},

and maximum drawdown over horizon \(T\) as

D_T^{\max} = \max_{0\leq t\leq T}D_t.

Maximum drawdown is a path functional. It depends on the order of returns, not only their marginal distribution. Its behavior has been studied analytically even for Brownian motion, where the distribution is already nontrivial.[11]
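A minimal sketch of the two definitions, with hypothetical wealth values, makes the order dependence concrete: the two paths below apply the same four gross returns in different orders and end at the same terminal wealth.

```python
def drawdown_series(wealth):
    """D_t = 1 - V_t / max_{s <= t} V_s for a positive wealth path."""
    running_peak = wealth[0]
    series = []
    for value in wealth:
        if value <= 0:
            raise ValueError("wealth must be strictly positive")
        running_peak = max(running_peak, value)
        series.append(1.0 - value / running_peak)
    return series


def max_drawdown(wealth):
    """Largest peak-to-trough loss along the path."""
    return max(drawdown_series(wealth))


def wealth_path(start, gross_returns):
    """Compound a sequence of gross returns from a starting wealth."""
    path = [start]
    for gross in gross_returns:
        path.append(path[-1] * gross)
    return path


# Hypothetical gross returns; path_b uses the same returns, reordered.
gross = [1.20, 0.75, 13 / 9, 0.80]
path_a = wealth_path(100.0, gross)          # 100, 120, 90, 130, 104
path_b = wealth_path(100.0, sorted(gross))  # 100, 75, 60, 72, 104

mdd_a = max_drawdown(path_a)
mdd_b = max_drawdown(path_b)
```

Both paths share the same marginal set of returns and the same terminal value, yet `mdd_a` is 25 percent and `mdd_b` is 40 percent.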

To demonstrate the difference between one realized drawdown and a drawdown distribution, we generated 100,000 independent three-year paths from a fully disclosed stylized process. Here, \(r_t\) is a daily log return:

\begin{aligned} r_t &= \frac{0.06}{252}+\varepsilon_t,\\ \varepsilon_t &= \sigma_t z_t,\\ u_t &\overset{\mathrm{iid}}{\sim} t_5,\qquad z_t=\sqrt{\frac{3}{5}}\,u_t,\\ \sigma_t^2 &= \omega +0.05\varepsilon_{t-1}^2 +0.93\sigma_{t-1}^2,\\ \omega &= (1-0.05-0.93) \frac{0.12^2}{252}. \end{aligned}

Because \(\operatorname{Var}(u_t)=5/3\), the scaled innovation \(z_t\) has unit variance. Since \(0.05+0.93<1\), the shock process has unconditional daily variance \(0.12^2/252\). The parameters therefore imply 6 percent annualized expected log return and 12 percent unconditional annualized volatility within this stipulated model.[9] They are illustrative assumptions, not estimates from market data, not calibrations to an Atamus strategy, and not Atamus return targets, volatility targets, or risk limits. Each path begins with conditional variance set to the unconditional variance, then uses 756 retained observations after a 500-observation burn-in. The random seed is 20260622.
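The stipulated process can be sketched with the standard library as follows. The Student \(t_5\) draw is built as a normal variate divided by the square root of a scaled chi-square variate, which is an implementation choice of this sketch. Because the random number generator and seed differ from those used for the note, the output will not reproduce the reported quantiles exactly, and the function names and default path count are illustrative.

```python
import math
import random

MU_DAILY = 0.06 / 252
UNCOND_VAR = 0.12 ** 2 / 252
ARCH, GARCH = 0.05, 0.93
OMEGA = (1.0 - ARCH - GARCH) * UNCOND_VAR
T_SCALE = math.sqrt(3.0 / 5.0)  # rescales t_5 to unit variance


def standardized_t5(rng: random.Random) -> float:
    """Unit-variance Student t_5 draw: N(0,1) / sqrt(chi2_5 / 5), rescaled."""
    chi2 = rng.gammavariate(2.5, 2.0)  # chi-square with 5 degrees of freedom
    return T_SCALE * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 5.0)


def simulate_max_drawdown(rng: random.Random, retained: int = 756,
                          burn_in: int = 500) -> float:
    """Maximum drawdown of one retained path of the GARCH(1,1)-t process."""
    variance = UNCOND_VAR  # start at the unconditional variance
    log_wealth = peak = worst = 0.0
    for step in range(burn_in + retained):
        eps = math.sqrt(variance) * standardized_t5(rng)
        if step >= burn_in:  # wealth is tracked only after the burn-in
            log_wealth += MU_DAILY + eps
            peak = max(peak, log_wealth)
            worst = max(worst, 1.0 - math.exp(log_wealth - peak))
        variance = OMEGA + ARCH * eps * eps + GARCH * variance
    return worst


def drawdown_quantiles(n_paths: int = 2000, seed: int = 0,
                       probs=(0.50, 0.90, 0.95, 0.99)) -> dict:
    """Empirical quantiles of maximum drawdown across independent paths."""
    rng = random.Random(seed)
    ordered = sorted(simulate_max_drawdown(rng) for _ in range(n_paths))
    return {p: ordered[max(0, math.ceil(p * n_paths) - 1)] for p in probs}
```

Calling `drawdown_quantiles()` with a few thousand paths gives a coarse view of the distribution. Upper-tail quantiles such as the 99th percentile need far more paths to stabilize than the median does.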

The displayed path is replication 1. It was not selected by outcome. Its maximum drawdown is 15.99 percent. Across all paths, the median maximum drawdown is 15.39 percent, the 90th percentile is 26.16 percent, the 95th percentile is 30.27 percent, and the 99th percentile is 39.12 percent.

Quantile	Maximum drawdown
1%	6.88%
5%	8.52%
10%	9.60%
25%	11.89%
50%	15.39%
75%	20.30%
90%	26.16%
95%	30.27%
99%	39.12%

The quantiles are Monte Carlo estimates and therefore have simulation uncertainty. Let \(F_D\) denote the population distribution of maximum drawdown under the stipulated process. If \(F_D\) is continuous at its population \(p\)-quantile \(q_p\), so that \(F_D(q_p)=p\), and the simulated paths are independent, then

K_p = \sum_{i=1}^{N} \mathbf 1\!\left\{D_{T,i}^{\max}\leq q_p\right\} \sim \operatorname{Binomial}(N,p).

This gives finite-sample confidence intervals from simulation ranks. The exact binomial-rank intervals have at least 95 percent coverage. Their endpoints, rounded for display, are approximately 15.34 to 15.44 percent for the median, 30.14 to 30.43 percent for the 95th percentile, and 38.74 to 39.52 percent for the 99th percentile. These intervals quantify Monte Carlo error only. They do not quantify uncertainty about whether the stipulated return process is an adequate model of any market or strategy.
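One way to obtain such rank intervals is sketched below. It accumulates the binomial distribution function in log space and selects the order-statistic ranks \(l\) and \(u\) so that \(\mathbb P(l\leq K_p\leq u-1)\) is at least the requested level. The function name and return convention are assumptions of this sketch, and other equally valid rank choices exist.

```python
import math


def binomial_rank_interval(n: int, p: float, level: float = 0.95):
    """Ranks (l, u) with P(X_(l) <= q_p < X_(u)) >= level, plus that coverage.

    Ranks are 1-based. l = 0 or u = n + 1 means the corresponding bound
    is not available from the sample at this level.
    """
    alpha = 1.0 - level
    log_p, log_q = math.log(p), math.log1p(-p)
    lower, upper = 0, n + 1
    cdf = 0.0            # running F(k) = P(K <= k)
    cdf_below_lower = 0.0
    cdf_at_upper = 1.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        cdf += math.exp(log_pmf)
        if cdf <= alpha / 2.0:
            lower, cdf_below_lower = k + 1, cdf   # largest l with F(l-1) <= alpha/2
        if cdf >= 1.0 - alpha / 2.0:
            upper, cdf_at_upper = k + 1, cdf      # smallest u with F(u-1) >= 1-alpha/2
            break
    return lower, upper, cdf_at_upper - cdf_below_lower


lower_rank, upper_rank, coverage = binomial_rank_interval(100, 0.50)
```

Given simulated maximum drawdowns sorted in ascending order as `ordered`, the interval endpoints are `ordered[lower_rank - 1]` and `ordered[upper_rank - 1]`.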

Figure 5A: One displayed path (replication 1, not selected by outcome)

Figure 5A. One realized path. The wealth index and synchronized drawdown series for replication 1. A fixed annotation states that the path was not selected by outcome. Hovering a date reports wealth, running peak, current drawdown, and maximum drawdown to date.

Displayed path maximum drawdown: 15.99%. Terminal return: 12.32%. The path is replication 1 and was not selected by outcome.

Figure 5B: Maximum drawdown distribution (stylized 100,000-path experiment)

Figure 5B. Distribution of maximum drawdown. The default view is an empirical density with markers for the displayed path, median, 90th, 95th, and 99th percentiles. A control switches to the empirical cumulative distribution. The displayed-path marker remains synchronized with Figure 5A.

Quantile	Maximum drawdown	Monte Carlo rank interval
1%	6.88%	6.83% to 6.93%
5%	8.52%	8.49% to 8.55%
10%	9.60%	9.57% to 9.64%
25%	11.89%	11.85% to 11.93%
50%	15.39%	15.34% to 15.44%
75%	20.30%	20.23% to 20.37%
90%	26.16%	26.05% to 26.28%
95%	30.27%	30.14% to 30.43%
99%	39.12%	38.74% to 39.52%

If the first path were the only observed history, a researcher might report 15.99 percent as though it described the process. In this experiment, about 5 percent of paths generated by the same stipulated process lose more than 30.27 percent from peak to trough. The historical maximum is evidence about what happened. It is not a deterministic limit on what could happen under the same stipulated process.

A simulated drawdown distribution is also conditional. It inherits the assumptions of the return model, dependence structure, horizon, drift, tail law, and parameter estimates. For stationary weakly dependent data, block methods such as the stationary bootstrap can retain local dependence information that IID resampling discards under their stated regularity conditions. They remain conditional on the observed history and the block-length rule.[7] There is no assumption-free drawdown forecast.

9. Robustness cannot be compressed without loss

The temptation is to combine every diagnostic into one score:

\mathfrak R^{\star} = \sum_{k=1}^{K}w_k\mathfrak R_k.

This can be useful as a generic decision aid, but it can also conceal failure. A high score in parameter stability can offset weak selection control if the weights permit it. Strong historical drawdown behavior can offset implementation uncertainty. The arithmetic may be valid while the decision logic is not.
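A small numerical illustration makes the concealment concrete. The diagnostic names, the 0-to-1 scores, and the equal weights below are all hypothetical.

```python
# Hypothetical diagnostic scores on a 0-to-1 scale and hypothetical weights.
scores = {
    "parameter_stability": 0.95,
    "selection_control": 0.20,
    "drawdown_behavior": 0.90,
    "implementation_uncertainty": 0.85,
}
weights = {name: 0.25 for name in scores}

# Weighted composite, in the form of the sum above.
composite = sum(weights[name] * scores[name] for name in scores)
weakest_name, weakest_score = min(scores.items(), key=lambda item: item[1])

print(f"composite score: {composite:.3f}")
print(f"weakest diagnostic: {weakest_name} = {weakest_score:.2f}")
```

The composite is 0.725, a figure that looks respectable in isolation, while selection control sits at 0.20. Reporting the vector alongside any aggregate keeps the weak component visible.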

For public analytical purposes, this note treats robustness as an evidence architecture rather than a ranking number. Each claim should identify at least five objects:

\mathcal E = \left( \text{estimand}, \text{assumptions}, \text{uncertainty set}, \text{diagnostics}, \text{failure conditions} \right).
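One way to make the five objects explicit is a small record type that reports which of them has been left empty. This is a generic sketch, not a description of any internal tooling. The class name `EvidenceRecord` and the example field contents are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EvidenceRecord:
    """One robustness claim, with the five objects named in the tuple above."""
    estimand: str
    assumptions: tuple[str, ...]
    uncertainty_set: tuple[str, ...]
    diagnostics: tuple[str, ...]
    failure_conditions: tuple[str, ...]

    def missing(self):
        """Names of any objects left empty, so an incomplete claim is visible."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

claim = EvidenceRecord(
    estimand="mean of a simulated AR(1) process",
    assumptions=("stationarity", "weak dependence"),
    uncertainty_set=("Bartlett lag choice",),
    diagnostics=("interval coverage under a known law",),
    failure_conditions=(),
)
print(claim.missing())  # ['failure_conditions']
```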

A robust result is not one that survives every imaginable attack. It is one that survives a set of relevant, documented, and sufficiently severe challenges without relying on hidden changes to the claim.

Several boundaries follow.

First, robustness does not prove profitability. It reduces the set of plausible explanations for an observed result. It cannot eliminate model risk or future change.

Second, more tests do not automatically create more evidence. Tests that reuse the same information, inspect the same failure mode, or were chosen after seeing the result may add little.

Third, a failed robustness test is information. It should not be silently converted into a revised model and forgotten. The failed specification remains part of the research path and therefore part of selection-adjusted inference.

Fourth, robustness is conditional on admissibility. A model should not be required to survive perturbations that contradict its economic or mathematical premise. The burden is to define that boundary before the result is known.

Finally, live observation is not a ceremonial final box. It is a new source of evidence under conditions that simulation cannot completely reproduce. Even then, live evidence remains finite, path-dependent, and subject to selection.

10. Conclusion

Return is an observation. Robustness is the behavior of an inference under admissible alternatives.

A systematic strategy should not be judged by the height of one historical performance estimate. It should be studied through the research search that selected it, the dependence that governs uncertainty for the estimand, the specification geometry around it, the possibility of structural change, the uncertainty of implementation, and the distribution of paths capital may experience.

The calculations in this note show why. In the stylized five-year Gaussian estimator model, selection alone produces an expected annualized Sharpe above 1.1 after 100 independent trials under a zero-skill null. In the 504-observation AR(1) experiment, ignoring dependence reduces the actual coverage of a nominal 95 percent interval to approximately 84.9 percent. In the disclosed drawdown experiment, the 95th-percentile maximum drawdown is approximately 1.89 times the displayed path maximum.
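The first and third figures can be checked with a short calculation. The sketch assumes that, under zero skill, each annualized Sharpe estimate over five years is Gaussian with mean zero and standard error \(1/\sqrt{5}\). That is one reading of the stylized estimator model, stated here as an assumption. The expected maximum of independent standard normal draws is evaluated by numerical quadrature.

```python
import numpy as np
from scipy import integrate, stats

def expected_max_standard_normal(n_trials):
    """E[max of n_trials independent N(0,1) draws], by numerical quadrature."""
    def integrand(z):
        # Density of the maximum is n * phi(z) * Phi(z)^(n-1).
        return z * n_trials * stats.norm.pdf(z) * stats.norm.cdf(z) ** (n_trials - 1)
    value, _ = integrate.quad(integrand, -np.inf, np.inf)
    return value

years = 5
n_trials = 100
# Assumed standard error of an annualized Sharpe estimate under zero skill.
sharpe_se = 1.0 / np.sqrt(years)
expected_best = sharpe_se * expected_max_standard_normal(n_trials)
print(f"expected best-of-{n_trials} annualized Sharpe: {expected_best:.2f}")

# Ratio of the 95th-percentile maximum drawdown to the displayed path maximum.
drawdown_ratio = 30.27 / 15.99
print(f"drawdown ratio: {drawdown_ratio:.2f}")
```

Under these assumptions the expected best Sharpe should come out slightly above 1.1, and the drawdown ratio is 1.89.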

None of these facts establishes that a strategy is invalid. They establish that return is not the first question.

The first question is whether the conclusion survives the process required to believe it.

For Atamus Capital, that is the meaning of robustness before return.

Methodology and reproducibility

All numerical results in this note are generated by the accompanying reproducibility package. No market data, Atamus strategy data, or third-party return series is used. The numerical examples are analytical constructions or Monte Carlo estimates under the assumptions stated in the article.

Random seed: 20260622
Selection calculations: exact identities evaluated by adaptive numerical quadrature
Confidence-interval experiment: 100,000 replications per process; standard Newey-West autocovariance normalization \(T^{-1}\) and Bartlett lag \(L=10\)
Drawdown experiment: 100,000 paths, 756 post-burn observations per path, 500-observation burn-in; point estimates use the linear sample-quantile convention; all reported quantiles are finite Monte Carlo estimates conditional on the stated GARCH-Student model, with at-least-95-percent binomial-rank intervals for simulation error
Parameter-surface experiment: exact analytical construction on an 81 by 81 grid
Software environment used for the published data: Python 3.13.5, NumPy 2.3.5, SciPy 1.17.0, pandas 2.2.3, Matplotlib 3.10.8
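The confidence-interval experiment listed above can be sketched at reduced scale. The Bartlett lag of 10, the \(T^{-1}\) autocovariance normalization, and the 504 observations follow the note. The AR(1) coefficient of 0.3, the 200-observation burn-in, the 5,000 replications, and the normal critical value are assumptions made for this illustration, so the output is not expected to reproduce the published figures exactly.

```python
import numpy as np

def newey_west_se(x, lag):
    """Standard error of the sample mean with Bartlett-weighted autocovariances."""
    n = x.shape[-1]
    d = x - x.mean(axis=-1, keepdims=True)
    long_run_var = (d * d).sum(axis=-1) / n
    for k in range(1, lag + 1):
        gamma_k = (d[..., k:] * d[..., :-k]).sum(axis=-1) / n  # T^{-1} normalization
        long_run_var += 2.0 * (1.0 - k / (lag + 1)) * gamma_k
    return np.sqrt(long_run_var / n)

def coverage(phi, n_obs, n_rep, lag, seed, burn=200):
    """Share of replications whose interval contains the true mean of zero."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(n_rep, n_obs + burn))
    x = np.empty_like(eps)
    x[:, 0] = eps[:, 0]
    for t in range(1, n_obs + burn):
        x[:, t] = phi * x[:, t - 1] + eps[:, t]
    x = x[:, burn:]  # discard the burn-in
    mean = x.mean(axis=1)
    naive_se = x.std(axis=1, ddof=1) / np.sqrt(n_obs)
    hac_se = newey_west_se(x, lag)
    z = 1.959963984540054  # two-sided 95 percent normal critical value
    naive_cov = (np.abs(mean) <= z * naive_se).mean()
    hac_cov = (np.abs(mean) <= z * hac_se).mean()
    return naive_cov, hac_cov

naive_cov, hac_cov = coverage(phi=0.3, n_obs=504, n_rep=5_000, lag=10, seed=0)
print(f"naive IID coverage: {naive_cov:.3f}")
print(f"Newey-West coverage: {hac_cov:.3f}")
```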

The graph data file contains calculated analytical values, grid values, the exact sorted model-implied maximum-drawdown sample used by the empirical CDF, and the first simulated drawdown path. The complete experiments can be regenerated from source code. No proprietary Atamus Capital data is present.
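The simulation-error intervals for quantiles can be illustrated with one common distribution-free construction based on order statistics. Whether this matches the package's exact rank rule is an assumption. The lognormal sample below is a stand-in, not the simulated drawdown sample, and the name `quantile_rank_interval` is illustrative.

```python
import numpy as np
from scipy import stats

def quantile_rank_interval(sorted_sample, p, level=0.95):
    """Distribution-free interval for the p-quantile from order statistics.

    The number of draws below the true quantile is Binomial(n, p), so ranks
    taken from binomial quantiles bracket it with probability at least `level`.
    """
    n = len(sorted_sample)
    alpha = 1.0 - level
    lower_rank = int(stats.binom.ppf(alpha / 2, n, p))          # 1-based rank
    upper_rank = int(stats.binom.ppf(1 - alpha / 2, n, p)) + 1  # 1-based rank
    # Clipping only matters for extreme p or small n, where the guarantee weakens.
    lower_rank = max(lower_rank, 1)
    upper_rank = min(upper_rank, n)
    return sorted_sample[lower_rank - 1], sorted_sample[upper_rank - 1]

rng = np.random.default_rng(0)
# Stand-in sample: lognormal draws, not the article's simulated drawdowns.
sample = np.sort(rng.lognormal(mean=-1.8, sigma=0.4, size=100_000))
for p in (0.50, 0.95, 0.99):
    point = np.quantile(sample, p)  # NumPy's default is the linear convention
    low, high = quantile_rank_interval(sample, p)
    print(f"{p:.0%}: {point:.4f}  [{low:.4f}, {high:.4f}]")
```

The interval widens toward the upper tail because fewer simulated paths fall there, which is the same pattern visible in the 99 percent row of the drawdown table.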

References

[1] Halbert White, "A Reality Check for Data Snooping", Econometrica, 68(5), 1097-1126, 2000.

[2] Peter R. Hansen, "A Test for Superior Predictive Ability", Journal of Business & Economic Statistics, 23(4), 365-380, 2005.

[3] Campbell R. Harvey, Yan Liu, and Heqing Zhu, "... and the Cross-Section of Expected Returns", Review of Financial Studies, 29(1), 5-68, 2016.

[4] David H. Bailey and Marcos Lopez de Prado, "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality", Journal of Portfolio Management, 40(5), 94-107, 2014.

[5] Andrew W. Lo, "The Statistics of Sharpe Ratios", Financial Analysts Journal, 58(4), 36-52, 2002.

[6] Whitney K. Newey and Kenneth D. West, "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix", Econometrica, 55(3), 703-708, 1987.

[7] Dimitris N. Politis and Joseph P. Romano, "The Stationary Bootstrap", Journal of the American Statistical Association, 89(428), 1303-1313, 1994.

[8] Jushan Bai and Pierre Perron, "Computation and Analysis of Multiple Structural Change Models", Journal of Applied Econometrics, 18(1), 1-22, 2003.

[9] Tim Bollerslev, "Generalized Autoregressive Conditional Heteroskedasticity", Journal of Econometrics, 31(3), 307-327, 1986.

[10] Robert Almgren and Neil Chriss, "Optimal Execution of Portfolio Transactions", Journal of Risk, 3(2), 5-39, 2001.

[11] Malik Magdon-Ismail, Amir F. Atiya, Amrit Pratap, and Yaser S. Abu-Mostafa, "On the Maximum Drawdown of a Brownian Motion", Journal of Applied Probability, 41(1), 147-161, 2004.

Disclaimer

Research notes published by Atamus Capital are provided for general informational and research purposes only. They do not constitute investment advice, trading advice, a recommendation, an offer to sell, or a solicitation to buy any security, fund interest, account, or investment product.

This note does not disclose Atamus Capital's proprietary strategies, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, portfolio construction methods, execution processes, investment universe, research thresholds, model-development workflow, or investment decisions.