Mathematical Foundations

The Arithmetic of Multiple Discovery

False discovery control when research becomes a family of tests

Scope of this note

Atamus Capital does not publish proprietary strategy rules, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, investment universes, portfolio construction methods, execution processes, trade-level information, position-level information, or performance results. This note examines the mathematics of multiple discovery through closed-form calculations, deterministic numerical experiments, and controlled Monte Carlo experiments under fully specified probability laws. No internal hypothesis library, research pipeline, trial count, acceptance threshold, scoring rule, model-selection process, validation sequence, holding period, implementation assumption, or strategy-development workflow is disclosed. The statistical procedures discussed below are public tools for reasoning about multiplicity, false discovery, and selection-adjusted evidence. They are not a description of Atamus Capital’s internal research architecture.

Abstract

A single statistical test asks whether one observed result is surprising under one specified null. A research program asks something different. It asks which results remain credible after many related hypotheses have been considered, ranked, filtered, and interpreted. The arithmetic changes. A 5 percent test is not a 5 percent research process when it is repeated across dozens or hundreds of candidate effects. This note develops the mathematical foundations of multiple discovery for systematic investment research. We distinguish family-wise error control from false discovery rate control, derive the Bonferroni and Sidak corrections, state the Benjamini-Hochberg step-up procedure, examine the effect of dependence on the expected maximum of correlated test statistics, and connect the same logic to the deflated Sharpe ratio and minimum track-record length. The numerical results are analytical or model-based. They are included to study inference under disclosed assumptions, not to describe any Atamus Capital strategy.

1. A discovery is not a single event

Suppose a research process examines a family of hypotheses

\[ H_1, H_2, \ldots, H_m, \]

with corresponding p-values

\[ P_1, P_2, \ldots, P_m. \]

For each hypothesis, there is an unobserved truth state. Some null hypotheses are true and some are false. Let

\[ m_0 = \left\lvert \{i:H_i \text{ is truly null}\}\right\rvert, \qquad m_1=m-m_0. \]

A multiple-testing rule produces a rejection set

\[ \mathcal R \subseteq \{1,\ldots,m\}. \]

Let

\[ R = |\mathcal R|, \]

be the number of discoveries, and let

\[ V = \left\lvert \{i\in\mathcal R:H_i \text{ is truly null}\}\right\rvert, \]

be the number of false discoveries. The number of genuine discoveries is

\[ G=R-V. \]

The single-test question is usually framed as

\[ \mathbb P(\text{reject }H_i \mid H_i\text{ true}) \leq \alpha. \]

The research question is not the same. Once many hypotheses have been examined, we must decide which error quantity is being controlled. Two quantities dominate the discussion:

\[ \operatorname{FWER}=\mathbb P(V\geq 1), \]

and

\[ \operatorname{FDR}=\mathbb E\left[\frac{V}{\max(R,1)}\right]. \]

FWER asks whether the research process makes at least one false discovery. FDR asks what fraction of published discoveries is false on average. These are different questions. The first is useful when a single false claim is intolerable. The second is often more relevant when research intentionally screens a large family of potential effects.

Atamus distinguishes the historical presence of a result from the evidentiary status of a result. A discovery is not only an observation. It is an observation after a selection rule has acted on a research family.

2. The family-wise arithmetic

Assume first that all null hypotheses are true, all p-values are independent, and each p-value is exactly uniform on \([0,1]\) under the null. If every hypothesis is tested at level \(\alpha\), the probability of at least one false discovery is

\[ \operatorname{FWER}(m,\alpha)=1-(1-\alpha)^m. \]

At \(\alpha=0.05\),

\[ \operatorname{FWER}(20,0.05)=1-0.95^{20}=64.1514\%, \]

and

\[ \operatorname{FWER}(100,0.05)=1-0.95^{100}=99.4079\%. \]

With 500 independent null tests, the same calculation gives

\[ \operatorname{FWER}(500,0.05)=99.9999999993\%. \]

This does not say that real research consists of independent null hypotheses. It says something simpler and more damaging to naive interpretation: ordinary single-test significance does not scale to a research family.
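The inflation is easy to reproduce. The following is a minimal Python sketch, assuming independent and exactly uniform null p-values as in the formula above. The function name `fwer_independent` is illustrative and does not come from any library. It uses `log1p` and `expm1` so that the result stays accurate when \(m\) is large.

```python
import math


def fwer_independent(m: int, alpha: float) -> float:
    """P(at least one false rejection) for m independent true nulls,
    each tested at level alpha: 1 - (1 - alpha)^m."""
    # (1 - alpha)^m = exp(m * log(1 - alpha)); expm1/log1p avoid cancellation.
    return -math.expm1(m * math.log1p(-alpha))


for m in (1, 20, 100, 500):
    print(f"m={m:4d}  FWER={fwer_independent(m, 0.05):.10%}")
```

For \(m=20\) and \(m=100\) the output agrees with the 64.1514% and 99.4079% values quoted above.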

Figure 1. Family-wise error inflation: a single-test level does not scale to a research family. The unadjusted curve shows \(1-(1-\alpha)^m\) under independent null p-values. Bonferroni and Sidak lines show how per-test thresholds control family-wise error under their respective assumptions. No market or strategy data is used.

\[ \begin{array}{c|c|c|c|c|c|c}
\alpha & m & \text{Unadjusted FWER} & \text{Bonferroni FWER} & \text{Sidak FWER} & \text{Bonferroni threshold} & \text{Sidak threshold} \\ \hline
1\% & 1 & 1.0000\% & 1.0000\% & 1.0000\% & 0.010000 & 0.010000 \\
1\% & 20 & 18.2093\% & 0.9953\% & 1.0000\% & 0.000500 & 0.000502 \\
1\% & 100 & 63.3968\% & 0.9951\% & 1.0000\% & 0.000100 & 0.000100 \\
1\% & 500 & 99.3430\% & 0.9950\% & 1.0000\% & 0.000020 & 0.000020 \\
1\% & 1000 & 99.9957\% & 0.9950\% & 1.0000\% & 0.000010 & 0.000010 \\
5\% & 1 & 5.0000\% & 5.0000\% & 5.0000\% & 0.050000 & 0.050000 \\
5\% & 20 & 64.1514\% & 4.8830\% & 5.0000\% & 0.002500 & 0.002561 \\
5\% & 100 & 99.4079\% & 4.8782\% & 5.0000\% & 0.000500 & 0.000513 \\
5\% & 500 & 100.0000\% & 4.8773\% & 5.0000\% & 0.000100 & 0.000103 \\
5\% & 1000 & 100.0000\% & 4.8772\% & 5.0000\% & 0.000050 & 0.000051 \\
10\% & 1 & 10.0000\% & 10.0000\% & 10.0000\% & 0.100000 & 0.100000 \\
10\% & 20 & 87.8423\% & 9.5390\% & 10.0000\% & 0.005000 & 0.005254 \\
10\% & 100 & 99.9973\% & 9.5208\% & 10.0000\% & 0.001000 & 0.001053 \\
10\% & 500 & 100.0000\% & 9.5172\% & 10.0000\% & 0.000200 & 0.000211 \\
10\% & 1000 & 100.0000\% & 9.5167\% & 10.0000\% & 0.000100 & 0.000105
\end{array} \]

Bonferroni control

The Bonferroni correction follows from the union bound. For any events \(A_1,\ldots,A_m\),

\[ \mathbb P\left(\bigcup_{i=1}^m A_i\right)\leq\sum_{i=1}^m \mathbb P(A_i). \]

Let \(A_i\) be the event that the \(i\)th true null is rejected. Testing each hypothesis at level \(\alpha/m\) gives

\[ \operatorname{FWER}=\mathbb P(V\geq1)\leq\sum_{i:H_i\text{ true}}\mathbb P(P_i\leq \alpha/m)\leq m_0\frac{\alpha}{m}\leq \alpha. \]

Bonferroni does not require independence. That is its strength. Its cost is conservatism, especially when hypotheses are dependent or when the research family is large.

For \(m=100\) and \(\alpha=0.05\), the Bonferroni per-test threshold is

\[ \frac{0.05}{100}=0.000500. \]

Sidak control

Under independence, the Sidak threshold is slightly less conservative. Choose a per-test threshold \(\alpha_S\) such that

\[ 1-(1-\alpha_S)^m=\alpha. \]

Solving gives

\[ \alpha_S=1-(1-\alpha)^{1/m}. \]

For \(m=100\) and \(\alpha=0.05\),

\[ \alpha_S=0.000512801. \]

The difference from Bonferroni is small at conventional levels, but the conceptual difference matters. Bonferroni is a bound valid under arbitrary dependence. Sidak is exact under the independent complete-null model and conservative when fewer than \(m\) null hypotheses are true, provided the relevant null p-values are independent.
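The two thresholds can be compared numerically. The sketch below assumes the independent complete-null model used in this section; the helper names are illustrative, not part of any library.

```python
import math


def bonferroni_threshold(m: int, alpha: float) -> float:
    """Per-test level alpha / m (valid under arbitrary dependence)."""
    return alpha / m


def sidak_threshold(m: int, alpha: float) -> float:
    """Per-test level 1 - (1 - alpha)^(1/m) (exact under independence)."""
    return -math.expm1(math.log1p(-alpha) / m)


def fwer_at(threshold: float, m: int) -> float:
    """FWER when m independent true nulls are each tested at `threshold`."""
    return -math.expm1(m * math.log1p(-threshold))


m, alpha = 100, 0.05
for name, t in (("Bonferroni", bonferroni_threshold(m, alpha)),
                ("Sidak", sidak_threshold(m, alpha))):
    print(f"{name:10s} threshold={t:.9f}  achieved FWER={fwer_at(t, m):.4%}")
```

Under these assumptions the Bonferroni threshold achieves a family-wise error slightly below \(\alpha\), while the Sidak threshold achieves \(\alpha\) itself.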

3. Why family-wise control is not the whole problem

FWER control is severe. That severity is appropriate when a research process must avoid even one false positive. But systematic research often has a different public question. If a broad search identifies a set of candidate effects, we may care less about the probability that one of them is false and more about the expected false fraction among the selected effects.

The realized false discovery proportion is

\[ \operatorname{FDP}=\frac{V}{\max(R,1)}. \]

Its expectation is the false discovery rate:

\[ \operatorname{FDR}=\mathbb E[\operatorname{FDP}]. \]

The distinction is critical. FDR does not guarantee that every realized research batch has a small false fraction. It controls the expected fraction across repeated applications of the procedure. A realized batch can have an FDP above the target even when the procedure controls FDR exactly.
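A small worked example makes the gap between a target and a realized proportion concrete. The sketch below takes the discovery counts reported in the Figure 2 data table of this note and computes the realized FDP for each target level; the function name is illustrative.

```python
def false_discovery_proportion(num_false: int, num_rejected: int) -> float:
    """FDP = V / max(R, 1); equals 0 when nothing is rejected."""
    return num_false / max(num_rejected, 1)


# (target q, discoveries R, false discoveries V) as reported for Figure 2
batches = [(0.05, 9, 1), (0.10, 11, 1), (0.20, 13, 1)]
for q, r, v in batches:
    fdp = false_discovery_proportion(v, r)
    print(f"q={q:.0%}  R={r}  V={v}  FDP={fdp:.2%}  above target: {fdp > q}")
```

In that single model-based batch, the realized FDP at \(q=5\%\) exceeds the target even though the procedure is applied as specified.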

This is why the error criterion must be named. A p-value alone does not say whether selection has been controlled. A list of p-values does not say whether discoveries have been controlled. The research process must define the family, the nulls, the rule, and the error rate.

4. The Benjamini-Hochberg procedure

Let the p-values be ordered as

\[ P_{(1)}\leq P_{(2)}\leq\cdots\leq P_{(m)}. \]

For a target FDR level \(q\in(0,1)\), the Benjamini-Hochberg procedure defines

\[ k^*=\max\left\{k\in\{1,\ldots,m\}:P_{(k)}\leq \frac{k}{m}q\right\}. \]

If the set is nonempty, reject all hypotheses corresponding to

\[ P_{(1)},\ldots,P_{(k^*)}. \]

If no such \(k\) exists, reject none.

The procedure is step-up. Rather than comparing every p-value to one fixed threshold, it compares the \(k\)th smallest p-value to a threshold that grows linearly with \(k\):

\[ \frac{k}{m}q. \]

Under independence of the full p-value vector, with valid null p-values, the procedure controls

\[ \operatorname{FDR}\leq \frac{m_0}{m}q\leq q. \]

The ratio \(m_0/m\) appears because false discoveries can only arise from true null hypotheses. If all hypotheses are null, then \(m_0=m\) and the upper bound is \(q\). If some hypotheses are genuinely non-null, the bound tightens to \((m_0/m)q\).

A useful proof sketch is to condition on the p-values other than a particular true null p-value \(P_i\). Under independence, \(P_i\) remains uniform and independent of the selection threshold induced by the other p-values. More formally, one sums the contribution of each true null over possible rejection counts while preserving the step-up self-consistency condition. This yields

\[ \mathbb E\left[\frac{V}{\max(R,1)}\right]\leq\sum_{i:H_i\text{ true}}\frac{q}{m}=\frac{m_0}{m}q. \]

The important message is not that BH is universally optimal. It is that discovery control can be formalized. A research family can be studied as a stochastic object rather than as a collection of persuasive anecdotes.
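The step-up rule of this section can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name and the example p-values are hypothetical. The scan runs from the largest rank downward, so the first rank that satisfies the boundary is \(k^*\).

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[int]:
    """Return the (sorted) indices rejected by the BH step-up rule at level q."""
    m = len(p_values)
    # Indices ordered so that p_values[order[0]] <= p_values[order[1]] <= ...
    order = sorted(range(m), key=lambda i: p_values[i])
    k_star = 0
    for rank in range(m, 0, -1):          # largest rank first
        if p_values[order[rank - 1]] <= rank * q / m:
            k_star = rank                 # largest k with P_(k) <= k q / m
            break
    return sorted(order[:k_star])


# Hypothetical p-values chosen to show the step-up behaviour.
example = [0.031, 0.9, 0.009, 0.032, 0.030]
print(benjamini_hochberg(example, q=0.05))
```

In this example the second and third smallest p-values (0.030 and 0.031) exceed their own rank thresholds (0.02 and 0.03), yet they are rejected because the fourth smallest (0.032) falls below its threshold (0.04). That is the sense in which the boundary is ordered rather than pointwise.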

Figure 2. Benjamini-Hochberg step-up geometry: discoveries are selected by an ordered boundary, not by a single p-value. Model-implied p-values are sorted and compared with the line \(kq/m\). The final accepted rank defines the discovery set. The displayed truth labels exist only because the experiment is model-based and fully specified. No market or strategy data is used.

\[ \begin{array}{c|c|c|c|c|c|c}
q & k^* & \text{Largest rejected } p & \text{BH boundary} & \text{True discoveries} & \text{False discoveries} & \text{Realized FDP} \\ \hline
5\% & 9 & 0.000891 & 0.000900 & 8 & 1 & 11.11\% \\
10\% & 11 & 0.001338 & 0.002200 & 10 & 1 & 9.09\% \\
20\% & 13 & 0.004327 & 0.005200 & 12 & 1 & 7.69\%
\end{array} \]

5. Dependence changes the arithmetic

Financial research rarely produces independent hypotheses. Candidate effects may share data, instruments, transformations, regimes, risk premia, calendar structure, execution assumptions, or economic mechanisms. Dependence does not invalidate multiple-testing mathematics. It changes the assumptions under which a particular correction is valid.

The ordinary BH procedure also controls FDR under certain positive-dependence conditions, usually stated through positive regression dependence on the subset of true nulls. When dependence is unknown or arbitrary, Benjamini and Yekutieli introduced a conservative modification. Let

\[ H_m=\sum_{j=1}^m \frac{1}{j}. \]

The adjusted step-up boundary is

\[ P_{(k)}\leq \frac{k}{m}\frac{q}{H_m}. \]

Because

\[ H_m\approx \log(m)+\gamma, \]

where \(\gamma\) is the Euler-Mascheroni constant, this correction becomes materially more conservative as the research family grows. For \(m=500\),

\[ H_{500}\approx 6.7928. \]

Thus, at a nominal \(q=10\%\), the arbitrary-dependence adjustment replaces \(q\) in the BH line with an effective level of approximately

\[ \frac{q}{H_{500}}=\frac{10\%}{6.7928}=1.472\%. \]

This conservatism is not a flaw. It is the price of making fewer assumptions about dependence. In practice, dependence structure is not a detail. It is part of the evidence. A research note that names only the number of tests, but not their dependence structure, has not fully specified the probability problem.
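The size of the Benjamini-Yekutieli adjustment can be checked directly. The sketch below computes the harmonic number by summation and compares the two boundaries at the first rank; the helper names are illustrative.

```python
def harmonic_number(m: int) -> float:
    """H_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / j for j in range(1, m + 1))


def by_boundary(k: int, m: int, q: float) -> float:
    """Benjamini-Yekutieli threshold for the k-th smallest p-value."""
    return k * q / (m * harmonic_number(m))


m, q = 500, 0.10
print(f"H_m                   = {harmonic_number(m):.4f}")
print(f"effective level q/H_m = {q / harmonic_number(m):.4%}")
print(f"BH boundary at k=1    = {q / m:.6f}")
print(f"BY boundary at k=1    = {by_boundary(1, m, q):.6f}")
```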

6. The expected maximum of correlated test statistics

Selection bias can be written as an extreme-value problem. Suppose a family of test statistics satisfies

\[ \mathbf T=(T_1,\ldots,T_m)^\top\sim \mathcal N(\mathbf 0,\Sigma) \]

under the complete null. The maximum statistic is

\[ M_m=\max_{1\leq i\leq m}T_i. \]

Its distribution is

\[ \mathbb P(M_m\leq x)=\Phi_m(x\mathbf 1;\Sigma), \]

where \(\Phi_m(\cdot;\Sigma)\) is the \(m\)-variate Gaussian distribution function.

In the independent standard-normal case,

\[ \mathbb E[M_m]=m\int_{-\infty}^\infty z\phi(z)\Phi(z)^{m-1}\,dz. \]

An equivalent numerically stable identity is

\[ \mathbb E[M_m]=\int_0^\infty \left[1-\Phi(x)^m-(1-\Phi(x))^m\right]dx. \]

For selected values,

\[ \mathbb E[M_{10}]=1.538753,\qquad\mathbb E[M_{100}]=2.507594,\qquad\mathbb E[M_{1000}]=3.241436. \]

These are expected maxima under the complete null. They are not expected skill. They are what selection can manufacture from noise.
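The second identity lends itself to simple quadrature. The following sketch, which assumes independent standard-normal statistics, applies a trapezoid rule on a truncated range; the truncation point and step count are illustrative choices, not values taken from this note.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def expected_max_std_normal(m: int, upper: float = 12.0, steps: int = 24000) -> float:
    """E[max of m iid N(0,1)] from the integral of 1 - Phi^m - (1 - Phi)^m on [0, upper]."""
    def integrand(x: float) -> float:
        cdf = std_normal_cdf(x)
        return 1.0 - cdf ** m - (1.0 - cdf) ** m

    h = upper / steps
    total = 0.5 * (integrand(0.0) + integrand(upper))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h


for m in (1, 10, 100, 1000):
    print(f"m={m:5d}  E[M_m] ~ {expected_max_std_normal(m):.6f}")
```

For \(m=1\) the integrand vanishes identically, which recovers \(\mathbb E[M_1]=0\): with a single pre-specified statistic there is no selection effect to correct.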

For an equicorrelated Gaussian family with common correlation \(\rho\in[0,1)\),

\[ T_i=\sqrt\rho\,U+\sqrt{1-\rho}\,\varepsilon_i, \]

where

\[ U,\varepsilon_1,\ldots,\varepsilon_m\overset{\text{ind}}{\sim}\mathcal N(0,1). \]

Then

\[ \max_i T_i=\sqrt\rho\,U+\sqrt{1-\rho}\max_i\varepsilon_i. \]

Taking expectations gives

\[ \mathbb E\left[\max_i T_i\right]=\sqrt{1-\rho}\,\mathbb E[M_m]. \]

This simple case shows why dependence matters. Positive common dependence reduces the expected maximum relative to independence, but it does not make selection disappear unless \(\rho=1\). The effective breadth of a research search is therefore neither the raw count of trials nor zero. It is a function of the dependence structure.
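The closed form can be checked by simulation. The sketch below draws from the one-factor representation above and compares the Monte Carlo mean with \(\sqrt{1-\rho}\,\mathbb E[M_m]\), using the value \(\mathbb E[M_{10}]=1.538753\) quoted earlier. The replication count and seed are arbitrary illustrative choices, so the estimate carries sampling error.

```python
import math
import random


def simulate_equicorrelated_max(m: int, rho: float, reps: int, seed: int) -> float:
    """Monte Carlo mean of max_i T_i with T_i = sqrt(rho) U + sqrt(1 - rho) eps_i."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    total = 0.0
    for _ in range(reps):
        u = rng.gauss(0.0, 1.0)  # common factor shared by all candidates
        total += max(a * u + b * rng.gauss(0.0, 1.0) for _ in range(m))
    return total / reps


E_M10 = 1.538753  # expected maximum of 10 independent N(0,1), from the text
for rho in (0.0, 0.25, 0.5, 0.75):
    estimate = simulate_equicorrelated_max(m=10, rho=rho, reps=20000, seed=0)
    closed_form = math.sqrt(1.0 - rho) * E_M10
    print(f"rho={rho:.2f}  simulated={estimate:.3f}  closed form={closed_form:.3f}")
```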

Figure 4. Expected maximum under dependence: correlation reduces the null maximum, but selection inflation remains. In the equicorrelated Gaussian null model, positive common correlation reduces the expected selected statistic by \(\sqrt{1-\rho}\) relative to independence, but selection inflation remains unless the candidates are perfectly common. No market or strategy data is used.

\[ \begin{array}{c|c|c}
m & \rho & \text{Expected maximum} \\ \hline
10 & 0.00 & 1.538753 \\
100 & 0.00 & 2.507594 \\
1000 & 0.00 & 3.241436 \\
10 & 0.25 & 1.332599 \\
100 & 0.25 & 2.171640 \\
1000 & 0.25 & 2.807166 \\
10 & 0.50 & 1.088062 \\
100 & 0.50 & 1.773136 \\
1000 & 0.50 & 2.292041 \\
10 & 0.75 & 0.769376 \\
100 & 0.75 & 1.253797 \\
1000 & 0.75 & 1.620718
\end{array} \]

7. Sharpe ratios are also discoveries

In finance, selection often appears as a performance statistic rather than a p-value. A strategy with an attractive Sharpe ratio may be selected from a large research family. The selected Sharpe is not distributed like a pre-specified Sharpe.

Let \(\widehat{SR}_p\) be an estimated per-period Sharpe ratio and let \(SR_p^*\) be a per-period benchmark Sharpe that the result must exceed. Under the probabilistic Sharpe ratio framework, a stylized finite-sample statistic can be written as

\[ \operatorname{PSR}(SR_p^*)=\Phi\left(\frac{(\widehat{SR}_p-SR_p^*)\sqrt{T-1}}{\sqrt{1-\widehat\gamma_3\widehat{SR}_p+\frac{\widehat\gamma_4-1}{4}\widehat{SR}_p^2}}\right), \]

where \(T\) is the sample length, \(\widehat\gamma_3\) is estimated skewness, and \(\widehat\gamma_4\) is estimated raw kurtosis, not excess kurtosis. The formula is written on the same periodic scale as the return observations. The deflated Sharpe ratio replaces the benchmark \(SR_p^*\) with a multiple-testing threshold that reflects the expected best result among many trials and adjusts for non-normality.

This is the finance-specific version of the same arithmetic. If a Sharpe ratio is selected from many candidates, the relevant benchmark is not zero. It is the performance level that selection alone could plausibly produce.
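The stylized PSR statistic above translates directly into code. The sketch below is an illustration of that formula only; the function name, argument names, and the numerical inputs are hypothetical, and the defaults correspond to Gaussian returns (zero skewness, raw kurtosis of 3).

```python
import math


def probabilistic_sharpe_ratio(sr_hat: float, sr_star: float, n_obs: int,
                               skew: float = 0.0, kurt: float = 3.0) -> float:
    """PSR(sr_star) on the per-period scale; `kurt` is raw (not excess) kurtosis."""
    radicand = 1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2
    if radicand <= 0.0:
        raise ValueError("variance term is not positive for these moments")
    z = (sr_hat - sr_star) * math.sqrt(n_obs - 1) / math.sqrt(radicand)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z


# Hypothetical inputs: per-period Sharpe of 0.10 over 253 observations.
for benchmark in (0.0, 0.05, 0.10):
    psr = probabilistic_sharpe_ratio(0.10, benchmark, 253)
    print(f"benchmark={benchmark:.2f}  PSR={psr:.4f}")
```

Raising the benchmark from zero toward the observed estimate lowers the statistic toward one half, which is the mechanism by which a selection-aware benchmark deflates an attractive Sharpe ratio.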

The full deflated Sharpe ratio requires assumptions about the number of trials, their dependence, the estimation variance of Sharpe ratios, skewness, and kurtosis. The version below is deliberately not presented as the full production DSR formula. It is a PSR-style Gaussian-normal specialization that exposes the arithmetic of selection while avoiding any Atamus-specific assumptions. It should be read as an analytical approximation based on the probabilistic Sharpe framework, not as the exact finite-sample distribution of the sample Sharpe ratio. For transparency, Figure 5 rests on the following assumptions:

  1. returns are IID Gaussian;
  2. annualization is \(A=252\);
  3. trial Sharpe estimates are independent under the complete null;
  4. the multiple-testing benchmark is the exact expected maximum of \(N\) independent standard-normal trial statistics;
  5. the target confidence is \(95\%\).

Let \(S_A\) denote the observed annualized Sharpe ratio and let \(Y\) be the number of years. The periodic Sharpe estimate is

\[ \widehat s=\frac{S_A}{\sqrt A}, \]

and the selected-null periodic benchmark is

\[ s_N^*(Y)=\frac{\mathbb E[M_N]}{\sqrt{AY-1}},\qquad Y>1/A. \]

The adjusted probability that the observed result exceeds the selected-null benchmark is

\[ \operatorname{PSR}_N(S_A,Y)=\Phi\left(\frac{\frac{S_A}{\sqrt A}\sqrt{AY-1}-\mathbb E[M_N]}{\sqrt{1+\frac{S_A^2}{2A}}}\right),\qquad Y>1/A. \]

Define the minimum track-record length as

\[ \operatorname{MinTRL}(S_A,N,p)=\inf\{Y>1/A:\operatorname{PSR}_N(S_A,Y)\geq p\}. \]

In this PSR-style Gaussian-normal specialization, for \(S_A>0\) and \(p>1/2\), with \(z_p=\Phi^{-1}(p)\), the infimum has the following closed form:

\[ \operatorname{MinTRL}(S_A,N,p)=\frac{1}{A}+\left(\frac{\mathbb E[M_N]+z_p\sqrt{1+\frac{S_A^2}{2A}}}{S_A}\right)^2,\qquad S_A>0. \]

For an observed annualized Sharpe of \(S_A=1.50\) and \(p=0.95\), the Gaussian benchmark calculation gives:

\[ \begin{array}{c|c} N & \operatorname{MinTRL}(1.50,N,0.95) \\ \hline 1 & 1.2118\text{ years} \\ 10 & 4.5190\text{ years} \\ 100 & 7.6810\text{ years} \\ 1000 & 10.6314\text{ years} \end{array} \]

The interpretation is direct. The more opportunities a research process has to select an attractive Sharpe, the more history is required to separate skill from selection under the stated assumptions.

This is not an Atamus Capital acceptance rule. It is not a recommended investment threshold. It is not an estimate of any Atamus strategy. It is a public mathematical illustration of why track record length and research breadth cannot be separated.
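The closed form can be evaluated with the Python standard library. The sketch below implements only the PSR-style Gaussian specialization stated above, with \(A=252\) and the expected maxima quoted in Section 6 supplied as inputs. The function and argument names are illustrative.

```python
import math
from statistics import NormalDist


def min_track_record_years(sharpe_annual: float, expected_max: float,
                           confidence: float = 0.95,
                           periods_per_year: int = 252) -> float:
    """Closed-form MinTRL in years; requires sharpe_annual > 0 and confidence > 1/2."""
    z_p = NormalDist().inv_cdf(confidence)
    scale = math.sqrt(1.0 + sharpe_annual ** 2 / (2.0 * periods_per_year))
    return 1.0 / periods_per_year + ((expected_max + z_p * scale) / sharpe_annual) ** 2


# E[M_N] under independence, as quoted in Section 6 (E[M_1] = 0).
EXPECTED_MAX = {1: 0.0, 10: 1.538753, 100: 2.507594, 1000: 3.241436}

for n_trials, e_max in EXPECTED_MAX.items():
    years = min_track_record_years(1.50, e_max)
    print(f"N={n_trials:5d}  MinTRL={years:.4f} years")
```

Because the required history scales with the square of \(\mathbb E[M_N]+z_p\sqrt{1+S_A^2/(2A)}\) divided by \(S_A\), halving the observed Sharpe ratio roughly quadruples the track record needed at a fixed search breadth.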

Figure 5. Minimum track-record length after selection: more trials require more history to separate skill from selection. The chart reports the years required for a selected annualized Sharpe ratio to reach 95 percent adjusted probabilistic confidence under a disclosed PSR-style IID Gaussian-normal specialization. This is an analytical approximation, not an exact finite-sample law and not an Atamus acceptance rule. No market or strategy data is used.

\[ \begin{array}{c|c|c}
\text{Independent trials} & \text{Annualized Sharpe} & \text{Minimum years} \\ \hline
1 & 1.00 & 2.7149 \\
1 & 1.50 & 1.2118 \\
1 & 2.00 & 0.6857 \\
10 & 1.00 & 10.1497 \\
10 & 1.50 & 4.5190 \\
10 & 2.00 & 2.5482 \\
100 & 1.00 & 17.2603 \\
100 & 1.50 & 7.6810 \\
100 & 2.00 & 4.3282 \\
1000 & 1.00 & 23.8957 \\
1000 & 1.50 & 10.6314 \\
1000 & 2.00 & 5.9889
\end{array} \]

8. A controlled FDR experiment

To make the distinction between FWER and FDR concrete, we run a reproducible controlled experiment under stated assumptions. There is no market data and no Atamus data.

Each replication contains

\[ m=500 \]

hypotheses. Of these,

\[ m_0=450 \]

are true nulls with p-values distributed as

\[ P_i\sim \operatorname{Uniform}(0,1), \]

and

\[ m_1=50 \]

are alternatives with p-values distributed as

\[ P_i\sim \operatorname{Beta}(0.25,1). \]

The alternatives are deliberately stylized. They create a population with real effects, but the p-values remain model-implied. The experiment uses 250,000 independent replications with seed 20260627.

At \(q=10\%\), the independent BH theoretical upper bound is

\[ \frac{m_0}{m}q=0.90\times 10\%=9.0000\%. \]

The Monte Carlo estimate is

\[ \widehat{\operatorname{FDR}}=8.9790\%, \]

with Monte Carlo standard error

\[ 0.0171\text{ percentage points}. \]

The same simulation estimates expected discoveries as

\[ \mathbb E[R]\approx 12.5659, \]

with expected false discoveries

\[ \mathbb E[V]\approx 1.2561, \]

and expected genuine discoveries

\[ \mathbb E[G]\approx 11.3099. \]

Power, defined here as expected genuine discoveries divided by 50 alternatives, is

\[ 22.6197\%. \]

Figure 3. FDR and power in a controlled research family: controlling expected false fraction while preserving discovery power. The simulation contains 450 true nulls and 50 stylized alternatives across 250,000 replications. The BH procedure controls the expected false discovery proportion while power rises with \(q\). No market or strategy data is used.

\[ \begin{array}{c|c|c|c|c|c|c}
q & \text{Monte Carlo FDR} & \text{Theoretical bound} & \text{Power} & \text{Expected discoveries} & \text{Expected false discoveries} & \text{Expected true discoveries} \\ \hline
1\% & 0.8947\% & 0.9000\% & 10.4412\% & 5.2796 & 0.0590 & 5.2206 \\
2\% & 1.7857\% & 1.8000\% & 13.0832\% & 6.6851 & 0.1435 & 6.5416 \\
3\% & 2.6838\% & 2.7000\% & 14.9521\% & 7.7194 & 0.2433 & 7.4760 \\
4\% & 3.5782\% & 3.6000\% & 16.4629\% & 8.5873 & 0.3559 & 8.2315 \\
5\% & 4.4787\% & 4.5000\% & 17.7542\% & 9.3578 & 0.4807 & 8.8771 \\
6\% & 5.3926\% & 5.4000\% & 18.9044\% & 10.0696 & 0.6174 & 9.4522 \\
7\% & 6.2868\% & 6.3000\% & 19.9380\% & 10.7310 & 0.7619 & 9.9690 \\
8\% & 7.1897\% & 7.2000\% & 20.9015\% & 11.3690 & 0.9182 & 10.4507 \\
9\% & 8.0824\% & 8.1000\% & 21.7869\% & 11.9756 & 1.0821 & 10.8935 \\
10\% & 8.9790\% & 9.0000\% & 22.6197\% & 12.5659 & 1.2561 & 11.3099 \\
11\% & 9.8719\% & 9.9000\% & 23.4148\% & 13.1476 & 1.4402 & 11.7074 \\
12\% & 10.7660\% & 10.8000\% & 24.1770\% & 13.7227 & 1.6342 & 12.0885 \\
13\% & 11.6545\% & 11.7000\% & 24.9036\% & 14.2885 & 1.8366 & 12.4518 \\
14\% & 12.5577\% & 12.6000\% & 25.6035\% & 14.8539 & 2.0521 & 12.8017 \\
15\% & 13.4562\% & 13.5000\% & 26.2818\% & 15.4182 & 2.2773 & 13.1409 \\
16\% & 14.3649\% & 14.4000\% & 26.9353\% & 15.9815 & 2.5138 & 13.4677 \\
17\% & 15.2586\% & 15.3000\% & 27.5748\% & 16.5468 & 2.7593 & 13.7874 \\
18\% & 16.1506\% & 16.2000\% & 28.1956\% & 17.1116 & 3.0138 & 14.0978 \\
19\% & 17.0415\% & 17.1000\% & 28.8021\% & 17.6806 & 3.2795 & 14.4011 \\
20\% & 17.9401\% & 18.0000\% & 29.3993\% & 18.2579 & 3.5583 & 14.6996
\end{array} \]

This is not a claim about markets. It is a controlled experiment showing that FDR control is a statement about the expected composition of discoveries, not a guarantee about every realized discovery set.

For the same \(q=10\%\) experiment, the median realized FDP is

\[ 8.3333\%, \]

while the 95th percentile is

\[ 25.0000\%. \]

The probability that a realized batch has

\[ \operatorname{FDP}>q \]

is

\[ 39.8344\%. \]

This does not contradict FDR control. It explains it. FDR is an expectation. A realized discovery set can be worse than the target while the procedure remains correct in expectation.
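A reduced version of the experiment can be sketched with the Python standard library. The sketch assumes the same probability laws (450 uniform nulls, 50 alternatives drawn from \(\operatorname{Beta}(0.25,1)\), BH at \(q=10\%\)) but uses far fewer replications, a different random number generator, and an arbitrary seed. Its estimates will therefore be close to, but not digit-for-digit equal to, the figures reported in this section. All function names are illustrative.

```python
import random
from typing import Dict, List, Sequence


def bh_rejections(p_values: Sequence[float], q: float) -> List[int]:
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    for rank in range(m, 0, -1):  # largest rank satisfying P_(k) <= k q / m
        if p_values[order[rank - 1]] <= rank * q / m:
            return order[:rank]
    return []


def run_experiment(reps: int = 4000, m0: int = 450, m1: int = 50,
                   q: float = 0.10, seed: int = 0) -> Dict[str, float]:
    """Monte Carlo estimates of FDR, power, and P(FDP > q) under independence."""
    rng = random.Random(seed)
    fdp_sum = 0.0
    genuine_sum = 0
    exceed_count = 0
    for _ in range(reps):
        # Indices 0..m0-1 are true nulls; the remaining m1 are alternatives.
        p = [rng.random() for _ in range(m0)]
        p += [rng.betavariate(0.25, 1.0) for _ in range(m1)]
        rejected = bh_rejections(p, q)
        num_false = sum(1 for i in rejected if i < m0)
        fdp = num_false / max(len(rejected), 1)
        fdp_sum += fdp
        genuine_sum += len(rejected) - num_false
        exceed_count += fdp > q
    return {
        "fdr": fdp_sum / reps,
        "power": genuine_sum / (reps * m1),
        "prob_fdp_above_q": exceed_count / reps,
    }


result = run_experiment()
for name, value in result.items():
    print(f"{name:18s} {value:.4f}")
```

The average FDP should sit near the \((m_0/m)q=9\%\) bound, while a substantial share of individual replications exceeds \(q\). That is the distinction between an expectation and a pathwise guarantee.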

Figure 6. Realized false discovery proportions. FDR is an expectation, not a guarantee that every realized discovery batch has a false fraction below \(q\). The histogram is generated from the same controlled BH experiment. No market or strategy data is used.

\[ \begin{array}{l|c}
\text{Metric} & \text{Value} \\ \hline
\text{Mean FDP} & 8.9790\% \\
\text{Median FDP} & 8.3333\% \\
\text{90th percentile FDP} & 20.0000\% \\
\text{95th percentile FDP} & 25.0000\% \\
\text{99th percentile FDP} & 33.3333\% \\
\text{Probability FDP} > q & 39.8344\% \\
\text{Probability FDP} > 2q & 9.7080\% \\
\text{Probability of no false discoveries} & 34.70\%
\end{array} \]

9. What the arithmetic prevents

Multiple-discovery arithmetic prevents a common error: treating the final survivor of a search as if it had been specified in advance.

A selected result is conditioned on the family that produced it:

\[ \widehat\theta_{\text{selected}}=\widehat\theta_J,\qquad J=\arg\max_{1\leq j\leq m}T_j. \]

Even if

\[ \theta_1=\theta_2=\cdots=\theta_m=0, \]

under a nondegenerate symmetric null family with more than one effectively independent candidate, we generally have

\[ \mathbb E[T_J]>0. \]

The result is not fraud. It is arithmetic. Searching changes the distribution of the selected statistic.

This is why Atamus treats research output as a conditional object. A result is conditioned on data, assumptions, model specification, dependence, search breadth, selection rule, implementation model, and risk constraints. Multiple discovery is the part of that problem concerned with the number and structure of claims examined before the surviving claim is named.

10. The institutional standard

The institutional standard is not to avoid research breadth. Serious quantitative research must examine alternatives, challenge assumptions, and discard weak claims. The standard is to account for that breadth.

A research process that tests one hypothesis at 5 percent is not equivalent to a process that tests 500 hypotheses at 5 percent and publishes the most attractive survivor. The latter must answer additional questions:

  1. What is the research family?
  2. Which hypotheses were tested?
  3. How dependent are they?
  4. What error criterion is being controlled?
  5. Is the target FWER, FDR, a deflated performance statistic, or another quantity?
  6. How much track record is required after accounting for selection?
  7. Which results remain credible under genuinely unseen data?

These questions do not reveal the source of an investment edge. They define the minimum public language of serious evidence.

11. Conclusion

A discovery is not just a small p-value. A selected Sharpe ratio is not just a Sharpe ratio. Both are outputs of a research family.

When the number of hypotheses grows, the probability of false discovery grows. Bonferroni and Sidak control the probability of at least one false discovery. Benjamini-Hochberg controls the expected false fraction among discoveries under its assumptions. Dependence changes the required correction. The expected maximum of a research family rises with search breadth. In finance, the same issue appears through selected Sharpe ratios, the deflated Sharpe ratio, and minimum track-record length.

The arithmetic is not optional. Without it, research can mistake selection for evidence. With it, a result must earn its status after the search that produced it has been brought back into the probability model.

Atamus does not publish this framework to disclose strategy mechanics. We publish it because any serious systematic research organization must understand the mathematics of discovery before it can speak responsibly about evidence.

References

[1] Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.

[2] Benjamini, Y., and Yekutieli, D. (2001). The Control of the False Discovery Rate in Multiple Testing Under Dependency. Annals of Statistics, 29(4), 1165-1188.

[3] Dunn, O. J. (1961). Multiple Comparisons Among Means. Journal of the American Statistical Association, 56(293), 52-64.

[4] Sidak, Z. (1967). Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. Journal of the American Statistical Association, 62(318), 626-633.

[5] White, H. (2000). A Reality Check for Data Snooping. Econometrica, 68(5), 1097-1126.

[6] Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business and Economic Statistics, 23(4), 365-380.

[7] Bailey, D. H., and Lopez de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality. Journal of Portfolio Management, 40(5), 94-107.

[8] Lo, A. W. (2002). The Statistics of Sharpe Ratios. Financial Analysts Journal, 58(4), 36-52.

[9] Harvey, C. R., Liu, Y., and Zhu, H. (2016). ... and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1), 5-68.

[10] Storey, J. D. (2002). A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society: Series B, 64(3), 479-498.

Disclaimer

Research notes published by Atamus Capital are provided for general informational and research purposes only. They do not constitute investment advice, trading advice, a recommendation, an offer to sell, or a solicitation to buy any security, fund interest, account, or investment product.

This note does not disclose Atamus Capital's proprietary strategies, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, portfolio construction methods, execution processes, investment universe, research thresholds, model-development workflow, or investment decisions.