A discovery is not only an observation. It is an observation after a selection rule has acted on a research family.
Atamus Capital does not publish proprietary strategy rules, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, investment universes, portfolio construction methods, execution processes, trade-level information, position-level information, or performance results. This note examines the mathematics of multiple discovery through closed-form calculations, deterministic numerical experiments, and controlled Monte Carlo experiments under fully specified probability laws. No internal hypothesis library, research pipeline, trial count, acceptance threshold, scoring rule, model-selection process, validation sequence, holding period, implementation assumption, or strategy-development workflow is disclosed. The statistical procedures discussed below are public tools for reasoning about multiplicity, false discovery, and selection-adjusted evidence. They are not a description of Atamus Capital’s internal research architecture.
A single statistical test asks whether one observed result is surprising under one specified null. A research program asks something different. It asks which results remain credible after many related hypotheses have been considered, ranked, filtered, and interpreted. The arithmetic changes. A 5 percent test is not a 5 percent research process when it is repeated across dozens or hundreds of candidate effects. This note develops the mathematical foundations of multiple discovery for systematic investment research. We distinguish family-wise error control from false discovery rate control, derive the Bonferroni and Sidak corrections, state the Benjamini-Hochberg step-up procedure, examine the effect of dependence on the expected maximum of correlated test statistics, and connect the same logic to the deflated Sharpe ratio and minimum track-record length. The numerical results are analytical or model-based. They are included to study inference under disclosed assumptions, not to describe any Atamus Capital strategy.
1. A discovery is not a single event
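Suppose a research process examines a family of hypotheses
\[ H_1, H_2, \ldots, H_m \]
with corresponding p-values
\[ P_1, P_2, \ldots, P_m. \]
For each hypothesis, there is an unobserved truth state. Some null hypotheses are true and some are false. Let
\[ \mathcal{H}_0 \subseteq \{1,\ldots,m\}, \qquad m_0 = |\mathcal{H}_0| \]
denote the index set of true nulls and its size.
A multiple-testing rule produces a rejection set
\[ \mathcal{R} \subseteq \{1,\ldots,m\}. \]
Let
\[ R = |\mathcal{R}| \]
be the number of discoveries, and let
\[ V = |\mathcal{R} \cap \mathcal{H}_0| \]
be the number of false discoveries. The number of genuine discoveries is
\[ S = R - V. \]
The single-test question is usually framed as
\[ \Pr(P_i \le \alpha \mid H_i \text{ true}) \le \alpha. \]
The research question is not the same. Once many hypotheses have been examined, we must decide which error quantity is being controlled. Two quantities dominate the discussion:
\[ \mathrm{FWER} = \Pr(V \ge 1) \]
and
\[ \mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R,1)}\right]. \]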
FWER asks whether the research process makes at least one false discovery. FDR asks what fraction of published discoveries is false on average. These are different questions. The first is useful when a single false claim is intolerable. The second is often more relevant when research intentionally screens a large family of potential effects.
Atamus distinguishes the historical presence of a result from the evidentiary status of a result. A discovery is not only an observation. It is an observation after a selection rule has acted on a research family.
2. The family-wise arithmetic
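Assume first that all null hypotheses are true, all p-values are independent, and each p-value is exactly uniform on \([0,1]\) under the null. If every hypothesis is tested at level \(\alpha\), the probability of at least one false discovery is
\[ \mathrm{FWER} = 1 - (1-\alpha)^m. \]
At \(\alpha=0.05\),
\[ m=20: \quad 1 - 0.95^{20} \approx 0.6415 \]
and
\[ m=100: \quad 1 - 0.95^{100} \approx 0.9941. \]
With 500 independent null tests, the same calculation gives
\[ 1 - 0.95^{500} \approx 1 - 7.3\times 10^{-12}. \]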
This does not say that real research consists of independent null hypotheses. It says something simpler and more damaging to naive interpretation: ordinary single-test significance does not scale to a research family.
| Alpha | m | Unadjusted FWER | Bonferroni FWER | Sidak FWER | Bonferroni threshold | Sidak threshold |
|---|---|---|---|---|---|---|
| 1% | 1 | 1.0000% | 1.0000% | 1.0000% | 0.010000 | 0.010000 |
| 1% | 20 | 18.2093% | 0.9953% | 1.0000% | 0.000500 | 0.000502 |
| 1% | 100 | 63.3968% | 0.9951% | 1.0000% | 0.000100 | 0.000100 |
| 1% | 500 | 99.3430% | 0.9950% | 1.0000% | 0.000020 | 0.000020 |
| 1% | 1000 | 99.9957% | 0.9950% | 1.0000% | 0.000010 | 0.000010 |
| 5% | 1 | 5.0000% | 5.0000% | 5.0000% | 0.050000 | 0.050000 |
| 5% | 20 | 64.1514% | 4.8830% | 5.0000% | 0.002500 | 0.002561 |
| 5% | 100 | 99.4079% | 4.8782% | 5.0000% | 0.000500 | 0.000513 |
| 5% | 500 | 100.0000% | 4.8773% | 5.0000% | 0.000100 | 0.000103 |
| 5% | 1000 | 100.0000% | 4.8772% | 5.0000% | 0.000050 | 0.000051 |
| 10% | 1 | 10.0000% | 10.0000% | 10.0000% | 0.100000 | 0.100000 |
| 10% | 20 | 87.8423% | 9.5390% | 10.0000% | 0.005000 | 0.005254 |
| 10% | 100 | 99.9973% | 9.5208% | 10.0000% | 0.001000 | 0.001053 |
| 10% | 500 | 100.0000% | 9.5172% | 10.0000% | 0.000200 | 0.000211 |
| 10% | 1000 | 100.0000% | 9.5167% | 10.0000% | 0.000100 | 0.000105 |
Bonferroni control
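The Bonferroni correction follows from the union bound. For any events \(A_1,\ldots,A_m\),
\[ \Pr\!\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} \Pr(A_i). \]
Let \(A_i\) be the event that the \(i\)th true null is rejected, for \(i=1,\ldots,m_0\). Testing each hypothesis at level \(\alpha/m\) gives
\[ \mathrm{FWER} = \Pr\!\left(\bigcup_{i=1}^{m_0} A_i\right) \le \sum_{i=1}^{m_0} \Pr(A_i) \le m_0\,\frac{\alpha}{m} \le \alpha. \]
Bonferroni does not require independence. That is its strength. Its cost is conservatism, especially when hypotheses are dependent or when the research family is large.
For \(m=100\) and \(\alpha=0.05\), the Bonferroni per-test threshold is
\[ \frac{\alpha}{m} = \frac{0.05}{100} = 0.0005. \]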
Sidak control
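Under independence, the Sidak threshold is slightly less conservative. Choose a per-test threshold \(\alpha_S\) such that
\[ 1 - (1-\alpha_S)^m = \alpha. \]
Solving gives
\[ \alpha_S = 1 - (1-\alpha)^{1/m}. \]
For \(m=100\) and \(\alpha=0.05\),
\[ \alpha_S = 1 - 0.95^{1/100} \approx 0.000513. \]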
The difference from Bonferroni is small at conventional levels, but the conceptual difference matters. Bonferroni is a bound valid under arbitrary dependence. Sidak is exact under the independent complete-null model and conservative when fewer than m null hypotheses are true, provided the relevant null p-values are independent.
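The family-wise arithmetic is easy to reproduce. The sketch below is a minimal illustration in standard-library Python, assuming independent, exactly uniform null p-values under the complete null; the function names are illustrative and are not taken from any library.

```python
def unadjusted_fwer(alpha: float, m: int) -> float:
    """P(at least one false rejection) when m independent true nulls
    are each tested at per-test level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test level from the union bound; valid under any dependence."""
    return alpha / m


def sidak_threshold(alpha: float, m: int) -> float:
    """Per-test level that makes the FWER exactly alpha under independence."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


alpha = 0.05
for m in (1, 20, 100, 500, 1000):
    bonf = bonferroni_threshold(alpha, m)
    sidak = sidak_threshold(alpha, m)
    print(
        f"m={m:5d}  unadjusted={unadjusted_fwer(alpha, m):.4%}  "
        f"bonferroni_fwer={unadjusted_fwer(bonf, m):.4%}  "
        f"sidak_fwer={unadjusted_fwer(sidak, m):.4%}  "
        f"bonf_thr={bonf:.6f}  sidak_thr={sidak:.6f}"
    )
```

Feeding an adjusted threshold back into `unadjusted_fwer` gives the realized family-wise rate of that correction under the independent complete null, which is how the Bonferroni and Sidak FWER columns of the table can be checked.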
3. Why family-wise control is not the whole problem
FWER control is severe. That severity is appropriate when a research process must avoid even one false positive. But systematic research often has a different public question. If a broad search identifies a set of candidate effects, we may care less about the probability that one of them is false and more about the expected false fraction among the selected effects.
The realized false discovery proportion is
\[ \mathrm{FDP} = \frac{V}{\max(R,1)}. \]
Its expectation is the false discovery rate:
\[ \mathrm{FDR} = \mathbb{E}[\mathrm{FDP}]. \]
The distinction is critical. FDR does not guarantee that every realized research batch has a small false fraction. It controls the expected fraction across repeated applications of the procedure. A realized batch can have an FDP above the target even when the procedure controls FDR exactly.
This is why the error criterion must be named. A p-value alone does not say whether selection has been controlled. A list of p-values does not say whether discoveries have been controlled. The research process must define the family, the nulls, the rule, and the error rate.
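The realized false fraction of a single batch can be computed directly once the rejection set and the truth state are both known, which is only possible in a controlled experiment. The helper below is a hypothetical illustration, not part of any library; it uses the convention that an empty rejection set has a false fraction of zero.

```python
def false_discovery_proportion(rejected, true_nulls) -> float:
    """Realized FDP = V / max(R, 1).

    rejected   : iterable of hypothesis indices rejected by some rule
    true_nulls : iterable of indices whose null hypothesis is actually true
    """
    rejected = set(rejected)
    true_nulls = set(true_nulls)
    r = len(rejected)               # R: number of discoveries
    v = len(rejected & true_nulls)  # V: number of false discoveries
    return v / max(r, 1)


# Nine discoveries (indices 0..8), one of which (index 8) is a true null.
print(false_discovery_proportion(range(9), [8, 20, 21]))
# No discoveries at all: the proportion is defined as zero.
print(false_discovery_proportion([], [0, 1]))
```

The FDR is the average of this quantity over repeated batches, so a single call says nothing about whether a procedure controls FDR.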
4. The Benjamini-Hochberg procedure
Let the p-values be ordered as
\[ P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}. \]
For a target FDR level \(q\in(0,1)\), the Benjamini-Hochberg procedure defines
\[ k^* = \max\left\{ k \in \{1,\ldots,m\} : P_{(k)} \le \frac{k}{m}\,q \right\}. \]
If the set is nonempty, reject all hypotheses corresponding to
\[ P_{(1)}, \ldots, P_{(k^*)}. \]
If no such \(k\) exists, reject none.
The procedure is step-up. It does not compare every p-value to one fixed threshold. It compares the \(k\)th smallest p-value to a threshold that grows linearly with \(k\):
\[ P_{(k)} \le \frac{k}{m}\,q. \]
Under independence of the full p-value vector, with valid null p-values, the procedure controls
\[ \mathrm{FDR} \le \frac{m_0}{m}\,q \le q. \]
The ratio \(m_0/m\) appears because false discoveries can only arise from true null hypotheses. If all hypotheses are null, then \(m_0=m\) and the upper bound is \(q\). If some hypotheses are genuinely non-null, the bound tightens to \((m_0/m)q\).
A useful proof sketch is to condition on the p-values other than a particular true null p-value \(P_i\). Under independence, \(P_i\) remains uniform and independent of the selection threshold induced by the other p-values. More formally, one sums the contribution of each true null over possible rejection counts while preserving the step-up self-consistency condition. This yields
\[ \mathrm{FDR} = \sum_{i\in\mathcal{H}_0} \mathbb{E}\!\left[\frac{\mathbf{1}\{H_i \text{ rejected}\}}{\max(R,1)}\right] \le \sum_{i\in\mathcal{H}_0} \frac{q}{m} = \frac{m_0}{m}\,q. \]
The important message is not that BH is universally optimal. It is that discovery control can be formalized. A research family can be studied as a stochastic object rather than as a collection of persuasive anecdotes.
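The step-up rule can be sketched in a few lines. The function below is a minimal standard-library illustration of the Benjamini-Hochberg procedure as stated in this section; the function name and the ten example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q):
    """Return the sorted indices rejected by the BH step-up rule at level q."""
    m = len(p_values)
    # Indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: remember the LARGEST rank whose p-value is under the line.
        if p_values[idx] <= rank * q / m:
            k_star = rank
    # Reject every hypothesis ranked at or below k_star.
    return sorted(order[:k_star])


p = [0.350, 0.013, 0.810, 0.010, 0.190, 0.014, 0.500, 0.630, 0.670, 0.750]
print(benjamini_hochberg(p, q=0.05))  # [1, 3, 5]
```

In this example the two smallest p-values (0.010 and 0.013) sit above their own boundaries of 0.005 and 0.010, yet both are rejected because the third smallest (0.014) falls under its boundary of 0.015. That is the step-up behavior: the rule looks for the largest rank under the line, not the first failure. A Bonferroni threshold of 0.005 would reject nothing here.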
| q | k* | Largest rejected p | BH boundary | True discoveries | False discoveries | Realized FDP |
|---|---|---|---|---|---|---|
| 5% | 9 | 0.000891 | 0.000900 | 8 | 1 | 11.11% |
| 10% | 11 | 0.001338 | 0.002200 | 10 | 1 | 9.09% |
| 20% | 13 | 0.004327 | 0.005200 | 12 | 1 | 7.69% |
5. Dependence changes the arithmetic
Financial research rarely produces independent hypotheses. Candidate effects may share data, instruments, transformations, regimes, risk premia, calendar structure, execution assumptions, or economic mechanisms. Dependence does not invalidate multiple-testing mathematics. It changes the assumptions under which a particular correction is valid.
The ordinary BH procedure also controls FDR under certain positive-dependence conditions, usually stated through positive regression dependence on the subset of true nulls. When dependence is unknown or arbitrary, Benjamini and Yekutieli introduced a conservative modification. Let
\[ c(m) = \sum_{j=1}^{m} \frac{1}{j}. \]
The adjusted step-up boundary is
\[ P_{(k)} \le \frac{k}{m\,c(m)}\,q. \]
Because
\[ c(m) = \ln m + \gamma + \frac{1}{2m} + O(m^{-2}), \]
where \(\gamma\) is the Euler-Mascheroni constant, this correction becomes materially more conservative as the research family grows. For \(m=500\),
\[ c(500) \approx 6.7928. \]
Thus a nominal \(q=10\%\) arbitrary-dependence adjustment uses an effective level of approximately
\[ \frac{q}{c(500)} \approx \frac{0.10}{6.7928} \approx 1.47\% \]
inside the BH line.
This conservatism is not a flaw. It is the price of making fewer assumptions about dependence. In practice, dependence structure is not a detail. It is part of the evidence. A research note that names only the number of tests, but not their dependence structure, has not fully specified the probability problem.
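The size of the arbitrary-dependence penalty is a one-line computation. The sketch below, with illustrative function names, evaluates the harmonic sum \(c(m)=\sum_{j=1}^{m} 1/j\) exactly and shows the level that the Benjamini-Yekutieli adjustment effectively feeds into the BH line.

```python
import math


def harmonic_number(m: int) -> float:
    """c(m) = 1 + 1/2 + ... + 1/m, summed with compensated accuracy."""
    return math.fsum(1.0 / j for j in range(1, m + 1))


def by_boundary(k: int, m: int, q: float) -> float:
    """Benjamini-Yekutieli boundary for the k-th smallest p-value."""
    return k * q / (m * harmonic_number(m))


m, q = 500, 0.10
c_m = harmonic_number(m)
print(f"c({m}) = {c_m:.6f}")
print(f"effective BH level q / c(m) = {q / c_m:.4%}")
print(f"asymptotic check ln(m) + gamma + 1/(2m) = "
      f"{math.log(m) + 0.5772156649 + 1 / (2 * m):.6f}")
```

Because \(c(m)\) grows like \(\ln m\), the penalty is mild per doubling of the family but never stops growing.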
6. The expected maximum of correlated test statistics
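Selection bias can be written as an extreme-value problem. Suppose a family of test statistics satisfies
\[ Z = (Z_1,\ldots,Z_m) \sim N_m(0,\Sigma) \]
under the complete null, where \(\Sigma\) has unit diagonal. The maximum statistic is
\[ M_m = \max_{1\le i\le m} Z_i. \]
Its distribution is
\[ \Pr(M_m \le x) = \Phi_m(x,\ldots,x;\Sigma), \]
where \(\Phi_m(\cdot;\Sigma)\) is the \(m\)-variate Gaussian distribution function.
In the independent standard-normal case,
\[ \Pr(M_m \le x) = \Phi(x)^m, \qquad \mathbb{E}[M_m] = \int_{-\infty}^{\infty} x\,m\,\phi(x)\,\Phi(x)^{m-1}\,dx. \]
An equivalent numerically stable identity is
\[ \mathbb{E}[M_m] = \int_{0}^{\infty} \left[\,1 - \Phi(x)^m - \Phi(-x)^m\,\right] dx. \]
For selected values,
\[ \mathbb{E}[M_{10}] \approx 1.538753, \qquad \mathbb{E}[M_{100}] \approx 2.507594, \qquad \mathbb{E}[M_{1000}] \approx 3.241436. \]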
These are expected maxima under the complete null. They are not expected skill. They are what selection can manufacture from noise.
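For an equicorrelated Gaussian family with common correlation \(\rho\in[0,1)\),
\[ Z_i = \sqrt{\rho}\,W + \sqrt{1-\rho}\,\varepsilon_i, \qquad i=1,\ldots,m, \]
where
\[ W, \varepsilon_1, \ldots, \varepsilon_m \overset{\text{iid}}{\sim} N(0,1). \]
Then
\[ M_m = \sqrt{\rho}\,W + \sqrt{1-\rho}\,\max_{1\le i\le m}\varepsilon_i. \]
Taking expectations gives
\[ \mathbb{E}[M_m(\rho)] = \sqrt{1-\rho}\;\mathbb{E}[M_m(0)]. \]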
This simple case shows why dependence matters. Positive common dependence reduces the expected maximum relative to independence, but it does not make selection disappear unless \(\rho=1\). The effective breadth of a research search is therefore neither the raw count of trials nor zero. It is a function of the dependence structure.
| m | rho | Expected maximum |
|---|---|---|
| 10 | 0.00 | 1.538753 |
| 100 | 0.00 | 2.507594 |
| 1000 | 0.00 | 3.241436 |
| 10 | 0.25 | 1.332599 |
| 100 | 0.25 | 2.171640 |
| 1000 | 0.25 | 2.807166 |
| 10 | 0.50 | 1.088062 |
| 100 | 0.50 | 1.773136 |
| 1000 | 0.50 | 2.292041 |
| 10 | 0.75 | 0.769376 |
| 100 | 0.75 | 1.253797 |
| 1000 | 0.75 | 1.620718 |
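The tabulated values can be checked with elementary quadrature. The sketch below is one possible way to do it, not necessarily how the table was produced: it assumes the tail identity \(\mathbb{E}[M_m]=\int_0^\infty [1-\Phi(x)^m-\Phi(-x)^m]\,dx\) for independent standard normals, truncates the integral at 10, applies a composite trapezoid rule, and then uses the \(\sqrt{1-\rho}\) scaling for the equicorrelated case. Function names, the truncation point, and the step count are illustrative choices.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal distribution function


def expected_max_independent(m: int, upper: float = 10.0, steps: int = 10000) -> float:
    """E[max of m iid N(0,1)] via the tail identity and a trapezoid rule."""
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        x = j * h
        # Integrand: P(M > x) - P(M <= -x), smooth on [0, upper].
        g = 1.0 - _PHI(x) ** m - _PHI(-x) ** m
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * g
    return total * h


def expected_max_equicorrelated(m: int, rho: float) -> float:
    """E[max] for an equicorrelated Gaussian family with 0 <= rho < 1."""
    return math.sqrt(1.0 - rho) * expected_max_independent(m)


for m in (10, 100, 1000):
    row = [f"{expected_max_equicorrelated(m, rho):.6f}" for rho in (0.0, 0.25, 0.5, 0.75)]
    print(m, row)
```

The common factor \(W\) shifts every statistic by the same amount, so it contributes nothing to the expected maximum; only the idiosyncratic share \(\sqrt{1-\rho}\) is left to be selected on.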
7. Sharpe ratios are also discoveries
In finance, selection often appears as a performance statistic rather than a p-value. A strategy with an attractive Sharpe ratio may be selected from a large research family. The selected Sharpe is not distributed like a pre-specified Sharpe.
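Let \(\widehat{SR}_p\) be an estimated per-period Sharpe ratio and let \(SR_p^*\) be a per-period benchmark Sharpe that the result must exceed. Under the probabilistic Sharpe ratio framework, a stylized finite-sample statistic can be written as
\[ \mathrm{PSR}(SR_p^*) = \Phi\!\left( \frac{(\widehat{SR}_p - SR_p^*)\sqrt{T-1}}{\sqrt{1 - \widehat\gamma_3\,\widehat{SR}_p + \frac{\widehat\gamma_4 - 1}{4}\,\widehat{SR}_p^{2}}} \right), \]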
where \(T\) is the sample length, \(\widehat\gamma_3\) is estimated skewness, and \(\widehat\gamma_4\) is estimated raw kurtosis, not excess kurtosis. The formula is written on the same periodic scale as the return observations. The deflated Sharpe ratio replaces the benchmark \(SR_p^*\) with a multiple-testing threshold that reflects the expected best result among many trials and adjusts for non-normality.
This is the finance-specific version of the same arithmetic. If a Sharpe ratio is selected from many candidates, the relevant benchmark is not zero. It is the performance level that selection alone could plausibly produce.
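As a small illustration, the probabilistic Sharpe ratio statistic can be evaluated directly. The sketch below assumes the standard published PSR form, with the denominator \(\sqrt{1-\gamma_3 SR + \tfrac{\gamma_4-1}{4}SR^2}\) and raw (not excess) kurtosis; the function name, argument names, and example inputs are illustrative.

```python
import math
from statistics import NormalDist


def probabilistic_sharpe_ratio(sr_hat: float, sr_benchmark: float, n_obs: float,
                               skew: float = 0.0, kurt: float = 3.0) -> float:
    """PSR on the per-period scale.

    sr_hat       : estimated per-period Sharpe ratio
    sr_benchmark : per-period benchmark Sharpe the estimate must exceed
    n_obs        : number of return observations T (must exceed 1)
    skew, kurt   : skewness and RAW kurtosis (Gaussian defaults 0 and 3)
    """
    variance_term = 1.0 - skew * sr_hat + 0.25 * (kurt - 1.0) * sr_hat ** 2
    z = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1.0) / math.sqrt(variance_term)
    return NormalDist().cdf(z)


# Illustrative inputs: annualized Sharpe 1.5 on daily data, one year of history.
sr_daily = 1.5 / math.sqrt(252)
print(probabilistic_sharpe_ratio(sr_daily, 0.0, 252))
# Same estimate with negative skew and fat tails: the confidence is lower.
print(probabilistic_sharpe_ratio(sr_daily, 0.0, 252, skew=-1.0, kurt=8.0))
```

Raising `sr_benchmark` above zero is the only change needed to move from a pre-specified test to a selection-aware one; the difficulty lies in choosing that benchmark, not in evaluating the formula.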
The full deflated Sharpe ratio requires assumptions about the number of trials, their dependence, the estimation variance of Sharpe ratios, skewness, and kurtosis. The version below is deliberately not presented as the full production DSR formula. It is a PSR-style Gaussian-normal specialization that exposes the arithmetic of selection while avoiding any Atamus-specific assumptions. It should be read as an analytical approximation based on the probabilistic Sharpe framework, not as the exact finite-sample distribution of the sample Sharpe ratio. For transparency, Figure 5 rests on the following assumptions:
- returns are IID Gaussian;
- annualization is \(A=252\);
- trial Sharpe estimates are independent under the complete null;
- the multiple-testing benchmark is the exact expected maximum of \(N\) independent standard-normal trial statistics;
- the target confidence is \(95\%\).
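Let \(S_A\) denote the observed annualized Sharpe ratio and let \(Y\) be the number of years, so that the sample length is \(T=AY\). The periodic Sharpe estimate is
\[ \widehat{SR}_p = \frac{S_A}{\sqrt{A}} \]
and the selected-null periodic benchmark is
\[ SR_p^*(N,T) = \frac{\mathbb{E}[M_N]}{\sqrt{T-1}}, \]
where \(\mathbb{E}[M_N]\) is the expected maximum of \(N\) independent standard-normal statistics.
The adjusted probability that the observed result exceeds the selected-null benchmark is
\[ \mathrm{PSR}_N(T) = \Phi\!\left( \frac{\widehat{SR}_p\sqrt{T-1} - \mathbb{E}[M_N]}{\sqrt{1 + \tfrac{1}{2}\widehat{SR}_p^{2}}} \right). \]
Define the minimum track-record length as
\[ Y_{\min} = \inf\left\{ Y > 1/A : \mathrm{PSR}_N(AY) \ge p \right\}. \]
In this PSR-style Gaussian-normal specialization, for \(S_A>0\) and \(p>1/2\), with \(z_p=\Phi^{-1}(p)\), the infimum has the following closed form:
\[ Y_{\min} = \frac{1}{A}\left[ 1 + \left( \frac{z_p\sqrt{1 + \tfrac{1}{2}\widehat{SR}_p^{2}} + \mathbb{E}[M_N]}{\widehat{SR}_p} \right)^{2} \right]. \]
For an observed annualized Sharpe of \(S_A=1.50\) and \(p=0.95\), the Gaussian benchmark calculation gives:
\[ N=1:\ 1.2118 \text{ years}, \qquad N=10:\ 4.5190 \text{ years}, \qquad N=100:\ 7.6810 \text{ years}, \qquad N=1000:\ 10.6314 \text{ years}. \]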
The interpretation is direct. The more opportunities a research process has to select an attractive Sharpe, the more history is required to separate skill from selection under the stated assumptions.
This is not an Atamus Capital acceptance rule. It is not a recommended investment threshold. It is not an estimate of any Atamus strategy. It is a public mathematical illustration of why track record length and research breadth cannot be separated.
| Independent trials | Annualized Sharpe | Minimum years |
|---|---|---|
| 1 | 1.00 | 2.7149 years |
| 1 | 1.50 | 1.2118 years |
| 1 | 2.00 | 0.6857 years |
| 10 | 1.00 | 10.1497 years |
| 10 | 1.50 | 4.5190 years |
| 10 | 2.00 | 2.5482 years |
| 100 | 1.00 | 17.2603 years |
| 100 | 1.50 | 7.6810 years |
| 100 | 2.00 | 4.3282 years |
| 1000 | 1.00 | 23.8957 years |
| 1000 | 1.50 | 10.6314 years |
| 1000 | 2.00 | 5.9889 years |
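The table can be reproduced with a short calculation. The sketch below rests on assumptions inferred from the stated setup rather than on a published formula: the selected-null benchmark is taken to be \(\mathbb{E}[M_N]/\sqrt{T-1}\), the Gaussian PSR variance term is \(1+\tfrac12\widehat{SR}_p^2\), and \(A=252\). Under those assumptions the minimum length solves \(\widehat{SR}_p\sqrt{T-1} = z_p\sqrt{1+\tfrac12\widehat{SR}_p^2} + \mathbb{E}[M_N]\). Function names are illustrative.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def expected_max_independent(n: int, upper: float = 10.0, steps: int = 10000) -> float:
    """E[max of n iid N(0,1)] by trapezoid quadrature of the tail identity."""
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        x = j * h
        g = 1.0 - _STD.cdf(x) ** n - _STD.cdf(-x) ** n
        total += (0.5 if j in (0, steps) else 1.0) * g
    return total * h


def min_track_record_years(sharpe_annual: float, n_trials: int,
                           p: float = 0.95, periods_per_year: int = 252) -> float:
    """Minimum years so the selection-adjusted Gaussian PSR reaches p.

    ASSUMPTION: benchmark = E[M_N] / sqrt(T - 1), IID Gaussian returns,
    independent trials. Requires sharpe_annual > 0 and p > 0.5.
    """
    sr_p = sharpe_annual / math.sqrt(periods_per_year)  # per-period Sharpe
    z_p = _STD.inv_cdf(p)
    hurdle = z_p * math.sqrt(1.0 + 0.5 * sr_p ** 2) + expected_max_independent(n_trials)
    t_min = 1.0 + (hurdle / sr_p) ** 2                  # observations required
    return t_min / periods_per_year


for n in (1, 10, 100, 1000):
    print(n, [round(min_track_record_years(s, n), 4) for s in (1.0, 1.5, 2.0)])
```

With one trial the selection term vanishes and the calculation reduces to an ordinary PSR-based minimum track-record length against a zero benchmark.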
8. A controlled FDR experiment
To make the distinction between FWER and FDR concrete, we run a reproducible controlled experiment under stated assumptions. There is no market data and no Atamus data.
Each replication contains
\[ m = 500 \]
hypotheses. Of these,
\[ m_0 = 450 \]
are true nulls with p-values distributed as
\[ P_i \sim \mathrm{Uniform}(0,1), \]
and
\[ m_1 = 50 \]
are alternatives with p-values distributed as
The alternatives are deliberately stylized. They create a population with real effects, but the p-values remain model-implied. The experiment uses 250,000 independent replications with seed 20260627.
At \(q=10\%\), the independent BH theoretical upper bound is
\[ \frac{m_0}{m}\,q = 0.9 \times 0.10 = 9.0000\%. \]
The Monte Carlo estimate is
\[ \widehat{\mathrm{FDR}} = 8.9790\% \]
with Monte Carlo standard error
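The same simulation estimates expected discoveries as
\[ \widehat{\mathbb{E}}[R] = 12.5659 \]
with expected false discoveries
\[ \widehat{\mathbb{E}}[V] = 1.2561 \]
and expected genuine discoveries
\[ \widehat{\mathbb{E}}[S] = 11.3099. \]
Power, defined here as expected genuine discoveries divided by 50 alternatives, is
\[ \frac{\widehat{\mathbb{E}}[S]}{50} = 22.6197\%. \]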
| q | Monte Carlo FDR | Theoretical bound | Power | Expected discoveries | Expected false discoveries | Expected true discoveries |
|---|---|---|---|---|---|---|
| 1% | 0.8947% | 0.9000% | 10.4412% | 5.2796 | 0.0590 | 5.2206 |
| 2% | 1.7857% | 1.8000% | 13.0832% | 6.6851 | 0.1435 | 6.5416 |
| 3% | 2.6838% | 2.7000% | 14.9521% | 7.7194 | 0.2433 | 7.4760 |
| 4% | 3.5782% | 3.6000% | 16.4629% | 8.5873 | 0.3559 | 8.2315 |
| 5% | 4.4787% | 4.5000% | 17.7542% | 9.3578 | 0.4807 | 8.8771 |
| 6% | 5.3926% | 5.4000% | 18.9044% | 10.0696 | 0.6174 | 9.4522 |
| 7% | 6.2868% | 6.3000% | 19.9380% | 10.7310 | 0.7619 | 9.9690 |
| 8% | 7.1897% | 7.2000% | 20.9015% | 11.3690 | 0.9182 | 10.4507 |
| 9% | 8.0824% | 8.1000% | 21.7869% | 11.9756 | 1.0821 | 10.8935 |
| 10% | 8.9790% | 9.0000% | 22.6197% | 12.5659 | 1.2561 | 11.3099 |
| 11% | 9.8719% | 9.9000% | 23.4148% | 13.1476 | 1.4402 | 11.7074 |
| 12% | 10.7660% | 10.8000% | 24.1770% | 13.7227 | 1.6342 | 12.0885 |
| 13% | 11.6545% | 11.7000% | 24.9036% | 14.2885 | 1.8366 | 12.4518 |
| 14% | 12.5577% | 12.6000% | 25.6035% | 14.8539 | 2.0521 | 12.8017 |
| 15% | 13.4562% | 13.5000% | 26.2818% | 15.4182 | 2.2773 | 13.1409 |
| 16% | 14.3649% | 14.4000% | 26.9353% | 15.9815 | 2.5138 | 13.4677 |
| 17% | 15.2586% | 15.3000% | 27.5748% | 16.5468 | 2.7593 | 13.7874 |
| 18% | 16.1506% | 16.2000% | 28.1956% | 17.1116 | 3.0138 | 14.0978 |
| 19% | 17.0415% | 17.1000% | 28.8021% | 17.6806 | 3.2795 | 14.4011 |
| 20% | 17.9401% | 18.0000% | 29.3993% | 18.2579 | 3.5583 | 14.6996 |
This is not a claim about markets. It is a controlled experiment showing that FDR control is a statement about the expected composition of discoveries, not a guarantee about every realized discovery set.
For the same \(q=10\%\) experiment, the median realized FDP is
\[ 8.3333\% \]
while the 95th percentile is
\[ 25.0000\%. \]
The probability that a realized batch has
\[ \mathrm{FDP} > q \]
is
\[ 39.8344\%. \]
This does not contradict FDR control. It explains it. FDR is an expectation. A realized discovery set can be worse than the target while the procedure remains correct in expectation.
| Metric | Value |
|---|---|
| Mean FDP | 8.9790% |
| Median FDP | 8.3333% |
| 90th percentile FDP | 20.0000% |
| 95th percentile FDP | 25.0000% |
| 99th percentile FDP | 33.3333% |
| Probability FDP > q | 39.8344% |
| Probability FDP > 2q | 9.7080% |
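A scaled-down version of this kind of experiment fits in a few lines. The sketch below is not the experiment reported above and will not reproduce its power or percentile figures: the alternative law used here, \(P=U^{8}\) with \(U\) uniform, is an assumption made purely for illustration, as are the seed, the replication count, and the function names. What it does share with the text is the structure: 450 uniform nulls, 50 alternatives, independent p-values, and BH at \(q=10\%\), so the mean FDP should sit near \((m_0/m)q = 9\%\).

```python
import random


def bh_rejections(p_values, q):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_star = rank
    return order[:k_star]


def simulate_fdp(n_reps=2000, m0=450, m1=50, q=0.10, exponent=8.0, seed=12345):
    """Realized FDP per replication under independent p-values."""
    rng = random.Random(seed)
    fdps = []
    for _ in range(n_reps):
        # Indices 0 .. m0-1 are true nulls with Uniform(0,1) p-values.
        p = [rng.random() for _ in range(m0)]
        # ASSUMPTION: stylized alternatives, p = U**exponent (mass near zero).
        p += [rng.random() ** exponent for _ in range(m1)]
        rejected = bh_rejections(p, q)
        v = sum(1 for i in rejected if i < m0)  # false discoveries
        fdps.append(v / max(len(rejected), 1))
    return fdps


fdps = sorted(simulate_fdp())
n = len(fdps)
print("mean FDP      :", sum(fdps) / n)
print("median FDP    :", fdps[n // 2])
print("95th pct FDP  :", fdps[int(0.95 * n)])
print("P(FDP > q)    :", sum(f > 0.10 for f in fdps) / n)
```

Even in this toy version a sizeable share of batches exceeds the target while the average stays at the bound, which is the distinction between FDR and realized FDP.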
9. What the arithmetic prevents
Multiple-discovery arithmetic prevents a common error: treating the final survivor of a search as if it had been specified in advance.
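A selected result is conditioned on the family that produced it:
\[ \widehat\theta_{\text{sel}} = \widehat\theta_{i^*}, \qquad i^* = \arg\max_{1\le i\le m} \widehat\theta_i. \]
Even if
\[ \mathbb{E}[\widehat\theta_i] = 0 \quad \text{for every } i \]
under a nondegenerate symmetric null family with more than one effectively independent candidate, we generally have
\[ \mathbb{E}[\widehat\theta_{\text{sel}}] = \mathbb{E}\!\left[\max_{1\le i\le m}\widehat\theta_i\right] > 0. \]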
The result is not fraud. It is arithmetic. Searching changes the distribution of the selected statistic.
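A short simulation makes the point concrete. The sketch below assumes the simplest possible null family, independent standard-normal statistics with zero mean, and compares the average of a pre-specified statistic with the average of the best of ten. The function name, seed, and replication count are illustrative.

```python
import random


def mean_selected_statistic(m: int, n_reps: int = 20000, seed: int = 7) -> float:
    """Average of the maximum of m independent N(0,1) null statistics."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Every candidate has zero true effect; we keep only the best one.
        total += max(rng.gauss(0.0, 1.0) for _ in range(m))
    return total / n_reps


print("pre-specified (m=1):", mean_selected_statistic(1))
print("best of ten  (m=10):", mean_selected_statistic(10))
```

The pre-specified statistic averages close to zero, while the selected one averages roughly 1.5 standard errors above zero, with no effect present in any candidate.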
This is why Atamus treats research output as a conditional object. A result is conditioned on data, assumptions, model specification, dependence, search breadth, selection rule, implementation model, and risk constraints. Multiple discovery is the part of that problem concerned with the number and structure of claims examined before the surviving claim is named.
10. The institutional standard
The institutional standard is not to avoid research breadth. Serious quantitative research must examine alternatives, challenge assumptions, and discard weak claims. The standard is to account for that breadth.
A research process that tests one hypothesis at 5 percent is not equivalent to a process that tests 500 hypotheses at 5 percent and publishes the most attractive survivor. The latter must answer additional questions:
- What is the research family?
- Which hypotheses were tested?
- How dependent are they?
- What error criterion is being controlled?
- Is the target FWER, FDR, a deflated performance statistic, or another quantity?
- How much track record is required after accounting for selection?
- Which results remain credible under genuinely unseen data?
These questions do not reveal the source of an investment edge. They define the minimum public language of serious evidence.
11. Conclusion
A discovery is not just a small p-value. A selected Sharpe ratio is not just a Sharpe ratio. Both are outputs of a research family.
When the number of hypotheses grows, the probability of false discovery grows. Bonferroni and Sidak control the probability of at least one false discovery. Benjamini-Hochberg controls the expected false fraction among discoveries under its assumptions. Dependence changes the required correction. The expected maximum of a research family rises with search breadth. In finance, the same issue appears through selected Sharpe ratios, the deflated Sharpe ratio, and minimum track-record length.
The arithmetic is not optional. Without it, research can mistake selection for evidence. With it, a result must earn its status after the search that produced it has been brought back into the probability model.
Atamus does not publish this framework to disclose strategy mechanics. We publish it because any serious systematic research organization must understand the mathematics of discovery before it can speak responsibly about evidence.
References
[1] Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
[2] Benjamini, Y., and Yekutieli, D. (2001). The Control of the False Discovery Rate in Multiple Testing Under Dependency. Annals of Statistics, 29(4), 1165-1188.
[3] Dunn, O. J. (1961). Multiple Comparisons Among Means. Journal of the American Statistical Association, 56(293), 52-64.
[4] Sidak, Z. (1967). Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. Journal of the American Statistical Association, 62(318), 626-633.
[5] White, H. (2000). A Reality Check for Data Snooping. Econometrica, 68(5), 1097-1126.
[6] Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business and Economic Statistics, 23(4), 365-380.
[7] Bailey, D. H., and Lopez de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality. Journal of Portfolio Management, 40(5), 94-107.
[8] Lo, A. W. (2002). The Statistics of Sharpe Ratios. Financial Analysts Journal, 58(4), 36-52.
[9] Harvey, C. R., Liu, Y., and Zhu, H. (2016). ... and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1), 5-68.
[10] Storey, J. D. (2002). A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society: Series B, 64(3), 479-498.
Disclaimer
Research notes published by Atamus Capital are provided for general informational and research purposes only. They do not constitute investment advice, trading advice, a recommendation, an offer to sell, or a solicitation to buy any security, fund interest, account, or investment product.
This note does not disclose Atamus Capital's proprietary strategies, signals, feature definitions, datasets, data transformations, model architectures, candidate-generation methods, training procedures, parameters, portfolio construction methods, execution processes, investment universe, research thresholds, model-development workflow, or investment decisions.