# Ewens's sampling formula

In population genetics, **Ewens's sampling formula**, introduced by Warren Ewens, states that under certain conditions (specified below), if a random sample of *n* gametes is taken from a population and classified according to the gene at a particular locus, then the probability that there are $a_1$ alleles represented once in the sample, $a_2$ alleles represented twice, and so on, is

$$\operatorname{Pr}(a_1,\dots,a_n)={n! \over \theta(\theta+1)\cdots(\theta+n-1)}\prod_{j=1}^n{\theta^{a_j} \over j^{a_j} a_j!},$$

for some positive number $\theta$, whenever $a_1, \dots, a_n$ is a sequence of nonnegative integers such that

$$a_1+2a_2+3a_3+\cdots+na_n=n.$$
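As a rough illustration, the formula can be evaluated directly. The Python sketch below is not taken from the references; the function name `ewens_probability` and the convention of passing the counts as a tuple `(a_1, ..., a_n)` are assumptions made for this example.

```python
from math import factorial, prod


def ewens_probability(a, theta):
    """Evaluate Ewens's sampling formula for counts a = (a_1, ..., a_n).

    a[j - 1] is the number of alleles represented exactly j times in the
    sample; theta must be a positive number.
    """
    n = len(a)
    if theta <= 0:
        raise ValueError("theta must be positive")
    if any(a_j < 0 for a_j in a) or sum(j * a_j for j, a_j in enumerate(a, 1)) != n:
        raise ValueError("counts must satisfy a_1 + 2*a_2 + ... + n*a_n = n")
    # Denominator theta * (theta + 1) * ... * (theta + n - 1).
    rising = prod(theta + i for i in range(n))
    # Product over j of theta^a_j / (j^a_j * a_j!).
    weight = prod(theta ** a_j / (j ** a_j * factorial(a_j))
                  for j, a_j in enumerate(a, 1))
    return factorial(n) / rising * weight


# The three possible outcomes for a sample of n = 3 gametes.
outcomes = [(3, 0, 0), (1, 1, 0), (0, 0, 1)]
probabilities = [ewens_probability(a, theta=2.0) for a in outcomes]
print(probabilities, sum(probabilities))  # the three values sum to 1 (up to rounding)
```

Here `(3, 0, 0)` means three alleles each seen once, `(1, 1, 0)` means one allele seen once and another seen twice, and `(0, 0, 1)` means a single allele seen three times.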

The phrase "under certain conditions", used above, must be made precise. The assumptions are: (1) the sample size *n* is small compared with the size of the whole population; (2) the population is in statistical equilibrium under mutation and genetic drift, and the role of selection at the locus in question is negligible; and (3) every mutant allele is novel. (See also idealised population.)

This is a probability distribution on the set of all partitions of the integer *n*. Among probabilists and statisticians it is often called the **Ewens distribution**.

When $\theta = 0$, the probability is 1 that all *n* genes are the same. When $\theta = 1$, the distribution is precisely that of the integer partition induced by a uniformly distributed random permutation. As $\theta \rightarrow \infty$, the probability that no two of the *n* genes are the same approaches 1.
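These special cases can be checked numerically for a small sample. The sketch below is an illustration only: the helper names are hypothetical, $n = 4$ is an arbitrary choice, and the extreme cases are approximated with a very small and a very large value of $\theta$ rather than taken as exact limits.

```python
from collections import Counter
from itertools import permutations
from math import factorial, prod


def ewens_probability(a, theta):
    """Ewens's formula for counts a = (a_1, ..., a_n) and positive theta."""
    n = len(a)
    rising = prod(theta + i for i in range(n))
    weight = prod(theta ** a_j / (j ** a_j * factorial(a_j))
                  for j, a_j in enumerate(a, 1))
    return factorial(n) / rising * weight


def cycle_type(perm):
    """Return (a_1, ..., a_n), where a_j counts the cycles of length j in perm."""
    n = len(perm)
    seen = [False] * n
    a = [0] * n
    for start in range(n):
        length, i = 0, start
        while not seen[i]:  # walk the cycle containing `start`
            seen[i] = True
            i = perm[i]
            length += 1
        if length:
            a[length - 1] += 1
    return tuple(a)


n = 4
# Exact distribution of the cycle type of a uniform random permutation.
counts = Counter(cycle_type(p) for p in permutations(range(n)))
uniform = {a: c / factorial(n) for a, c in counts.items()}

# theta = 1: largest discrepancy between the two distributions.
max_gap = max(abs(uniform[a] - ewens_probability(a, 1.0)) for a in uniform)

# Very small theta: a single allele carried by all n genes, a_n = 1.
p_all_same = ewens_probability((0,) * (n - 1) + (1,), 1e-6)
# Very large theta: n distinct alleles, a_1 = n.
p_all_distinct = ewens_probability((n,) + (0,) * (n - 1), 1e6)

print(max_gap, p_all_same, p_all_distinct)
```

The first printed value should be zero up to rounding error, and the other two should be close to 1.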

This family of probability distributions enjoys the property that if after the sample of *n* is taken, *m* of the *n* gametes are chosen without replacement, then the resulting probability distribution on the set of all partitions of the smaller integer *m* is just what the formula above would give if *m* were put in place of *n*.
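One way to see this property at work is to check a single step exactly: delete one gamete chosen uniformly at random from a sample of size *n* and compare the result with the formula for *n* − 1. The Python sketch below does this with exact rational arithmetic; the helper names and the values $n = 5$, $\theta = 3/2$ are assumptions chosen for illustration. Repeating the step reduces the sample to any smaller size *m*.

```python
from fractions import Fraction
from math import factorial


def count_vectors(n):
    """Yield every (a_1, ..., a_n) with a_1 + 2*a_2 + ... + n*a_n = n."""
    def fill(j, remaining, acc):
        if j > n:
            if remaining == 0:
                yield tuple(acc)
            return
        for a_j in range(remaining // j + 1):
            yield from fill(j + 1, remaining - j * a_j, acc + [a_j])
    yield from fill(1, n, [])


def ewens_probability(a, theta):
    """Ewens's formula, evaluated exactly when theta is a Fraction."""
    n = len(a)
    p = Fraction(factorial(n))
    for i in range(n):
        p /= theta + i
    for j, a_j in enumerate(a, 1):
        p *= Fraction(theta) ** a_j / (j ** a_j * factorial(a_j))
    return p


def remove_one_gamete(n, theta):
    """Distribution on count vectors for n - 1 after deleting a random gamete."""
    result = {}
    for a in count_vectors(n):
        p = ewens_probability(a, theta)
        for j, a_j in enumerate(a, 1):
            if a_j == 0:
                continue
            # The deleted gamete lies in a class of size j with probability j*a_j/n.
            b = list(a)
            b[j - 1] -= 1          # that class no longer has size j ...
            if j > 1:
                b[j - 2] += 1      # ... it now has size j - 1
            b = tuple(b[:n - 1])   # the last slot is necessarily 0 now
            result[b] = result.get(b, 0) + p * Fraction(j * a_j, n)
    return result


n, theta = 5, Fraction(3, 2)
after_removal = remove_one_gamete(n, theta)
direct = {b: ewens_probability(b, theta) for b in count_vectors(n - 1)}
print(after_removal == direct)
```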

The Ewens distribution arises naturally from the Chinese restaurant process.
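The following Python sketch simulates that process and tabulates the resulting table sizes. It assumes the usual description of the Chinese restaurant process, which is not spelled out in this article: after *i* customers are seated, the next one starts a new table with probability $\theta/(\theta+i)$ and otherwise joins an existing table with probability proportional to its current size. The function name, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random
from collections import Counter


def chinese_restaurant_counts(n, theta, rng):
    """Seat n customers and return (a_1, ..., a_n), a_j = number of tables of size j."""
    table_sizes = []
    for seated in range(n):
        # New table with probability theta / (theta + seated).
        if rng.random() * (theta + seated) < theta:
            table_sizes.append(1)
        else:
            # Otherwise join an existing table, chosen in proportion to its size.
            k = rng.choices(range(len(table_sizes)), weights=table_sizes)[0]
            table_sizes[k] += 1
    a = [0] * n
    for size in table_sizes:
        a[size - 1] += 1
    return tuple(a)


rng = random.Random(0)
trials = 20000
freq = Counter(chinese_restaurant_counts(3, 1.0, rng) for _ in range(trials))
for a, count in sorted(freq.items()):
    print(a, count / trials)
```

With $n = 3$ and $\theta = 1$, Ewens's formula gives probabilities 1/6, 1/2 and 1/3 for `(3, 0, 0)`, `(1, 1, 0)` and `(0, 0, 1)` respectively, so the empirical frequencies should land near those values.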

## References

- Warren Ewens, "The sampling theory of selectively neutral alleles", *Theoretical Population Biology*, volume 3, pages 87–112, 1972.
- J.F.C. Kingman, "Random partitions in population genetics", *Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences*, volume 361, number 1704, 1978.
- S. Tavaré and W. J. Ewens, "The Ewens sampling formula". In *Multivariate discrete distributions* by N.L. Johnson, S. Kotz, and N. Balakrishnan (eds), 1997, Wiley.