Hypergeometric distribution
Introduction
In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.
A typical example is illustrated by this contingency table:
              drawn  not drawn      total
defective     k      m − k          m
nondefective  n − k  N + k − n − m  N − m
total         n      N − n          N
There is a shipment of N objects in which m are defective. The hypergeometric distribution describes the probability that in a sample of n distinct objects drawn from the shipment exactly k objects are defective.
In general, if a random variable X follows the hypergeometric distribution with parameters N, m and n, then the probability of getting exactly k successes is given by
$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}.$$
The probability is positive when k is between $\max(0,\, n + m - N)$ and $\min(m,\, n)$.
The formula can be understood as follows: There are $\binom{N}{n}$ possible samples (without replacement). There are $\binom{m}{k}$ ways to obtain k defective objects and there are $\binom{N-m}{n-k}$ ways to fill out the rest of the sample with nondefective objects.
The fact that the sum of the probabilities, as k runs through the range of possible values, is equal to 1, is essentially Vandermonde's identity from combinatorics.
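The normalization can be checked numerically. The following is a minimal sketch, assuming the probability is the ratio of binomial coefficients described above; the names `hypergeom_pmf` and `support` are illustrative and not taken from any library. Exact rational arithmetic is used so that the sum is exactly 1 rather than approximately 1.

```python
from fractions import Fraction
from math import comb


def support(N: int, m: int, n: int) -> range:
    """Values of k with positive probability: max(0, n + m - N) .. min(m, n)."""
    return range(max(0, n + m - N), min(m, n) + 1)


def hypergeom_pmf(k: int, N: int, m: int, n: int) -> Fraction:
    """P(X = k) for N objects, m of them defective, n drawn without replacement."""
    if k not in support(N, m, n):
        return Fraction(0)
    # (ways to pick k defective) * (ways to pick n - k nondefective) / (all samples)
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


# Hypothetical parameters chosen for illustration.
N, m, n = 50, 5, 10
total = sum(hypergeom_pmf(k, N, m, n) for k in support(N, m, n))
print(total)
```

Summing over the support is the content of Vandermonde's identity: the numerators add up to the denominator.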
Application and example
The classical application of the hypergeometric distribution is sampling without replacement. Think of an urn with two types of marbles, black ones and white ones. Define drawing a white marble as a success and drawing a black marble as a failure (analogous to the binomial distribution). If N describes the number of all marbles in the urn (see contingency table above) and m describes the number of white marbles (called defective in the example above), then N − m corresponds to the number of black marbles.
Now, assume that there are 5 white and 45 black marbles in the urn. Standing next to the urn, you close your eyes and draw 10 marbles without replacement. What is the probability that you draw exactly 4 white marbles (and, of course, 6 black marbles)?
This problem is summarized by the following contingency table:
               drawn               not drawn                             total
white marbles  4 (k)               1 = 5 − 4 (m − k)                     5 (m)
black marbles  6 = 10 − 4 (n − k)  39 = 50 + 4 − 10 − 5 (N + k − n − m)  45 (N − m)
total          10 (n)              40 (N − n)                            50 (N)
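The probability of drawing exactly k white marbles can be calculated by the formula
$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}.$$
Hence, in this example with k = 4, we calculate
$$P(X = 4) = \frac{\binom{5}{4}\binom{45}{6}}{\binom{50}{10}} = \frac{5 \cdot 8145060}{10272278170} \approx 0.003965.$$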
So the probability of drawing exactly 4 white marbles is quite low (approximately 0.004). If you repeated the random experiment (drawing 10 marbles from the urn of 50 marbles without replacement) 1000 times, you would expect to obtain such a result only about 4 times.
But what about the probability of drawing all 5 white marbles? Intuitively, this is even less likely than drawing 4 white marbles. Let us calculate the probability of such an extreme event.
The contingency table is as follows:
               drawn               not drawn                             total
white marbles  5 (k)               0 = 5 − 5 (m − k)                     5 (m)
black marbles  5 = 10 − 5 (n − k)  40 = 50 + 5 − 10 − 5 (N + k − n − m)  45 (N − m)
total          10 (n)              40 (N − n)                            50 (N)
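We can calculate the probability as follows (notice that the denominator stays the same):
$$P(X = 5) = \frac{\binom{5}{5}\binom{45}{5}}{\binom{50}{10}} = \frac{1 \cdot 1221759}{10272278170} \approx 0.0001189.$$
As expected, the probability of drawing 5 white marbles is much lower still than that of drawing 4 white marbles.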
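The two urn probabilities can be reproduced in a few lines. This is a minimal sketch using the example's numbers (50 marbles, 5 white, 10 drawn); the helper name `prob_white` is illustrative.

```python
from math import comb

N, m, n = 50, 5, 10  # marbles in the urn, white marbles, marbles drawn


def prob_white(k: int) -> float:
    """Probability of drawing exactly k white marbles among the n drawn."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


p4 = prob_white(4)
p5 = prob_white(5)
print(f"P(X = 4) = {p4:.6f}")
print(f"P(X = 5) = {p5:.6f}")
print(f"ratio P(X = 4) / P(X = 5) = {p4 / p5:.1f}")
```

The denominator `comb(N, n)` is shared by both calls, so the ratio of the two probabilities depends only on the numerators.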
Symmetries
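Write $f(k; N, m, n)$ for the probability that X = k when X is hypergeometric with parameters N, m and n.
$$f(k; N, m, n) = f(n - k; N, N - m, n)$$
This symmetry can be intuitively understood if you repaint all the black marbles white and vice versa, so that the black and white marbles simply change roles.
$$f(k; N, m, n) = f(m - k; N, m, N - n)$$
This symmetry can be intuitively understood as swapping the roles of taken and not taken marbles.
$$f(k; N, m, n) = f(k; N, n, m)$$
This symmetry can be intuitively understood if, instead of drawing marbles, you label the marbles that you would have drawn. Both expressions give the probability that exactly k marbles are both of the counted color and labeled "drawn".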
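The three symmetries can be verified exhaustively for a small case. The sketch below assumes the notation f(k; N, m, n) for the probability of k successes with population N, m success objects and n draws, and states the identities in the comments; the parameter values are hypothetical.

```python
from fractions import Fraction
from math import comb


def f(k: int, N: int, m: int, n: int) -> Fraction:
    """Hypergeometric probability of k successes; zero outside the support."""
    if k < max(0, n + m - N) or k > min(m, n):
        return Fraction(0)
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


N, m, n = 50, 5, 10
results = []
for k in range(0, n + 1):
    swap_colors = f(k, N, m, n) == f(n - k, N, N - m, n)   # repaint the marbles
    swap_drawn = f(k, N, m, n) == f(m - k, N, m, N - n)    # taken <-> not taken
    swap_roles = f(k, N, m, n) == f(k, N, n, m)            # label instead of draw
    results.append(swap_colors and swap_drawn and swap_roles)

print(all(results))
```

Exact fractions are used so that the comparisons test equality rather than closeness.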
Symmetric application
The metaphor of defective and drawn objects depicts an application of the hypergeometric distribution in which the interchange symmetry between n and m is not of foremost concern. Here is an alternative metaphor that brings this symmetry into sharper focus, as there are also applications where it serves no purpose to distinguish n from m.
Suppose you have a set of N children who have been identified with an unusual bone marrow antigen. The doctor wishes to conduct a heredity study to determine the inheritance pattern of this antigen. For the purposes of this study, the doctor wishes to draw bone marrow tissue from the biological mother and biological father of each child. This is an uncomfortable procedure, and not all the mothers and fathers will agree to participate. Of the mothers, m participate and N − m decline. Of the fathers, n participate and N − n decline.
We assume here that the decisions made by the mothers are independent of the decisions made by the fathers. Under this assumption, the doctor, who is given n and m, wishes to estimate k, the number of children for whom both parents have agreed to participate. The hypergeometric distribution can be used to determine the distribution of k. It is not obvious why the doctor would know n and m but not k. Perhaps n and m are dictated by the experimental design, while the experimenter is left blind to the true value of k.
It is important to recognize that, given N, n and m, a single degree of freedom partitions N into four subpopulations: children where both parents participate, children where only the mother participates, children where only the father participates, and children where neither parent participates. Knowing any one of these values determines the other three by simple arithmetic relations. For this reason, each of these quadrants is governed by an equivalent hypergeometric distribution. The mean, mode, and values of k contained within the support differ from one quadrant to another, but the size of the support, the variance, and the magnitudes of the higher-order central moments do not (odd central moments such as skewness may change sign).
For the purpose of this study, it might make no difference to the doctor whether the mother participates or the father participates. If this happens to be true, the doctor will view the result as a three-way partition: children where both parents participate, children where one parent participates, and children where neither parent participates. Under this view, the last remaining distinction between n and m has been eliminated. The distribution where one parent participates is the sum of the distributions where either parent alone participates.
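The arithmetic relations between the four quadrants can be made concrete. The sketch below assumes small hypothetical values for N, m and n, writes k for the number of children with both parents participating, and checks that each quadrant count follows a hypergeometric distribution with suitably swapped parameters and that all four share one variance.

```python
from fractions import Fraction
from math import comb


def pmf(k: int, N: int, m: int, n: int) -> Fraction:
    """Hypergeometric P(X = k): population N, m marked, n drawn."""
    if k < max(0, n + m - N) or k > min(m, n):
        return Fraction(0)
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))


N, m, n = 12, 7, 8  # hypothetical: children, participating mothers, participating fathers

# Each quadrant: (count as a function of k, parameters of its own distribution).
quadrants = {
    "both":        (lambda k: k,             (N, m, n)),
    "mother only": (lambda k: m - k,         (N, m, N - n)),
    "father only": (lambda k: n - k,         (N, N - m, n)),
    "neither":     (lambda k: N - m - n + k, (N, N - m, N - n)),
}

ks = range(max(0, n + m - N), min(m, n) + 1)
variances = {}
for name, (count, params) in quadrants.items():
    # The quadrant count has the hypergeometric distribution given by params.
    assert all(pmf(k, N, m, n) == pmf(count(k), *params) for k in ks)
    mean = sum(pmf(k, N, m, n) * count(k) for k in ks)
    variances[name] = sum(pmf(k, N, m, n) * (count(k) - mean) ** 2 for k in ks)
    print(f"{name:12s} mean = {float(mean):.3f}  variance = {float(variances[name]):.3f}")
```

Because every quadrant count is k or a constant minus k, the means differ while the variance is the same in all four.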
Symmetry and sampling
To express how the symmetry of the clinical metaphor degenerates to the asymmetry of the sampling language used in the drawn/defective metaphor, we will restate the clinical metaphor in the abstract language of decks and cards. We begin with a dealer who holds two prepared decks of N cards. The decks are labelled left and right. The left deck was prepared to hold n red cards and N − n black cards; the right deck was prepared to hold m red cards and N − m black cards.
These two decks are dealt out face down to form N hands. Each hand contains one card from the left deck and one card from the right deck. If we determine the number of hands that contain two red cards, by symmetry relations we will necessarily also know the hypergeometric distributions governing the other three quadrants: hand counts for red/black, black/red, and black/black. How many cards must be turned over to learn the total number of red/red hands? Which cards do we need to turn over to accomplish this? These are questions about possible sampling methods.
One approach is to begin by turning over the left card of each hand. For each hand showing a red card on the left, we then also turn over the right card in that hand. For any hand showing a black card on the left, we do not need to reveal the right card, as we already know this hand does not count toward the total of red/red hands. Our treatment of the left and right decks no longer appears symmetric: one deck was fully revealed while the other deck was partially revealed. However, we could just as easily have begun by revealing all cards dealt from the right deck, and partially revealed cards from the left deck.
In fact, the sampling procedure need not prioritize one deck over the other in the first place. Instead, we could flip a coin for each hand, turning over the left card on heads, and the right card on tails, leaving each hand with one card exposed. For every hand with a red card exposed, we reveal the companion card. This will suffice to allow us to count the red/red hands, even though under this sampling procedure neither the left nor right deck is fully revealed.
By another symmetry, we could also have elected to determine the number of black/black hands rather than the number of red/red hands, and discovered the same distributions by that method.
The symmetries of the hypergeometric distribution provide many options in how to conduct the sampling procedure to isolate the degree of freedom governed by the hypergeometric distribution. Even if the sampling procedure appears to treat the left deck differently from the right deck, or governs choices by red cards rather than black cards, it is important to recognize that the end result is essentially the same.
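The coin-flip procedure can be simulated to see that the red/red count follows the hypergeometric distribution. This is a rough sketch under stated assumptions: both decks are shuffled uniformly and independently, the deck sizes are hypothetical, and the function names are illustrative. Empirical frequencies will only approximate the exact probabilities.

```python
import random
from collections import Counter
from math import comb


def pmf(k: int, N: int, m: int, n: int) -> float:
    """Exact hypergeometric probability of k red/red hands."""
    if k < max(0, n + m - N) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def deal_and_count(N: int, n: int, m: int, rng: random.Random) -> int:
    """Deal N two-card hands and count red/red hands via the coin-flip procedure."""
    left = [True] * n + [False] * (N - n)    # True marks a red card
    right = [True] * m + [False] * (N - m)
    rng.shuffle(left)
    rng.shuffle(right)
    red_red = 0
    for l_card, r_card in zip(left, right):
        # A coin flip decides which card of the hand is exposed first.
        first, companion = (l_card, r_card) if rng.random() < 0.5 else (r_card, l_card)
        # The companion card is only turned over when the exposed card is red.
        if first and companion:
            red_red += 1
    return red_red


N, n, m = 20, 8, 6
trials = 20_000
rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(deal_and_count(N, n, m, rng) for _ in range(trials))

ks = range(0, min(m, n) + 1)
empirical = {k: counts[k] / trials for k in ks}
exact = {k: pmf(k, N, m, n) for k in ks}
for k in ks:
    print(f"k = {k}: simulated {empirical[k]:.4f}, exact {exact[k]:.4f}")
```

Which card is exposed first does not affect the count, which is why neither deck needs to be treated as the "sample".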
Relationship to Fisher's exact test
The test based on the hypergeometric distribution (the hypergeometric test) is identical to the corresponding one-tailed version of Fisher's exact test. Reciprocally, the p-value of a two-sided Fisher's exact test can be calculated as the sum of two appropriate hypergeometric tests.
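A one-tailed hypergeometric test reports the probability of a result at least as extreme as the one observed. The following is a minimal sketch, assuming the upper tail P(X ≥ k) is the tail of interest and reusing the urn numbers from the example above (N = 50, m = 5, n = 10, k = 4); the function names are illustrative.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, m: int, n: int) -> float:
    """P(X = k); zero outside the support."""
    if k < max(0, n + m - N) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def upper_tail_p_value(k: int, N: int, m: int, n: int) -> float:
    """One-tailed p-value P(X >= k): observing k or more successes."""
    return sum(hypergeom_pmf(j, N, m, n) for j in range(k, min(m, n) + 1))


# 4 white marbles observed among 10 drawn from 50 marbles, 5 of them white.
p_value = upper_tail_p_value(4, N=50, m=5, n=10)
print(f"one-tailed p-value = {p_value:.6f}")
```

For the corresponding 2×2 table `[[4, 1], [6, 39]]`, SciPy's `scipy.stats.fisher_exact` with `alternative="greater"` is expected to report the same one-sided p-value, although that comparison is not run here.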
Related distributions
Let X ~ Hypergeometric(N, m, n) and p = m/N.
 If n = 1, then X has a Bernoulli distribution with parameter p.
 If N and m are large compared to n and p is not close to 0 or 1, then $P(X \le x) \approx P(Y \le x)$, where Y has a binomial distribution with parameters n and p.
 If n is large, N and m are large compared to n, and p is not close to 0 or 1, then
$$P(X \le x) \approx \Phi\left(\frac{x - np}{\sqrt{np(1-p)}}\right),$$
where Φ is the standard normal distribution function.
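The quality of these approximations can be inspected numerically. The sketch below assumes X has population size N, m success objects and n draws, with p = m/N, and compares the exact cumulative probability with the binomial and normal approximations for one hypothetical parameter set; the function names are illustrative.

```python
from math import comb, erf, sqrt


def hypergeom_cdf(x: int, N: int, m: int, n: int) -> float:
    """Exact P(X <= x) for the hypergeometric distribution."""
    lo = max(0, n + m - N)
    numerator = sum(comb(m, k) * comb(N - m, n - k) for k in range(lo, min(x, m, n) + 1))
    return numerator / comb(N, n)


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(Y <= x) for a binomial distribution with parameters n and p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, min(x, n) + 1))


def normal_cdf(z: float) -> float:
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


N, m, n = 10_000, 4_000, 50   # N and m large compared to n
p = m / N
x = 20

exact = hypergeom_cdf(x, N, m, n)
binomial = binom_cdf(x, n, p)
normal = normal_cdf((x - n * p) / sqrt(n * p * (1 - p)))
print(f"exact {exact:.4f}, binomial {binomial:.4f}, normal {normal:.4f}")
```

With these values the binomial approximation is very close, while the normal approximation is noticeably coarser because no continuity correction is applied.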
Multivariate hypergeometric distribution
The model of an urn with black and white marbles can be extended to the case where there are more than two colors of marbles. If there are m_{i} marbles of color i in the urn and you take n marbles at random without replacement, then the number of marbles of each color in the sample (k_{1},k_{2},...,k_{c}) has the multivariate hypergeometric distribution.
The properties of this distribution are expressed in terms of c, the number of different colors, and $N = \sum_{i=1}^{c} m_{i}$, the total number of marbles.
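The multi-color case can be computed in the same way as the two-color case. This is a minimal sketch, assuming the standard product form of the probability mass function, $P(k_1, \dots, k_c) = \prod_{i=1}^{c} \binom{m_i}{k_i} \big/ \binom{N}{n}$ with $N = \sum_i m_i$ and $n = \sum_i k_i$; the function name and the urn contents are hypothetical.

```python
from fractions import Fraction
from math import comb, prod


def multivariate_hypergeom_pmf(ks: tuple, ms: tuple) -> Fraction:
    """Probability of drawing ks[i] marbles of color i from an urn holding ms[i] of each.

    The sample size is n = sum(ks) and the urn size is N = sum(ms).
    """
    if len(ks) != len(ms):
        raise ValueError("ks and ms must list the same number of colors")
    if any(k < 0 or k > m for k, m in zip(ks, ms)):
        return Fraction(0)
    N, n = sum(ms), sum(ks)
    return Fraction(prod(comb(m, k) for k, m in zip(ks, ms)), comb(N, n))


# Hypothetical urn: 5 black, 10 white and 15 red marbles; draw 6 in total.
urn = (5, 10, 15)
p_222 = multivariate_hypergeom_pmf((2, 2, 2), urn)
print(p_222, float(p_222))

# With two colors the formula reduces to the ordinary hypergeometric case.
p_two_colors = multivariate_hypergeom_pmf((4, 6), (5, 45))
print(float(p_two_colors))
```

With c = 2 the product has two factors, which recovers the urn example of 4 white and 6 black marbles.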
See also
 Binomial distribution
 Fisher's exact test
 Noncentral hypergeometric distributions
 Sampling (statistics)
 Urn problem
External links
 Hypergeometric Distribution Calculator
 Hypergeometric Distribution Calculator with source (Ruby, C++)