# DNA sequencing theory

**DNA sequencing theory** is the broad body of work that attempts to lay analytical foundations for DNA sequencing. The practical aspects revolve around designing and optimizing sequencing projects, predicting project performance, troubleshooting experimental results, characterizing factors such as sequence bias and the effects of software processing algorithms, and comparing various sequencing methods to one another. In this sense, it could be considered a branch of systems engineering or operations research. The permanent archive of work is primarily mathematical, although numerical calculations are often conducted for particular problems too. DNA sequencing theory addresses *physical processes* related to sequencing DNA and should not be confused with theories of analyzing resultant DNA sequences, e.g. sequence alignment. Publications^{[1]} sometimes do not make a careful distinction, but the latter are primarily concerned with algorithmic issues.

## Sequencing as a Covering Problem

All mainstream methods of DNA sequencing rely on reading small fragments of DNA and subsequently reconstructing these data to infer the original DNA target, either via assembly or alignment to a reference. The abstraction common to these methods is that of a mathematical covering problem^{[2]}. For example, one can imagine a line segment representing the target and a subsequent process where smaller segments are "dropped" onto random locations of the target. The target is considered "sequenced" when adequate coverage accumulates, for example when no gaps remain.

The abstract properties of covering have been studied by mathematicians for over a century^{[3]}. However, direct application of these results has not generally been possible. Closed-form mathematical solutions, especially for probability distributions, often cannot be readily evaluated. That is, they involve inordinately large amounts of computer time for parameters characteristic of DNA sequencing. Stevens' configuration is one such example^{[4]}. Results obtained from the perspective of pure mathematics also do not account for factors that are actually important in sequencing, for instance detectable overlap in sequencing fragments, double-stranding, edge-effects, and target multiplicity. Consequently, development of sequencing theory has proceeded more according to the philosophy of applied mathematics. In particular, it has been problem-focussed and makes expedient use of approximations, simulations, etc.

## Early Uses Derived From Elementary Probability Theory

The earliest result was actually borrowed directly from elementary probability theory. If we model the above process and take **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): L**
and **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): G**
as the fragment length and target length, respectively, then the probability of "covering" any given location on the target *with one particular fragment* is **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): L / G**
. Note that this presumes **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): L \ll G**
, which is valid for many, though not all sequencing scenarios. Utilizing concepts from the Binomial distribution^{[5]}, it can then be shown that the probability that the location is covered by at least one of **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): N**
fragments is

**Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): P = 1 - \left[1 - \frac{L}{G}\right]^N.**

This equation was first used to characterize plasmid libraries^{[6]}, but is often more useful in a modified form. For most projects **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): N \gg 1**
, so that, to a good degree of approximation

**Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): \left[1 - \frac{L}{G}\right]^N \sim \exp(-NL/G),**

where **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): R = NL/G**
is called the *redundancy*. Note the significance of redundancy as representing the average number of times a position is covered with fragments. Note also that in considering the covering process over all positions in the target, this probability is identical to the expected value of the random variable **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): C**
, which represents the fraction of the target coverage. The final result,

**Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): E\langle C \rangle = 1 - e^{-R},**

remains in widespread use as a "back of the envelope" estimator and predicts that coverage for all projects evolves along a universal curve that is a function only of the redundancy.

## Lander-Waterman Theory

In 1988, Eric Lander and Michael Waterman published an important paper^{[7]} examining the covering problem from the standpoint of gaps. Although they focused on the so-called mapping problem, the abstraction to sequencing is much the same. They furnished a number of useful results that were adopted as the standard theory from the earliest days of "large-scale" genome sequencing^{[8]}. Their model was also used in designing the Human Genome Project and continues to play an important role in DNA sequencing.

Ultimately, the main goal of a sequencing project is to close all gaps, so the "gap perspective" was a logical basis of developing a sequencing model. One of the more frequently used results from this model is the expected number of contigs, given the number of fragments sequenced. If one neglects the amount of sequence that is essentially "wasted" by having to detect overlaps, their theory yields

**Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): E\langle contigs \rangle = N e^{-R}.**

In 1995, Roach^{[9]} proposed a model that appeared to be essentially different and asserted that Lander-Waterman theory gave contradictory results for large values of **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): R**
. Wendl and Waterston^{[10]} later showed, based on Stevens' method^{[4]}, that subtle differences in interpretation explained the anomalies and that both models were indeed essentially identical and consistent.

The basic ideas of Lander-Waterman theory led to a number of additional results for particular variations in mapping techniques^{[11]}^{[12]}^{[13]}. However, technological advancements have rendered mapping and its more esoteric theories largely obsolete.

## Recent Advancements

The physical processes and protocols of DNA sequencing have continued to evolve, largely driven by advancements in bio-chemical methods, hardware, and automation techniques. There is now a wide range of problems that DNA sequencing has made in-roads into, including metagenomics and medical (cancer) sequencing. There are important factors in these scenarios that classical theory does not account for. Recent work has begun to focus on resolving the effects of some of these issues. The level of mathematics becomes commensurately more sophisticated.

### Multiplicity

Biologists have developed methods to filter highly-repetitive, essentially un-sequenceable regions of genomes. These procedures are important for organisms whose genomes consist mostly of such DNA, for example corn. They yield multitudes of small islands of sequenceable DNA products. Wendl and Barbazuk^{[14]} proposed an extension to Lander-Waterman Theory to account for "gaps" in the target due to filtering and the so-called "edge-effect". The latter is a position-specific sampling bias, for example the terminal base position has only a **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): 1 / G**
chance of being covered, as opposed to **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): L / G**
for interior positions. For **Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://api.formulasearchengine.com/v1/":): R < 1**
, classical Lander-Waterman Theory still gives good predictions, but dynamics change for higher redundancies.

### Paired-End Sequencing

Modern sequencing methods usually sequence both ends of a larger fragment, which provides linking information for *de novo* assembly and improved probabilities for alignment to reference sequence. Researchers generally believe that longer lengths of data (read lengths) enhance performance for very large DNA targets, an idea consistent with predictions from distribution models^{[15]}. However, Wendl^{[16]} showed that smaller fragments provide better coverage on small, linear targets because they reduce the edge effect in linear molecules. These findings have implications for sequencing the products of DNA filtering procedures. Read-pairing and fragment size evidently have negligible influence for large, whole-genome class targets.

### Diploid Sequencing

Sequencing is emerging as an important tool in medicine, for example in cancer research. Here, the ability to detect heterozygous mutations is important and this can only be done if the sequence of the diploid genome is obtained. In the pioneering efforts to sequence individuals, Levy *et al.*^{[17]} and Wheeler *et al.*^{[18]}, who sequenced Craig Venter and Jim Watson, respectively, outlined models for covering both alleles in a genome. Wendl and Wilson^{[19]} followed with a more general theory that allowed for an arbitrary number of coverings of each allele and arbitrary ploidy. These results point to the general conclusion that the amount of data needed for such projects is significantly higher than for traditional haploid projects.

## Limitations

DNA sequencing theories often invoke the assumption that certain random variables in a model are independently and identically distributed. For example, in Lander-Waterman Theory, a sequenced fragment is presumed to have the same probability of covering each region of a genome and all fragments are assumed to be independent of one another. In actuality, sequencing projects are subject to various types of bias, including differences of how well regions can be cloned, sequencing anomalies, biases in the target sequence (which is *not* random), and software-dependent errors and biases. In general, theory will agree well with observation up to the point that enough data have been generated to expose latent biases^{[19]}. The kinds of biases related to the underlying target sequence are particularly difficult to model, since the sequence itself may not be known *a priori*. This presents a type of "chicken and egg" closure problem.

## Academic Status

Sequencing theory is based on elements of mathematics, biology, and[systems engineering, so it is highly interdisciplinary. Although many universities now have programs in computational biology, there does not yet seem to be a strong focus at the graduate level on this topic. Academic contributions have mainly been limited to a small number of PhD dissertations^{[20]}.

## See also

## References

- ↑ Waterman, M.S. (1995).
*Introduction to Computational Biology*. Chapman and Hall/CRC: Boca Raton. - ↑ Hall, P. (1988).
*Introduction to the Theory of Coverage Processes*. Wiley: New York. - ↑ Solomon, H. (1978).
*Geometric Probability*. Society for Industrial and Applied Mathematics: Philadelphia. - ↑
^{4.0}^{4.1}Stevens, W.L. (1939). "Solution to a Geometrical Problem in Probability".*Annals of Eugenics*.**9**: 315–320. - ↑ Feller, W. (1968).
*Introduction to Probability Theory and Its Applications (3rd Ed.)*. Wiley. - ↑ Clarke, L. and Carbon, J. (1976). "A Colony Bank Containing Synthetic Col-El Hybrid Plasmids Representative of the Entire E. coli Genome".
*Cell*.**9**: 91–99. doi:10.1016/0092-8674(76)90055-6. - ↑ Lander, E.S. and Waterman, M.S. (1988). "Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis".
*Genomics*.**2**: 231–239. - ↑ Fleischmann, R.D.; et al. (1995). "Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd".
*Science*.**269**: 496–512. doi:10.1126/science.7542800. - ↑ Roach, J.C. (1995). "Random Subcloning".
*Genome Research*.**5**: 464–473. doi:10.1101/gr.5.5.464. - ↑ Wendl, M.C. and Waterston, R.H. (2002). "Generalized Gap Model for Bacterial Artificial Chromosome Clone Fingerprint Mapping and Shotgun Sequencing".
*Genome Research*.**12**: 1943–1949. doi:10.1101/gr.655102. - ↑ Arratia, R.; et al. (1991). "Genomic Mapping by Anchoring Random Clones: A Mathematical Analysis".
*Genomics*.**11**: 806–827. doi:10.1016/0888-7543(91)90004-X. - ↑ Port, E.; et al. (1995). "Genomic Mapping by End-Characterized Random Clones: A Mathematical Analysis".
*Genomics*.**26**: 84–100. doi:10.1016/0888-7543(95)80086-2. - ↑ Zhang, M.Q. and Marr, T.G. (1993). "Genome Mapping by Nonrandom Anchoring: A Discrete Theoretical Analysis".
*Proceedings of the National Academy of Sciences*.**90**: 600–604. doi:10.1073/pnas.90.2.600. - ↑ Wendl, M.C. and Barbazuk, W.B. (2005). "Extension of Lander-Waterman Theory for Sequencing Filtered DNA Libraries".
*BMC Bioinformatics*.**6**: article 245. doi:10.1186/1471-2105-6-245. - ↑ Wendl, M.C. (2006). "Occupancy Modeling of Coverage Distribution for Whole Genome Shotgun DNA Sequencing".
*Bulletin of Mathematical Biology*.**68**: 179–196. doi:10.1007/s11538-005-9021-4. - ↑ Wendl, M.C. (2006). "A General Coverage Theory for Shotgun DNA Sequencing".
*Journal of Computational Biology*.**13**: 1177–1196. doi:10.1089/cmb.2006.13.1177. - ↑ Levy, S.; et al. (2007). "The Diploid Genome Sequence of an Individual Human".
*PLoS Biology*.**5**: article e254. doi:10.1371/journal.pbio.0050254. - ↑ Wheeler, D.A.; et al. (2008). "The Complete Genome of an Individual by Massively Parallel DNA Sequencing".
*Nature*.**452**: 872–876. doi:10.1038/nature06884. - ↑
^{19.0}^{19.1}Wendl, M.C. and Wilson, R.K. (2008). "Aspects of Coverage in Medical DNA Sequencing".*BMC Bioinformatics*.**9**: article 239. doi:10.1186/1471-2105-9-239. - ↑ Roach, J.C. (1998). "
*Random Subcloning, Pairwise End Sequencing, and the Molecular Evolution of the Vertebrate Trypsinogens*". PhD Dissertation, University of Washington.