From Kogic.net
<p><span style="font-size: small"><span style="color: #000000">In bioinformatics, <b>sequence assembly</b> refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. </span></span></p><p><span style="font-size: small"><span style="color: #000000">This is needed because DNA sequencing technology cannot read whole genomes in one go; instead it reads small pieces of between 20 and 1000 bases, depending on the technology used. </span></span></p><p><span style="font-size: small"><span style="color: #000000">Typically the short fragments, called <b>reads</b>, result from shotgun sequencing of genomic DNA or of gene transcripts ([[EST]]s).</span></span></p><p><b><span style="font-size: small"><span style="color: #000000">Sequence assembly as reconstructing a book</span></span></b></p><p> </p>
<p><span style="font-size: small"><span style="color: #000000">The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only the shredded pieces. The book may have many repeated paragraphs, and some shreds may have been altered so that they contain typos. Excerpts from another book may be mixed in, and some shreds may be completely unrecognizable.</span></span></p>
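<p><span style="font-size: small"><span style="color: #000000">The idea of piecing shreds back together can be illustrated with a minimal sketch of greedy overlap merging. This is not the method of any particular assembler; the function names <code>overlap</code> and <code>greedy_assemble</code> and the <code>min_len</code> parameter are assumptions made for illustration, and the sketch assumes error-free reads from a single strand with no repeats.</span></span></p>

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Illustrative only: assumes error-free, same-strand reads.
    Stops when no pair overlaps by at least `min_len` characters.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGTAC", "GCGTACGTT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

<p><span style="font-size: small"><span style="color: #000000">In the book analogy, each read is a shred and each merge joins two shreds whose edges match. Repeated paragraphs and typos are exactly the cases where such a simple greedy strategy can go wrong.</span></span></p>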
<p> </p>
<h2><span style="color: #000000"><span id="Genome_assemblers" class="mw-headline">Genome assemblers</span></span></h2>
<p><span style="font-size: small"><span style="color: #000000">The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs, built to piece together the vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. </span></span></p><p><span style="font-size: small"><span style="color: #000000">As the sequenced organisms grew in size and complexity (from small viruses, through plasmids and bacteria, to eukaryotes), assembly programs needed increasingly sophisticated strategies to handle:</span></span></p>
<ul>
<li><span style="font-size: small"><span style="color: #000000">terabytes of sequencing data which need processing on computing clusters;</span> </span></li>
<li><span style="font-size: small"><span style="color: #000000">identical and nearly identical sequences (known as <i>repeats</i>) which can, in the worst case, increase the time and space complexity of algorithms exponentially;</span> </span></li>
<li><span style="font-size: small"><span style="color: #000000">and errors in the fragments from the sequencing instruments, which can confound assembly.</span> </span></li>
</ul>
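<p><span style="font-size: small"><span style="color: #000000">The difficulty caused by repeats can be seen in a small sketch. When a read ends inside a repeated sequence, several other reads may continue it equally well, and overlap information alone cannot tell which continuation is correct. The helper names <code>suffix_prefix_overlap</code> and <code>candidate_successors</code> below are hypothetical and used only for illustration.</span></span></p>

```python
from typing import Dict, List


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` matching a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def candidate_successors(read: str, reads: List[str], min_len: int = 5) -> Dict[str, int]:
    """Map every other read that could follow `read` to its overlap length."""
    result = {}
    for other in reads:
        if other == read:
            continue
        k = suffix_prefix_overlap(read, other, min_len)
        if k > 0:
            result[other] = k
    return result


# The repeat "CGTAC" occurs in two places with different right-hand contexts.
reads = ["AAACGTAC", "CGTACCCT", "CGTACGGA"]
print(candidate_successors("AAACGTAC", reads))
```

<p><span style="font-size: small"><span style="color: #000000">Here both candidate reads overlap the first read by the same five bases, so an assembler needs additional evidence, such as longer reads or higher coverage, to choose between them.</span></span></p>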
<p><span style="font-size: small"><span style="color: #000000">Faced with the challenge of assembling the first larger eukaryotic genomes, that of the fruit fly [[Drosophila melanogaster]] in <b>2000</b> and the human genome in <b>2001</b>, scientists developed assemblers such as Celera Assembler<sup id="cite_ref-0" class="reference">[1]</sup> and [[Arachne]]<sup id="cite_ref-1" class="reference">[2]</sup>, able to handle genomes of <b>100,000,000 to 300,000,000</b> base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as [[AMOS]]<sup id="cite_ref-2" class="reference">[3]</sup> was launched to bring together all the innovations in genome assembly technology under an open source framework.</span></span></p>
<h2><span style="font-size: medium"><span style="color: #000000"><span id="EST_assemblers" class="mw-headline">EST assemblers</span></span></span></h2>
<p><span style="font-size: small"><span style="color: #000000">EST assembly differs from genome assembly in several ways. The sequences for EST assembly are derived from the transcribed mRNA of a cell and represent only a subset of the whole genome. At first glance, the underlying algorithmic problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequence, mainly in the intergenic regions. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, cells tend to have a certain number of genes that are constantly expressed at very high levels (housekeeping genes), which again leads to the problem of similar sequences being present in large numbers in the data set to be assembled.</span></span></p>