Assembly: Solving Really Big Puzzles
One of the primary duties of a bioinformatician is to combine little pieces of DNA into bigger pieces. When scientists sequence the genome of a species, it doesn't come out of a machine in one magical lump. Sequencing machines (which read DNA sequences) produce lots of little sequences of DNA (strings of A's, T's, G's, and C's), each 50-700 base pairs (bps) long. They spit out millions of them. The challenge for bioinformatics is to assemble those millions of short reads into the full sequence of the genome. Imagine shredding a textbook and putting the pieces back together. This process is called (no surprise here) assembly, since we're assembling pieces of DNA into a larger sequence. It really made a splash in 2003, when the human genome was sequenced.
More Than One Way to Skin a Genome
The Human Genome Project was really a race between two projects, both attempting to assemble the entire human genetic sequence. The older, classical method used by the government's team worked like this:
- Get DNA from a person
- Break it into pieces
- Clone pieces into plasmids
- Put plasmids in bacteria
- Isolate individual bacterial colonies (each colony has one piece of human DNA on a plasmid)
- Sequence each plasmid from each isolated colony
- Put pieces together
Shotgun Sequencing
"He puzzled and puzzled 'til his puzzler was sore ..." - Dr. Seuss
A private company led by Craig Venter joined the race in the late 1990s. Their big idea, primarily devised by bioinformatician Eugene Myers, was to [1]:
- Get DNA from a person
- Break it into pieces
- Sequence all the pieces at once
- Put them back together using a complicated computer program (the first assembler)
(Figure reproduced from reference 1.)
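To get a feel for the "break it into pieces" step, here is a toy sketch in Python. It is only an illustration, not how a sequencing lab or the original project worked: the function name `shotgun_reads`, its parameters, and the tiny 27-bp "genome" are all made up for this example, and real reads also contain errors that this ignores.

```python
import random

def shotgun_reads(genome, n_reads, min_len, max_len, seed=0):
    """Sample n_reads random substrings ("reads") from genome.

    Hypothetical helper: every read is an error-free copy of a random
    stretch of the genome, which is a big simplification.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    reads = []
    for _ in range(n_reads):
        # Pick a read length, never longer than the genome itself.
        length = rng.randint(min_len, min(max_len, len(genome)))
        # Pick a start position so the read fits inside the genome.
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    return reads

genome = "CGAATCGTCGCAATTCGTCGTCGCTCG"  # toy genome, 27 bps
reads = shotgun_reads(genome, n_reads=10, min_len=8, max_len=16)
for read in reads:
    print(read)
```

Notice that the reads come back in no particular order and with no record of where they came from. That lost position information is exactly what the assembler has to work out.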
Example Assembly
Check out this example:
Sequence 1: AATTCGTCGTCGCTCG
Sequence 2: CGAATCGTCGCAATTC
These sequences overlap, like so:
CGAATCGTCGCAATTC
           AATTCGTCGTCGCTCG
and can be combined into a single sequence:
CGAATCGTCGCAATTCGTCGTCGCTCG
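The same merge can be sketched in a few lines of Python. This is a minimal illustration, assuming error-free reads and exact matches; the function names `overlap_length` and `merge` are made up for this example and are not taken from any real assembler.

```python
def overlap_length(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def merge(left, right, k):
    """Join two sequences that overlap by k bases, keeping the overlap once."""
    return left + right[k:]

seq1 = "AATTCGTCGTCGCTCG"
seq2 = "CGAATCGTCGCAATTC"

k = overlap_length(seq2, seq1)   # the end of Sequence 2 meets the start of Sequence 1
contig = merge(seq2, seq1, k)
print(k)        # 5
print(contig)   # CGAATCGTCGCAATTCGTCGTCGCTCG
```

The order of the arguments matters: `overlap_length(seq2, seq1)` asks whether the end of Sequence 2 matches the start of Sequence 1, which is the overlap shown above.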
This was done over and over until many of the small pieces were swallowed up into bigger pieces (bigger sequences are called "contigs" in bioinformatese). Two pieces can overlap by random chance, so to reduce the risk of joining them by accident, overlaps had to be big (much bigger than the 5 bps in our example). If bases were random, any particular 5-bp stretch would turn up about once in every 4^5 = 1,024 positions, which is very often when you have millions of reads. Eugene Myers' team nicknamed contigs by size, from biggest to smallest: rocks, stones, and pebbles. By joining contigs one by one, the whole genome could be reconstructed.
The idea is simple enough, right? What makes this hard in real life isn't the difficulty of the concepts; it's the sheer, overwhelming amount of data. The human genome is 3 BILLION bps long! Craig Venter's team needed a supercomputer to run the software that assembled the human genome. Assembly is now a commonplace part of biology, and there are many genomes much larger than 3 billion bps. No wonder CLC Bio has the saying: "Rocket Science is for kids, Bioinformatics is for scientists".
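Here is a rough Python sketch of that "merge over and over" loop. To be clear, this is a simple greedy strategy chosen for illustration, not the algorithm the original assembler used. It assumes error-free reads that all come from the same strand, and it does not deal with repeats or with reads buried in the middle of other reads. The names `greedy_assemble` and `min_overlap`, and the third read ending in GATTACA, are invented for this example.

```python
def overlap_length(left, right, min_overlap):
    """Longest suffix of `left` matching a prefix of `right` (0 if too short)."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_overlap` bases and
    returns whatever contigs are left.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        # Compare every ordered pair and remember the best overlap.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(left, right, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)

reads = [
    "AATTCGTCGTCGCTCG",   # Sequence 1 from the example
    "CGAATCGTCGCAATTC",   # Sequence 2 from the example
    "TCGTCGCTCGGATTACA",  # made-up third read that extends Sequence 1
]
print(greedy_assemble(reads, min_overlap=5))

# Demanding a bigger overlap leaves the two example reads unjoined.
print(greedy_assemble(reads[:2], min_overlap=6))
```

Comparing every pair on every pass is fine for three reads but hopeless for millions, which is part of why real assemblers need much cleverer data structures and a lot of computing power.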
Rocket Science is for Kids - Try Bioinformatics
For more juicy details about the people and science involved in the Human Genome Project, check out this book:
The Genome War: How Craig Venter Tried to Capture the Code of Life and Save the World by James Shreeve