BioBeans: Bioinformatics: Genome Assembly

Assembly: Solving Really Big Puzzles

One of the primary duties of a bioinformatician is to combine little pieces of DNA into bigger pieces. When scientists sequence the genome of a species, the genome doesn't come out of a machine in one magical lump. Sequencing machines (which read DNA sequences) produce lots of little sequences of DNA (strings of A's, T's, G's, and C's) 50-700 base pairs (bps) long, called reads. They spit out millions of them. The challenge of bioinformatics is to assemble those millions of short reads into the full sequence of the genome. Imagine shredding a textbook and putting the pieces back together. This process is called (no surprise here) Assembly, since we're assembling pieces of DNA into a larger sequence. This process really made a splash in 2003 when the human genome was sequenced ...
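To get a feel for the "shredded textbook" problem, here is a minimal Python sketch that chops a toy sequence into random short reads. The function name `shred`, the toy genome, and the read lengths are all made up for illustration; a real sequencer also makes errors and reads both DNA strands, which this sketch ignores.

```python
import random


def shred(genome, n_reads, min_len, max_len, seed=0):
    """Sample n_reads random substrings ("reads") from genome.

    Hypothetical helper for illustration only: real reads also contain
    sequencing errors and come from both strands.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    reads = []
    for _ in range(n_reads):
        # Pick a read length, then a start position where the read still fits.
        length = rng.randint(min_len, min(max_len, len(genome)))
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    return reads


genome = "CGAATCGTCGCAATTCGTCGTCGCTCG"  # toy "genome", far shorter than a real one
reads = shred(genome, n_reads=10, min_len=5, max_len=8)
for read in reads:
    print(read)
```

The assembler's job is the reverse: given only `reads`, recover `genome`.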

More Than One Way to Skin a Genome

The Human Genome Project was really a race between two projects, both attempting to assemble the entire human genetic sequence. The older, classical method used by the government's team worked like this:

[Image: Colony picker robot.]

1. Get DNA from a person
2. Break it into pieces
3. Clone pieces into plasmids
4. Put plasmids in bacteria
5. Isolate individual bacterial colonies (each colony has one piece of human DNA on a plasmid)
6. Sequence each plasmid from each isolated colony
7. Put pieces together

This method was highly accurate but very, very slow. Each bacterial colony had to be picked by a robot, then stored in refrigerators and accessed later by another robot. Sequencing each individual colony took a lot of time and a lot of money.

Shotgun Sequencing

"He puzzled and puzzed 'til his puzzler was sore ..." - Dr. Seuss

A private company led by Craig Venter joined the race in the late 1990s. Their big idea, primarily devised by bioinformatician Eugene Myers, was to [1]:

1. Get DNA from a person
2. Break it into pieces
3. Sequence all the pieces at once
4. Put them back together using a complicated computer program (the first assembler)

[Image reproduced from reference 1.]

At the time, sequencing everything at once and putting it back together was thought to be impossible. It certainly couldn't be done by hand; it had to be done by a computer. Eugene's program worked by comparing every piece to every other piece. Whenever two pieces shared enough overlapping sequence, they were combined.
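The "compare every piece to every other piece" step can be sketched in a few lines of Python. This is not Myers' actual program, just a toy version of the idea: the names `overlap_length` and `all_overlaps`, the three example reads, and the `min_overlap` threshold are assumptions for illustration, and the sketch only accepts exact matches on a single strand.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if the best overlap is shorter than min_overlap.
    """
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def all_overlaps(reads, min_overlap=3):
    """Compare every read to every other read, in both orders."""
    found = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i == j:
                continue  # a read is never compared with itself
            k = overlap_length(left, right, min_overlap)
            if k > 0:
                found[(i, j)] = k
    return found


reads = ["GATTAC", "TACAGG", "AGGCAT"]  # made-up toy reads
overlaps = all_overlaps(reads)
print(overlaps)  # {(0, 1): 3, (1, 2): 3}
```

Each key `(i, j)` says that the end of read `i` matches the start of read `j`, which is exactly the information needed to decide which pieces to join.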

Example Assembly

Check out this example:

Sequence 1: AATTCGTCGTCGCTCG

Sequence 2: CGAATCGTCGCAATTC

These sequences overlap, like so:

CGAATCGTCGCAATTC
           AATTCGTCGTCGCTCG

and can be combined into a single sequence:

CGAATCGTCGCAATTCGTCGTCGCTCG
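Here is the same example as a small Python sketch. The helper name `merge` and the `min_overlap` default are made up for illustration; the two sequences are the ones from the example.

```python
def merge(left, right, min_overlap=3):
    """Join `left` and `right` where the end of `left` matches the start of `right`.

    Returns (merged_sequence, overlap_length), or (None, 0) if the longest
    exact overlap is shorter than min_overlap.
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            # Keep all of `left`, then add only the part of `right` past the overlap.
            return left + right[k:], k
    return None, 0


seq1 = "AATTCGTCGTCGCTCG"
seq2 = "CGAATCGTCGCAATTC"

# The end of Sequence 2 matches the start of Sequence 1, so Sequence 2 goes first.
merged, overlap = merge(seq2, seq1)
print(overlap)  # 5 (the shared AATTC)
print(merged)   # CGAATCGTCGCAATTCGTCGTCGCTCG
```

Note that the order matters: `merge(seq1, seq2)` finds nothing, because the end of Sequence 1 and the start of Sequence 2 only share two bases (CG), which is below the threshold.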

This was done over and over until many of the small pieces were swallowed up into bigger pieces (bigger sequences are called "contigs" in bioinformatese). Two pieces can overlap by random chance, so to reduce the risk of joining them by accident, overlaps had to be big (bigger than the 5 bps in our example). Eugene Myers' team nicknamed contigs by size: small = rock, smaller = stone, smallest = pebble. By joining contigs one by one, the whole genome could be reconstructed.
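The "over and over" part can be sketched as a greedy loop: find the pair of pieces with the biggest overlap, join them, and repeat until nothing overlaps enough. This is a simplified stand-in, not the algorithm from reference 1; the function names, the toy reads, and the `min_overlap` values are assumptions for illustration, and a greedy strategy like this can be fooled by repeated sequence in a real genome.

```python
def overlap_length(left, right, min_overlap):
    """Longest suffix of `left` equal to a prefix of `right` (0 if too short)."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=4):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        # Compare every contig with every other contig, in both orders.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(left, right, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining pair overlaps by at least min_overlap
        merged = contigs[best_i] + contigs[best_j][best_k:]
        # Replace the two merged contigs with the single bigger one.
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTA", "GCGTACCT", "ACCTGAAT", "GAATCGG"]  # made-up toy reads
contigs = greedy_assemble(reads, min_overlap=4)
print(contigs)  # ['ATGGCGTACCTGAATCGG']

# Demand bigger overlaps than these toy reads have, and nothing gets joined.
unmerged = greedy_assemble(reads, min_overlap=6)
print(len(unmerged))  # 4
```

Raising `min_overlap` makes accidental joins less likely, at the price of leaving more pieces unjoined.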

The idea is simple enough, right? It's not the concepts that make this hard in real life; it's the sheer, overwhelming amount of data. The human genome is 3 BILLION bps long! Craig Venter's team needed a supercomputer to run the software that assembled the human genome. Assembly is now a commonplace part of biology, and there are many genomes much larger than 3 billion bps. No wonder CLC Bio has the saying: "Rocket Science is for kids, Bioinformatics is for scientists".
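A little back-of-the-envelope arithmetic shows why the data volume is the hard part, and why a 5 bp overlap would never do. The numbers below are rough assumptions, not figures from the actual project: 500 bp reads (inside the 50-700 bp range mentioned earlier), just enough reads to cover the genome once, and a "random" genome where all four bases are equally likely. Real projects sequence each position several times over, and real genomes are not random.

```python
genome_length = 3_000_000_000  # human genome, in bps
read_length = 500              # assumed read length
coverage = 1                   # assumed: each base sequenced just once

# Number of reads needed to cover the genome `coverage` times.
n_reads = genome_length * coverage // read_length

# Comparing every read to every other read, in both orders.
n_pairs = n_reads * (n_reads - 1)


def expected_chance_overlaps(pairs, k):
    """Expected number of pairs whose k-base ends match purely by chance.

    Assumes random sequence with equally likely bases, so a specific
    k-base match has probability (1/4) ** k.
    """
    return pairs / 4 ** k


print(n_reads)                                 # 6000000
print(n_pairs)                                 # 35999994000000
print(expected_chance_overlaps(n_pairs, 5))    # tens of billions of accidental matches
print(expected_chance_overlaps(n_pairs, 40))   # far less than one
```

With a 5 bp threshold, chance matches would swamp the real ones; with a much longer required overlap (40 is an arbitrary illustration here), they all but vanish under these assumptions.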


For more juicy details about the people and science involved in the Human Genome Project, check out this book:

The Genome War: How Craig Venter Tried to Capture the Code of Life and Save the World
by James Shreeve

1. The original paper where Eugene Myers describes his assembly algorithm is:

Myers EW, Sutton GG, Delcher AL, et al. (2000). A Whole-Genome Assembly of Drosophila. Science, 287:2196-2204.
