Is there a way to measure the amount of bytes that are possible to encode in a DNA molecule?


When I saw a DNA molecule for the first time, it kinda reminded me of a hard drive. It consists of slots and there are some possible combinations for each slot; in the hard drive these possible combinations would be 0's and 1's. In DNA, these slots would be G's, A's, T's, C's.

So, is there a way to measure the amount of bytes that are encoded in a DNA molecule?

I've asked this question before in another forum, but the answerer provided me only with Shannon's theorem, which is $K = L - \frac{(1-q^L)^n}{q^L}$, and told me a little about genetic redundancy. I could only search for the number of slots present in the DNA, but this genetic redundancy thing got me stuck.


Unfortunately the answer is highly dependent on what you mean. In the simplest terms, comparing it directly to how we measure data storage in digital media, the number of different states a DNA string of length $n$ can have is simply $4^n$. A byte holds $2^8$ different states, so the number of bytes in a DNA string of length $n$ is $\frac{n}{4}$. Of course, actually accessing this information would be more difficult than simply having it in a strand of DNA.
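In Python, that arithmetic looks like this (the function names are mine, purely for illustration):

```python
def dna_states(n):
    """Number of distinct sequences a DNA string of n bases can take: 4^n."""
    return 4 ** n

def dna_bytes(n):
    """Raw capacity in bytes: 2 bits per base, 8 bits per byte, so n/4."""
    return n * 2 / 8

# A 2-base strand has 16 possible states; a 100-base strand holds 25 bytes raw.
assert dna_states(2) == 16
assert dna_bytes(100) == 25.0
```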

DNA in real organisms is not random, however: sequences are not uniformly distributed, which means you could compress that information into fewer bits, Shannon-information style.

However, you can reasonably argue that this doesn't really tell us how much information is in real DNA, because real DNA has structure that matters: it has exons and introns and promoter regions and so on. Meanwhile, the sequences in protein-coding regions are, for the most part, much more important than sequences in non-coding regions. Large parts of the genome are functionally irrelevant, but because of this they tend to be more random and thus have higher Shannon information.


You might also be interested in this paper from EMBL-EBI about storing data on DNA.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

They show they can get 757,051 bytes (a Shannon information of 5.2 × 10^6 bits) onto 153,335 strings of DNA, each comprising 117 nucleotides (nt).

George Church had a similar paper recently as well - Science


DNA Data Storage Is About To Go Viral

In 1862, Gregor Mendel bred pea plants to study inheritance. Fast forward 100 years to 1962, when James Watson, Francis Crick, and Maurice Wilkins were awarded a Nobel Prize for discovering the structure of DNA. Today, advances in this field are spilling over into the most unlikely places.

As we enter the century of biotechnology, our ability to read, write, and edit DNA is disrupting everything from human health to manufacturing. The next disruption to take place could be in the world of data storage.

Tech giants such as Facebook and Amazon, together with their millions of users, generate petabytes of data on the Internet every second. Microsoft has been quietly working in the background to store this information in As, Ts, Cs, and Gs instead of 0s and 1s.

“Think of compressing all the information on the accessible Internet into a shoebox,” says Karin Strauss, a principal researcher at Microsoft. “With DNA data storage, that’s possible.”

Strauss is working with Luis Ceze, a professor of computer science and engineering at the University of Washington, to wield DNA for data storage and computing. Using synthetic DNA molecules, the team has successfully stored over one gigabyte of readable information, including various forms of media such as the top 100 books from Project Gutenberg, a high-definition OK Go music video, and the #MemoriesInDNA project.

The information density of DNA is remarkable — just one gram can store 215 petabytes, or 215 million gigabytes, of data. For context, the average hard drive in a laptop can house just one millionth of that amount.

“We encode all data at a molecular level, making it as small as possible, and store it in a medium that will last for quite a while and not become obsolete, like floppy disks, because of its eternal relevance for life,” says Strauss.

Researchers Luis Ceze and Karin Strauss. Photo: Tara Brown Photography/UW

The Rise of DNA Data Storage

Year | Project | Team | Size
1988 | "Microvenus" | Joe Davis with Harvard and UC Berkeley | 28 base pairs
2011 | Encoding 70 billion copies of Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves in DNA | Church Lab @ the Wyss Institute | 5.27 MB
2016 | OK Go music video, top 100 books in Project Gutenberg, and more | Microsoft Research/UW | Over 1 GB
2018 | Storing and retrieving information with template-free polymerase | Molecular Assemblies | 150 base pairs
2019 | Writing "hello" with fully automated end-to-end DNA data storage and computing | Microsoft Research/UW | 5 bytes
2019 | Encoding all of Wikipedia (in English) | Catalog | 16 GB
2019 | Long oligonucleotides | Twist Bioscience | 300 nt

Improved techniques for reading and writing DNA, including an increase in the length of strands of DNA usable for these purposes, have facilitated the rapid increase in the amount of possible data storage in DNA.

In addition to pioneering high-density data storage, Ceze and Strauss also conducted a similarity search between images using DNA, and recently created the first fully-automated, writing-to-reading DNA storage system.

“We’re trying to make computers better with a systematic approach that finds great alternatives and solutions in nature,” says Strauss. “The computational approaches facilitated by working with DNA make it an even more attractive option for data storage,” adds Ceze. “We have the freedom to choose how to map bits to DNA sequences, creating redundancy and high tolerance to error when reading and writing DNA.”

How does this technology work? It’s surprisingly simple. Data is first translated from a code of 0s and 1s to As, Ts, Cs, and Gs. This genetic code is then synthesized into an actual molecule (with the help of Twist Bioscience for the Microsoft Research-UW team), and the “encoding” process is complete.
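A heavily simplified sketch of that translation step in Python. The 00→A, 01→C, 10→G, 11→T assignment here is an arbitrary choice for illustration; real encodings (including the Microsoft Research-UW scheme) add error correction and avoid long runs of the same base:

```python
# Map each pair of bits to one of the four bases (arbitrary assignment,
# for illustration only).
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn binary data into a DNA sequence, 4 bases per byte, MSB first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)

def decode(seq: str) -> bytes:
    """Reverse the mapping: read the bases back into bytes."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)

assert encode(b"\x00") == "AAAA"       # all-zero bits become all-A
assert decode(encode(b"hi")) == b"hi"  # round trip recovers the data
```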

Retrieving data is a bit more complex. Two steps — "processing" and "decoding" — must occur. Simulating random-access memory (RAM), a polymerase chain reaction (PCR, a common laboratory protocol for copying DNA) homes in on a targeted section of the sequence, which is then replicated, sequenced, decoded, and adjusted for errors to retrieve the original data. This targeted approach is efficient because it involves only the desired sequence, rather than the entire dataset.

The rise of DNA data storage, previously the stuff of science fiction, is being made possible by advances in biotechnology, particularly improvements in high-throughput DNA sequencing and synthesis. Also, because these bio-programmers control what materials enter their experiments, and their sequences do not need to be meticulously engineered to function within a living organism, overhead costs are lower than in typical life science experiments. The journey has not been without roadblocks, however. Despite dramatic improvement, working with DNA can be slow and expensive. Further streamlining is still needed.

“Automation was, and is, one of our biggest challenges,” says Strauss. “It was great to have our first proof of concept converting information from bits, to DNA, and back to bits to prove that it was possible and also show what are our other challenges in automation, but some of the biotechnology aspects are quite new to some of us, so we’ve also been learning a lot there. The other significant challenges are continuing to increase throughput and decrease the cost for DNA sequencing and synthesis. There’s quite a bit of engineering left to get [us] to where we need to be.”

The interdisciplinary Microsoft and UW team sees value in its diverse background. “It’s extremely exciting that this is at the intersection of biotech and computing,” says Ceze. “These areas have been feeding off each other.”

“If the technology continues to advance the way we see it right now,” he says, “I think it’s conceivable that we will see DNA storage as a form of archival for the general public within the decade.”

Thank you to Aishani Aatresh for additional research and reporting in this post. Aishani is a senior at Saint Francis High School in Mountain View, CA, a TEDx speaker and event director, co-founder of LancerHacks, and a researcher at Distributed Bio developing computational immunoengineering methods to generate superior antibodies.

Disclaimer: I am the founder of SynBioBeta, the innovation network for the synthetic biology industry. Some of the companies that I write about are sponsors of the SynBioBeta conference (click here for a full list of sponsors).


How do you store data on DNA?

How do scientists store data on DNA and why are they doing it?

Data could be safely stored for millions of years in DNA in this global seed vault at Svalbard, Norway (Source: Dag Terje Filip Endresen/Wikimedia Commons)

Earlier this year, Swiss researchers reported they developed a technique for storing text, audio, images and video for millions of years, coded into DNA and embedded in glass spheres.

Bioinformatics expert Associate Professor Jonathan Keith of Monash University says such efforts are being driven by the fact that current methods of storing data have a finite life.

Paper and microfilm might survive for over 500 years, but information on CDs and computer disks can often be corrupted, especially as they need to be updated into different formats as technology changes.

"Electronic media is not necessarily safer than paper media," says Keith.

So as human civilisation generates ever greater quantities of data, scientists are working on ever more reliable means of storing it long term.

Since the 1990s, a handful of papers have reported efforts to preserve our archives by coding them into DNA molecules.

"All digital images, videos, audio files and text are reduced to strings of zeros and ones," says Keith. "With DNA you have four different bases that make up the molecule so you have a four-character code instead of a two-character code. But the principle is the same."

A computer is used to translate what is required — whether it be a colour, a position, or a letter of the alphabet — into a particular sequence, using the four-character DNA code.

No living organisms are involved in creating the DNA code. Rather, synthetic DNA molecules with the sequence of required bases are created from scratch.

After a period of storage, a computer is then used to decode the data.

Why store information on DNA?

Storing information on DNA might seem a bit leftfield but, as Keith says, DNA has been used as an "information storage device" in living organisms for millions of years.

And ancient DNA of woolly mammoths, bears and humans dug out of the permafrost suggests DNA has the ability to last a very long time in cold storage — tens to hundreds of thousands of years.

"If you think of it in that way, then it becomes natural to try and use that molecule for our purposes as well," says Keith.

Also, because DNA is a molecule, it takes up far less space than other storage formats.

"The human genome is three billion bases in length and that is stored as a small number of molecules in every cell nucleus," says Keith.

"We're talking about vast amounts of data crammed into minute volumes."

One cup of DNA could store 100 million hours of HD video, say Nick Goldman and Ewan Birney of the European Bioinformatics Institute of the European Molecular Biology Laboratory.

In 2013, their team reported the successful storage of 739 kilobytes of data in DNA — including a colour image, Shakespeare's 154 sonnets, an excerpt from Martin Luther King's "I have a dream" speech and the classic 1953 paper on DNA structure by Watson and Crick.

How long will information on DNA last?

Goldman and Birney suggest the technology they developed could eventually be used to store data for up to 50 years.

In February 2015, another team led by Dr Robert Grass from the Institute for Chemical and Bioengineering at ETH Zurich encoded the Swiss Federal Charter from 1291 and the English translation of the ancient Archimedes Palimpsest on "The Method of Mechanical Theorems" into DNA.

While this is only 83 kilobytes of data, Grass and colleagues say they found a way to store the DNA for millions of years.

Keith says the Swiss team's paper provides two advances. First, they have stored the information with 'error correcting codes'.

"It includes some redundancy in the coding so information can be recovered even if some corruption of the stored information occurs," he says.

"This extends the lifetime of data storage because it means we can keep recovering the data even once the DNA starts to degrade."

Second, Grass and colleagues have stored the DNA-encoded information in 'synthetic silica fossilisation technology' - in other words glass spheres.

"The data is very stable when it is stored in that form," says Keith.

To test the reliability of this encoding and storage method Grass and colleagues heated the DNA-encoded data encased in glass spheres to 70°C for one week and found they could still recover the original data error free.

"The rate at which the data degrades depends on the temperature," says Keith.

Their study suggests data could be stored for 2,000 years at 9.4°C, or for two million years at −18°C in the Svalbard Global Seed Vault in Norway.

"Both papers are significant advances," says Keith. "Neither of them gives us technology we're going to go out and buy at the supermarket tomorrow but both of them are big steps towards a functional very long term storage device."

When will DNA storage be widely used?

Before DNA data storage can be more widely used, researchers need to work out how to store multiple megabytes of data — the aim is to store zettabytes (a sextillion, or 10^21, bytes), says Keith.

And the cost of encoding and synthesising DNA needs to come down — in 2013 it cost US$12,400 for each megabyte of data.

But Keith expects significant advances could be made on both fronts within a decade.

"These things move exponentially rapidly."

For now, storing data in the form of DNA will not be for the everyday person.

"At the moment it looks like the technology is going to be useful mainly for important data that needs to be stored for a long time but doesn't have to be accessed frequently."

This may include important government and cultural information, but Keith believes a major application will be storage of data generated by scientific projects.

He cites the Large Hadron Collider, which on its own generates a staggering 15 petabytes (a petabyte is 1,000 terabytes, or 10^15 bytes) of data per year.

"It generates huge amounts of data but a very limited number of researchers actually work with that, and it's the kind of thing where a decade from now they might come back to data generated recently looking for something in particular."

Associate Professor Jonathan Keith spoke with Anna Salleh






11 Answers

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary from each other by less than 1%, they can also be losslessly compressed to roughly 4 megabytes by storing only the differences from a reference genome.
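A quick back-of-the-envelope check of that 725 MB figure (using decimal megabytes):

```python
# 2 bits per base pair, 8 bits per byte, decimal megabytes.
base_pairs = 2_900_000_000          # haploid human genome
bits = base_pairs * 2               # 5.8 billion bits
megabytes = bits / 8 / 1_000_000    # bytes, then MB
print(round(megabytes))             # prints 725
```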

You do not store all the DNA in one stream; rather, most of the time it is stored by chromosome.

A large chromosome takes about 300 MB and a small one about 50 MB.

I think the first reason why it is not saved as 2 bits per base pair is that it would be a hurdle to working with the data. Most people would not know how to convert it. And even when a conversion program is provided, a lot of people in large companies or research institutes are not allowed to, or do not know how to, install programs.

1 GB of storage costs nothing; even downloading 3 GB takes only 4 minutes at 100 Mbit/s, and most companies have faster connections.

Another point is that the data isn't as simple as you get told.

e.g. the sequencing method pioneered by Craig Venter was a great breakthrough but has its downsides. It could not separate long chains of the same base, so it is not always 100% clear whether there are 8 A's or 9 A's — something you have to take care of later on.

Another example is DNA methylation, because you can't store this information in a 2-bit representation.

Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, that comes to 2 × 2.9 billion = 5.8 billion bits, or roughly 725 megabytes.

I'm no expert; however, the Human Genome page on Wikipedia states the following:

I'm not sure where their variance comes from, but I'm sure you can figure it out.

Yes, the minimum RAM needed for the whole human DNA is about 770 MB. However, the 2-bit representation is impractical: it is hard to search through or do computations on. Therefore some mathematicians designed more effective ways to store those sequences of bases and use them in search and comparison algorithms, such as GARLI (www.bio.utexas.edu/faculty/antisense/garli/garli.html). This application is running on my PC right now, so I can tell you that in practice it has the DNA stored in about 1,563 MB.

The human genome contains 2.9 billion base pairs. So if you represented each base pair as a byte, it would take 2.9 billion bytes, or 2.9 GB. You could come up with a more compact way of storing base pairs, as each base pair only requires 2 bits; storing 4 base pairs per byte brings the total down to less than a GB.

2.9 billion bits is around 350 MB – SDGuero Apr 22 '14 at 23:01

Just did it too. The raw sequence is about 700 MB. Using a fixed sequence-storage algorithm, and the fact that the differences between genomes are about 1%, I calculated about 120 MB with a per-chromosome sequence-offset state-delta storage. That's it for the storage.

There are 4 nucleotide bases that make up our DNA: A, C, G, T. Therefore each base in the DNA takes up 2 bits. There are around 2.9 billion bases, so that's around 700 megabytes. The weird thing is that this would fill a normal data CD! Coincidence?

Most answers, except those by users slayton, rauchen, and Paul Amstrong, are dead wrong if it's about pure one-to-one storage without compression techniques.

The human genome, with 3 Gb (gigabases) of nucleotides, corresponds to 3 GB of bytes at one byte per base, not 750 MB. The constructed "haploid" genome according to NCBI is currently 3,436,687 kb, or 3.436687 Gb, in size. Check here for yourself.

Haploid = a single copy of each chromosome. Diploid = two copies of the haploid set. Humans have 22 unique chromosomes × 2 = 44. The male 23rd pair is X and Y, making 46 in total; the female 23rd pair is X and X, likewise 46 in total.

For males it would be 23 + 1 chromosomes in data storage on an HDD, and for females 23 chromosomes, explaining the small differences mentioned now and then in the answers. The X chromosome from males is identical to the X chromosome from females.

Thus loading the genome (23 + 1) into memory is done in parts via BLAST, using databases constructed from FASTA files. Zipped or not, nucleotide sequences are hard to compress. Back in the early days, one of the tricks used was to replace tandem repeats (e.g. GACGACGAC with a shorter coding such as "3GAC": 9 bytes down to 4 bytes). The reason was to save hard-drive space (the era of 500 MB-2 GB HDD platters with 7,200 rpm and SCSI connectors). For sequence searching this was also done with the query.
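That tandem-repeat trick can be sketched in a few lines of Python (compress_repeats is a hypothetical name; real tools used more elaborate schemes):

```python
import re

def compress_repeats(seq: str, motif: str) -> str:
    """Collapse runs of a motif into '<count><motif>': 'GACGACGAC' -> '3GAC'."""
    def shrink(match):
        # Count how many copies of the motif the run contains.
        return f"{len(match.group(0)) // len(motif)}{motif}"
    # Only runs of 2+ copies are worth collapsing.
    return re.sub(f"(?:{motif}){{2,}}", shrink, seq)

assert compress_repeats("GACGACGAC", "GAC") == "3GAC"      # 9 chars -> 4
assert compress_repeats("TTGACGACTT", "GAC") == "TT2GACTT" # only the run shrinks
```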

If "coded nucleotide" storage were 2 bits per letter, you would get four letters per byte. Only this way do you fully profit from all 8 bit positions of each byte. For example, the combination 00.01.10.11 (as the byte 00011011) would then correspond to "ACTG" (and show up in a text file as an unrecognizable character). This alone is responsible for the four-fold reduction in file size we see in other answers. Thus 3.436687 Gb will be downsized to 0.85917175 GB.
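A minimal Python check of that byte-filling claim, using the answer's own 00/01/10/11 = A/C/T/G assignment (pack4 is a name made up for illustration):

```python
# The answer's mapping: A=00, C=01, T=10, G=11.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}

def pack4(quad: str) -> int:
    """Pack four bases into one byte, first base in the high bits."""
    byte = 0
    for base in quad:
        byte = (byte << 2) | CODE[base]
    return byte

assert pack4("ACTG") == 0b00011011  # 27: four bases in a single byte
```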

860 MB, including a then-required conversion program (23 kb-4 MB).

But in biology you want to be able to read something, so gzip compression is more than enough: unzipped, you can still read it. If this byte-packing were used, the data would become much harder to read. That's why FASTA files are plain-text files in reality.


Organelle DNA

Not all genetic information is found in nuclear DNA.

A dominant allele is an allele that is almost always expressed, even if only one copy is present.

The Globin Genes: An Example of Transcriptional Regulation

An example of transcriptional control occurs in the family of genes responsible for the production of globin.

Transcription, the synthesis of an RNA copy from a sequence of DNA, is carried out by an enzyme called RNA polymerase.

The beginning of translation, the process in which the genetic code carried by mRNA directs the synthesis of proteins from amino acids, differs slightly for prokaryotes and eukaryotes, although both processes always initiate at a codon for methionine.

So, the possible allele combinations result in a particular blood type in this way:
OO = blood type O
AO = blood type A
BO = blood type B
AB = blood type AB
AA = blood type A
BB = blood type B

You can see that a person with blood type B may have a B and an O allele, or they may have two B alleles.

Gene Switching: Turning Genes On and Off

The estimated number of genes for humans, less than 30,000, is not so different from the 25,300 known genes of Arabidopsis thaliana, commonly called mustard grass.

Mendel's Principles of Genetic Inheritance

Law of Segregation: Each of the two inherited factors (alleles) possessed by the parent will segregate and pass into separate gametes (eggs or sperm) during meiosis, which will each carry only one of the factors.

The Core Gene Sequence: Introns and Exons

Genes make up about 1 percent of the total DNA in our genome.

Molecular Genetics: The Study of Heredity, Genes, and DNA

As we have just learned, DNA provides a blueprint that directs all cellular activities and specifies the developmental plan of multicellular organisms.

The Physical Structure of the Human Genome

Inside each of our cells lies a nucleus, a membrane-bounded region that provides a sanctuary for genetic information.

Francis Crick

Although DNA is the carrier of genetic information in a cell, proteins do the bulk of the work.

Gene Prediction Using Computers

When the complete mRNA sequence for a gene is known, computer programs are used to align the mRNA sequence with the appropriate region of the genomic DNA sequence.


Mendel's Laws-How We Inherit Our Genes

In 1866, Gregor Mendel studied the transmission of seven different pea traits by carefully test-crossing many distinct varieties of peas.

From One Gene-One Protein to a More Global Perspective

Only a small percentage of the 3 billion bases in the human genome becomes an expressed gene product.

A class of sequences called regulatory sequences makes up a numerically insignificant fraction of the genome but provides critical functions.

Mechanisms of Genetic Variation and Heredity

Does Everyone Have the Same Genes?

If the child is BB or BO, they have blood type B. If the child is OO, he or she will have blood type O.

Pleiotropism, or pleotrophy, refers to the phenomenon in which a single gene is responsible for producing multiple, distinct, and apparently unrelated phenotypic traits, that is, an individual can exhibit many different phenotypic outcomes.

Nuclear DNA

Inside each of our cells lies a nucleus, a membrane-bounded region that provides a sanctuary for genetic information.

The Influence of DNA Structure and Binding Domains

Sequences that are important in regulating transcription do not necessarily code for transcription factors or other proteins.

Controlling Transcription

Promoters and Regulatory Sequences

Transcription is the process whereby RNA is made from DNA.

From Genes to Proteins: Start to Finish

We just discussed that the journey from DNA to mRNA to protein requires that a cell identify where a gene begins and ends.

Forty to forty-five percent of our genome is made up of short sequences that are repeated, sometimes hundreds of times.

Expression of Inherited Genes

Gene expression, as reflected in an organism's phenotype, is based on conditions specific for each copy of a gene.

If the child is BB or BO, they have blood type B. If the child is OO, he or she will have blood type O.

Pleiotropism, or pleotrophy, refers to the phenomenon in which a single gene is responsible for producing multiple, distinct, and apparently unrelated phenotypic traits, that is, an individual can exhibit many different phenotypic outcomes.

Mendel's Laws-How We Inherit Our Genes

In 1866, Gregor Mendel studied the transmission of seven different pea traits by carefully test-crossing many distinct varieties of peas.

The Influence of DNA Structure and Binding Domains

Sequences that are important in regulating transcription do not necessarily code for transcription factors or other proteins.

Although DNA is the carrier of genetic information in a cell, proteins do the bulk of the work.

Not all genetic information is found in nuclear DNA.


Conclusions

So will DNA ever be used to solve a traveling salesman problem with a higher number of cities than can be done with traditional computers? Well, considering that the record is a whopping 13,509 cities, it certainly will not be done with the procedure described above. The group that set that record needed only three months, using three Digital AlphaServer 4100s (a total of 12 processors) and a cluster of 32 Pentium II PCs. The solution was possible not because of brute-force computing power, but because they used some very efficient branching rules. This first demonstration of DNA computing used a rather unsophisticated algorithm, but as the formalism of DNA computing becomes refined, new algorithms may one day allow DNA to overtake conventional computation and set a new record.
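
To see why brute-force enumeration is hopeless at that scale, note that a symmetric traveling salesman problem on n cities has (n-1)!/2 distinct tours. A quick sketch (the 7-city count matches Adleman's original DNA experiment, which used a 7-node graph):

```python
import math

def num_tours(n: int) -> int:
    """Distinct tours in a symmetric TSP on n cities:
    fix the starting city, divide by 2 for direction."""
    return math.factorial(n - 1) // 2

print(num_tours(7))              # 360 -- enumerable with DNA strands
# For the 13,509-city record, even writing the count down is hard:
print(len(str(num_tours(13509))))  # tens of thousands of digits
```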

On the side of the "hardware" (or should I say "wetware"), improvements in biotechnology are happening at a rate similar to the advances made in the semiconductor industry. For instance, look at sequencing: what once took a graduate student 5 years to do for a Ph.D. thesis takes Celera just one day. With the amount of government-funded research dollars flowing into genetics-related R&D, and with the large potential payoffs from the lucrative pharmaceutical and medical markets, this isn't surprising. Just look at the number of advances in DNA-related technology that happened in the last five years. Today we have not one but several companies making "DNA chips," where DNA strands are attached to a silicon substrate in large arrays (for example, Affymetrix's GeneChip). MEMS production technology is advancing rapidly, allowing for novel integrated small-scale DNA-processing devices. The Human Genome Project is producing rapid innovations in sequencing technology. The future of DNA manipulation is speed, automation, and miniaturization.

And of course we are talking about DNA here, the genetic code of life itself. It certainly has been the molecule of this century and most likely will be of the next one. Considering all the attention that DNA has garnered, it isn’t too hard to imagine that one day we might have the tools and talent to produce a small integrated desktop machine that uses DNA, or a DNA-like biopolymer, as a computing substrate along with a set of designer enzymes. Perhaps it won’t be used to play Quake IV or surf the web -- things that traditional computers are good at -- but it certainly might be used in the study of logic, encryption, genetic programming and algorithms, automata, language systems, and lots of other interesting things that haven't even been invented yet.


Mastering Biology: Chapter 1

All living things share a common genetic language of DNA because they share a common ancestry.

*The genetic code is arbitrary, at least to some extent. The fact that all organisms share a single genetic code is due to their common ancestry. Read about the core theme of biology: Evolution accounts for the unity and diversity of life.

Metabolic cooperation between prokaryotic cells forms a biofilm that allows bacterial colonies to transport nutrients and wastes. Biofilms may damage industrial equipment or cause tooth decay.

*This property emerges at the community level, due to the interactions among prokaryotic species forming the biofilm. Read about levels of biological organization and emergent properties.

A tree and its physical environment alter each other.

*This answer is the most accurate statement. Read the themes about the interactions among organisms and their physical environment.

It is possible to test hypotheses, such as those involving historical events, without conducting experiments.

*Although it is not possible to carry out experiments to test hypotheses about evolutionary relationships between living groups or about the timing of the origin of major evolutionary innovations, such hypotheses can be evaluated by making predictions about the expected findings that would result from these hypotheses. Data can then be collected to test whether these predictions are correct. Read about the forms of scientific inquiry and the scientific method.

You learned in elementary school that as temperature drops, liquids change into solid form. You are given an unfamiliar liquid and hypothesize that it will become solid if you put it in the freezer.

*This is deductive reasoning. You are predicting specific results based on general principles.

Current Events: Genome Detectives Solve a Hospital's Deadly Outbreak (New York Times, 8/22/2012)

About how many people die in the U.S. each year from hospital-acquired infections?

You are an epidemiologist specializing in antibiotic-resistant strains of bacteria. For the best hope in saving a patient's life, you want to stop the infection before it gets where?

Which of the following is true?

Bacteria cannot live in the human body.

Once inside the human body, bacteria cannot mutate.

Bacteria can mutate within the human body.

Which of the following did scientists discover about Klebsiella pneumoniae?

Which of the following would have most likely prevented so many patient deaths from Klebsiella pneumoniae at this National Institutes of Health hospital?

Building Vocabulary: Word Roots - Metric Prefixes

Can you match these prefixes, suffixes, and word roots with their definitions?

Current Events: Scientific Articles Accepted (Personal Checks, Too) (New York Times, 4/7/2013)

Which of the following would likely be the best for limiting the increasing amount of questionable journals publishing studies of unknown scientific value?

You are a graduate student in plant physiology and wish to learn how to discern reputable scientific publications from those not reputable. Which of the following journals recently created a checklist to help get you started?

You are a toxicologist and wish to publish your recent research concerning the effects of endocrine disruptors on thyroid function. How do you find a reputable journal to publish your findings?

What was the general reaction to open access when it began about 10 years ago?

How do many of these predatory journals find articles to publish and people to serve on their editorial boards?

Current Events: With Shovels and Science, a Grim Story Is Told (New York Times, 3/24/2013)

In which of the following ways are you most likely to contract cholera?

You specialize in matching dental records of missing people to teeth and jaws of criminal cases. What are you?

What was used in the attempt to control the cholera outbreak at the shanty?

Why did this team hire a geophysicist?

Which of the following led researchers to send John Ruddy's remains back to Ireland?

Current Events: Focusing on Fruit Flies, Curiosity Takes Flight (New York Times, 10/7/2013)

You are an entomologist specializing in the vision of the anthomyiid flies. What is your area of expertise?

Your cousin is a member of a team of scientists studying different species of fruit flies all over the world. Which continent should the team avoid?

Scientists studying hummingbird flight use the same technique as those studying fruit flies. What do they use?

Which of the following is necessary for fruit flies to fly?

Your friend is an entomologist studying the response to potential predators in the common cricket. To determine if the response is a reflex or a decision, which of the following body parts should he focus on?

Current Events: Image of Hindenburg Haunts Hydrogen Technology (New York Times, 10/29/2013)

Why are so many people concerned with using hydrogen as a fuel?

You start a job working as a specialist for Ore Design and Technology. What are you trying to use to obtain hydrogen to use as a fuel?

You purchase a hydrogen fuel-cell vehicle. Which of the following will need to be present for your car to work?

At this time, what state is leading the way for hydrogen fuel-cell vehicles?

Cars that have hydrogen fuel-cells run on which of the following?

Activity: The Levels of Life Card Game

An organ, such as the liver, is composed of _____.

*Organs are composed of two or more different types of tissues.

Which of these is an organ system?

*The digestive system is composed of structures such as the stomach and small intestine.

What are the two main types of cells?

prokaryotes and eukaryotes

*Prokaryotic cells lack the nucleus and other organelles found in eukaryotic cells.

Activity: Heritable Information: DNA

DNA is composed of building blocks called _____.

*DNA is composed of nucleotide units.

In eukaryotic cells DNA has the appearance of a _____.

*Eukaryotic DNA is organized as a double helix.

molecule, organelle, cell, tissue, organ, organ system, organism, population, community, ecosystem

*Each level of biological structure builds on the level before it.

*A molecule such as a protein has attributes not exhibited by any of its component parts (e.g., amino acids). Therefore, novel properties are emerging that were not present at a simpler level of organization.

Which of the following statements is true about chemical nutrients in an ecosystem?

They depend on sunlight as their source.

They recycle within the ecosystem, being constantly reused.

They exit the ecosystem in the form of heat.

They flow through the system, losing some nutrients in the process.

They cannot be obtained from decomposition.

They recycle within the ecosystem, being constantly reused.

*Nutrients cycle through the ecosystem by processes such as the decomposition of organic debris.

the use of DNA as the information storage molecule

All cells (discovered so far) use DNA to store information.

Which of the following statements is FALSE regarding the complexity of biological systems?

An ecosystem displays complex properties not present in the individual communities within it.

An understanding of the interactions between different components within a living system is a key goal of a systems biology approach to understanding biological complexity.

Understanding the chemical structure of DNA reveals how it directs the functioning of a living cell.

Knowing the function of a component of a living system can provide insight into its structure and organization.

*Plants and certain algae are multicellular photosynthetic organisms included in the kingdom Plantae of the domain Eukarya.

Organisms typically produce too many offspring, and resources are limited.

*Resource competition is one of the main ingredients for natural selection. Organisms must compete for limited resources, and only the best adapted will survive and reproduce.

Which of the following is an example of "unity in diversity"?

All organisms, including prokaryotes and eukaryotes, use essentially the same genetic code.

The forelimbs of all mammals have the same basic structure, modified for different environments.

The structure of DNA is the same in all organisms.

All of the above are correct.

All of the above are correct.

*These are all examples of unity in diversity.

To understand how the scientific method can be used to search for explanations of nature.

The scientific method is a procedure used to search for explanations of nature. The scientific method consists of making observations, formulating hypotheses, designing and carrying out experiments, and repeating this cycle.

Observations can be either quantitative or qualitative. Quantitative observations are measurements consisting of both numbers and units, such as the observation that ice melts at 0 °C. In contrast, qualitative observations are observations that do not rely on numbers or units, such as the observation that water is clear.

A hypothesis is a tentative explanation of the observations. The hypothesis is not necessarily correct, but it puts the scientist's understanding of the observations into a form that can be tested through experimentation.

Experiments are then performed to test the validity of the hypothesis. Experiments are observations preferably made under conditions in which the variable of interest is clearly distinguishable from any others.

If the experiment shows that the hypothesis is incorrect, the hypothesis can be modified, and further experiments can be carried out to test the modified hypothesis. This cycle is repeated, continually refining the hypothesis.

If a large set of observations follow a reproducible pattern, this pattern can be summarized in a law—a verbal or mathematical generalization of a phenomenon. For example, over the years people observed that every morning the sun rises in the east, and every night the sun sets in the west. These observations can be described in a law stating, "The sun always rises in the east and sets in the west."

After a great deal of refinement, a hypothesis can lead to a theory. A theory is an explanation of why something happens. For example, Newton's theory of gravitation explains why objects tend to fall toward the Earth (as well as explaining the interactions between the Earth and the other planets, etc). However, theories can still be further refined or even replaced. Einstein's theory of general relativity was able to better explain certain astronomical observations related to gravity, and therefore it replaced Newton's theory of gravitation (although Newton's theory still holds true under most everyday conditions). Similarly, the geocentric theory (that the Earth is the center of the universe) was replaced by the heliocentric theory (that the Earth revolves around the sun) based on further observations and testing of predictions. Note that a scientific theory is not the same as the popular definition of a theory—namely, a "guess" or "speculation." Instead, a theory is an explanation that can hold up against repeated experimentation. It may not be perfect, but it is the best explanation possible based on available evidence.

In the course of a conversation, you observe that three of your friends like horror movies. Horror movies happen to be your favorite type of movie as well. You also know that all of these friends were born in the same week that you were, even in the same year.

An astrology-loving friend hypothesizes that people born in that week like horror movies more than other genres of movies. You decide to use the scientific method to test this hypothesis.

Which of the following experiments would best test your hypothesis?

You want to be as careful as possible that the variable of interest--namely, favorite movie genre--is clearly distinguishable from any other variables. To do so, first you must be careful to find a random sampling of people who share your birth week, avoiding simply talking to friends with whom you share common interests. Second, you need to provide your subjects with a questionnaire on which they are asked to circle their favorite genre from a list, so that you are not tempted to interpret their answers in your favor. You must be certain not to tell them what you are seeking to prove or disprove; that way, their answers will not be influenced by your stated goal. You must also make the surveys anonymous to ensure that your subjects aren't simply giving you the answers they think you want them to give.

After finding a random sample of 10 people born in the same week as you and your friends, you obtain these results from their questionnaires:

4 of them prefer comedies,
3 of them prefer dramas,
2 of them prefer action movies, and
1 of them prefers westerns.

As a control, you also interview 14 random people with birthdays throughout the year. You obtain results similar to the results of your experimental group and your friends:

3 of them prefer comedies,
4 of them prefer dramas,
3 of them prefer action movies, and
1 of them prefers westerns.
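
One way to compare the two groups quantitatively is a chi-square test of independence on the genre counts. A minimal pure-Python sketch, with the caveat that samples this small fall short of the test's usual validity assumptions:

```python
def chi_square(obs_a, obs_b):
    """Chi-square statistic for a 2 x k contingency table,
    given the two rows of observed counts."""
    total_a, total_b = sum(obs_a), sum(obs_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(obs_a, obs_b):
        col = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Counts in order: comedies, dramas, action, westerns
same_week = [4, 3, 2, 1]   # people born the same week as you
random_bd = [3, 4, 3, 1]   # control group, birthdays year-round
print(round(chi_square(same_week, random_bd), 3))  # 0.439
```

The tiny statistic reflects what the survey text says: the two groups' genre preferences look essentially alike, so the astrology hypothesis finds no support.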

GraphIt!: An Introduction to Graphing

For the graph of data on stem density and snowshoe hare density in Step 6, which statement best summarizes the trend shown in the graph?

As stem density increases from about 35,000 stems to about 55,000 stems per hectare, what is the increase in snowshoe hare density?

Increasing density of tree and bush stems has a positive effect on snowshoe hare abundance.

Areas with more than 100,000 stems per hectare should have hare densities approaching 4 hares per hectare.

Greater abundance of tree and bush stems results in higher birth rates for snowshoe hares.

For the graph in Step 7 showing fish body lengths and the percentage of tern diets that they comprise, which statement best summarizes the trend shown in the graph?

For what sizes of fish is there the least amount of overlap in the diets of these two tern species?

For the sooty tern, there is a steady decrease in the percent of diet as fish length decreases from 6 cm to 0 cm.

The sooty tern consumes more fish than the blue-grey noddy tern.

Even though they live in the same location, these two species probably experience minimal competition for food.

Since the points don't form a straight line, it would have been better to draw this as a scatter plot.

It would have been reasonable to place stream flow on the X-axis instead of year.

Stream flow rates have apparently been decreasing worldwide since 1966.

Between 1966 and 1995, what has been the approximate decrease in stream flow rates in Monteverde?

Activity: Introduction to Experimental Design

Which of the following statements is not true of scientific experiments?

In an experiment, investigators try to control all of the variables except one—the one that tests the hypothesis. Which of the following reasons is the primary rationale for controlling variables in an experiment?

To eliminate alternative explanations for the results of an experiment

Controlling all variables but one ensures that some other factor is not responsible for the results obtained from an experiment.

Which of the following statements could not be supported or rejected by a scientific experiment?

The first living cell on Earth came from outer space

An experiment could not be designed to test this statement. Science neither supports nor rejects this idea.

Which of the following statements is true of a hypothesis?

A hypothesis can be supported or rejected through experimentation.

A hypothesis is supported or rejected based on the outcome of one or more experiments.

Which of the following variables did Pasteur change in his experiment to test the hypothesis of spontaneous generation?

By using a swan-necked flask for the experimental treatment, Pasteur ensured that no cells were entering the flask from the air. Thus, any organisms that appeared in the experimental flask would have arisen spontaneously.

In Pasteur's experiment to test the hypothesis of spontaneous generation, why did he boil the broth in both flasks?

To kill any existing organisms in the broth

Pasteur boiled the broth to kill any existing organisms, thus ensuring that the conditions in each flask were identical (i.e., lacking organisms) at the start of the experiment.

What results from the Zonosemata experiment support the sub-hypothesis that wing waving alone reduces predation by jumping spiders?

Zonosemata flies with housefly wings are attacked less frequently.

This experimental group tests the effects of wing waving alone.

Suppose that Zonosemata flies whose own wings had been clipped and reattached were attacked more frequently than untreated Zonosemata flies. How would this result affect the reliability of the other experimental results?

All results for the experimental groups involving wing surgery would be invalid.

This result suggests that the presence or absence of wing surgery itself may affect the jumping spider's responses. Thus, there is not enough information to draw conclusions from the data because there is an alternative explanation for the results of the experiment.

Most species in the insect order Orthoptera (crickets, grasshoppers, locusts) produce a song by rubbing their wings or legs against each other. In most of the species that sing, only the male produces a song.

Crickets are a common example; their songs are a familiar night sound in most parts of the continental United States. Some crickets produce a song that is continuous for several seconds or more, while others break their song into a sequence of chirps, typically with 10-50 chirps per minute.

Based on the observation that only male crickets produce a song, you hypothesize that a male's song is a form of communication to potential mates.

You set up a simple experiment to test this hypothesis. In the laboratory, you place a male snowy tree cricket in enclosure A, which is adjacent to enclosure B. In enclosure B, you place other insects, one at a time, and observe their responses to the male's song.

The enclosures are designed so that the two insects being tested cannot see or smell each other, but sound is transmitted from enclosure A to enclosure B.

For each insect below, indicate whether it is part of an experimental group or a control group when placed in enclosure B. Labels may be used once, more than once, or not at all.

male snowy tree cricket - control group

female snowy tree cricket - experimental group

female field cricket - control group

*In this experiment, your hypothesis is that the male's song communicates information to potential mates. The experimental group provides a direct test of the hypothesis. Female snowy tree crickets compose the experimental group because they are the potential mates of male snowy tree crickets and therefore would be expected to uniquely respond to the male's song. Failure of the experimental group to respond to the male's song would require you to reject your hypothesis.

In addition, control groups test other factors that might influence the experimental outcome. In this experiment, the control groups test whether the male's song functions in frightening potential competitors for mates (other male snowy tree crickets) or competitors for food (female field crickets). If one or both of the control groups respond to the male snowy tree cricket's song, you would have to reject your hypothesis.

In the actual experiment, the female snowy tree crickets turn toward the male in response to his song, but the control groups move randomly in response to the song. These results support your hypothesis.

In a controlled experiment, the treatment or evaluation of the experimental group directly tests the hypothesis. In this experiment, the hypothesis is that the male's song is a form of communication to potential mates.

Which of the following groups are potential mates of the male snowy tree cricket?

female snowy tree crickets

In field experiments or experiments with living organisms, it is often not possible to control (keep constant) every variable except the one being tested. In these cases, controls often serve the more general function of testing alternate explanations for experimental results.

In this experiment, some alternate explanations for why the other crickets may respond to the male's song include:

The response of male crickets to the songs of other males reduces competition for potential mates.
The response of female crickets to the songs of other crickets (regardless of sex or species) minimizes competition for food.
Crickets respond to any sounds that resemble cricket songs.

The control groups in this experiment test whether the male's song serves a function other than communicating with potential mates. If one or both of the control groups respond to the male snowy tree cricket's song, you would have to reject your hypothesis.

There are many reports that the number of chirps per minute that a cricket produces is correlated with the ambient temperature. Your class decides to test this hypothesis by collecting several males from two species of crickets: the snowy tree cricket (Oecanthus fultoni) and the common field cricket (Gryllus pennsylvanicus). In the laboratory, you measure the chirp rate of each cricket at four different temperatures. The data are shown in the table below.

Average chirp rate (chirps per minute):

Temperature (°C)   Snowy tree cricket   Common field cricket
20                 108                  82
30                 128                  100
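
The reported correlation can be checked with a least-squares line through the class data. A sketch, assuming the left-hand values are temperatures in °C and the rates are chirps per minute:

```python
def fit_line(points):
    """Ordinary least-squares fit y = m*x + b over (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    m = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return m, my - m * mx

snowy = [(20, 108), (30, 128)]   # (temperature, chirps per minute)
field = [(20, 82), (30, 100)]
print([round(v, 6) for v in fit_line(snowy)])  # [2.0, 68.0]
print([round(v, 6) for v in fit_line(field)])  # [1.8, 46.0]
```

Both species show a positive slope (about 2 extra chirps per minute per °C), consistent with the hypothesized temperature dependence.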


Abstract

High throughput sequencing technologies have become essential in studies on genomics, epigenomics, and transcriptomics. Although sequencing information has traditionally been elucidated using a low throughput technique called Sanger sequencing, high throughput sequencing technologies are capable of sequencing multiple DNA molecules in parallel, enabling hundreds of millions of DNA molecules to be sequenced at a time. This advantage allows high throughput sequencing to be used to create large data sets, generating more comprehensive insights into the cellular genomic and transcriptomic signatures of various diseases and developmental stages. Within high throughput sequencing technologies, whole exome sequencing can be used to identify novel variants and other mutations that may underlie many genetic cardiac disorders, whereas RNA sequencing can be used to analyze how the transcriptome changes. Chromatin immunoprecipitation sequencing and methylation sequencing can be used to identify epigenetic changes, whereas ribosome sequencing can be used to determine which mRNA transcripts are actively being translated. In this review, we will outline the differences in various sequencing modalities and examine the main sequencing platforms on the market in terms of their relative read depths, speeds, and costs. Finally, we will discuss the development of future sequencing platforms and how these new technologies may improve on current sequencing platforms. Ultimately, these sequencing technologies will be instrumental in further delineating how the cardiovascular system develops and how perturbations in DNA and RNA can lead to cardiovascular disease.

Introduction

Until the discovery of retroviruses, the central dogma of molecular biology stated that genes are transcribed to make RNA and in turn RNA is translated into protein. 1,2 This dogma outlines how the variable expression of genes can dynamically control cellular functionality and identity from a single genome. Gene expression is dynamically controlled, and variations in transcription and translation can result in major functional changes within the cell. If the underlying DNA sequence is mutated or if the downstream message is changed during transcription and translation, cellular function may be compromised, leading to various disease pathologies. Although our environment affects the manifestation of disease, many diseases also have a strong underlying genetic component. Diseases that have a stronger genetic component than environmental component may include those that surface at birth (congenital diseases) and those that run in families (familial inheritance diseases). To determine how one’s genetic background contributes to disease, a large collection of genomic and transcriptomic data sets is required. By sequencing multiple genomes, it is therefore possible to evaluate human genomic diversity, as demonstrated by the 1000 Genomes Project. 3,4 In addition, the ENCyclopedia Of DNA Elements 5 and HapMap project 6 have used many of the high throughput sequencing (HTS) applications outlined below to understand the functional attributes of each region of the genome. With advancements in HTS technologies, sequencing costs have dramatically decreased, and it may soon be possible to sequence the entire human genome for ≤$1000. 7 As the price of sequencing decreases, sequencing may become commonplace, which will vastly contribute to our understanding of genomic variability and how this variability may increase one’s susceptibility to develop cardiovascular diseases.
Ultimately, by lowering sequencing costs and, in turn, making sequencing technologies mainstream, the implementation of HTS technologies will be invaluable in determining the molecular pathways involved in cardiovascular development and disease.

First-Generation Sanger Sequencing

DNA sequencing information has traditionally been elucidated using Sanger sequencing. 8 This technique was developed by Dr Sanger, who was subsequently awarded the 1980 Nobel Prize in Chemistry. 9 In this method, a complementary strand of DNA is made from the input template DNA using a mixture of 2′-deoxynucleotides that includes 2′,3′-dideoxynucleotides labeled with fluorescent dyes. 8 2′,3′-Dideoxynucleotides are nucleotides that lack the 3′-OH group required for cDNA elongation; when a 2′,3′-dideoxynucleotide is incorporated into the elongating DNA strand, elongation is terminated, resulting in the generation of multiple DNA fragment sizes. These fragments are separated by size using single base-pair resolution capillary electrophoresis, yielding an electropherogram that serves as a direct read-out of the nucleotide sequence of the original template molecule. 10 Sanger sequencing can have an average read length of 800 base pairs, but it is limited by the amount of DNA that can be processed at a given time. To address the low throughput, newer sequencing technologies have been developed that can read the sequence of multiple DNA molecules in parallel. Parallel capillary systems greatly increased the number of DNA strands that could be analyzed 11,12 because 1 to 6 MB of DNA sequence could be acquired per day in a standard 96-capillary instrument. 13 However, parallel capillary-based systems are still limited by the number of capillary columns that can be processed at a given time. Because the human genome consists of ≈3 GB, containing ≈20 000 genes that span 45 MB (1.5% of the whole genome), 14–16 it took the Human Genome Project over a decade and billions of dollars to complete using Sanger sequencing. 17,18

Key HTS Platforms

Commercially available sequencing platforms are expanding the potential of sequencing by exponentially increasing the throughput of their technologies. Although many sequencing platforms are available, Illumina’s platforms (http://www.illumina.com/) have dominated much of the sequencing industry (Figure 1). 19 Illumina’s bridge amplification method allows for the generation of small clusters with an identical sequence to be analyzed. Clusters formed on an Illumina flow cell permit multiple primer hybridization steps, allowing multiple sequencing start points. This allows the sequencing of both ends of the original template molecule, known as paired-end sequencing. Paired-end reads play an important role in Illumina’s technology by increasing the output from a sequencing run, identifying splice variants in RNA sequencing (RNA-seq), and deduplicating (removing duplicate copies of) reads originating from the same original template molecule. Paired-end reads are also important for identifying large structural variants, such as inversions from whole-genome sequencing, which go unnoticed with short sequencing techniques. A third, separate read may also be used to separate out samples, provided each sample in the sequencing library had a unique barcode engineered into the adapter construction. In Illumina sequencing, however, all 4 nucleotides are available during incorporation, which can lead to an overall substitution error rate of 0.11%. 20
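The barcode-based sample separation described above can be sketched as follows. This is a minimal, hypothetical demultiplexing routine (the barcode sequences and one-mismatch tolerance are invented for illustration): each index read is compared against the known sample barcodes and routed to the closest match.

```python
# Hypothetical sketch of index-read demultiplexing: each read is routed to its
# sample by matching the barcode (index) read against the known sample
# barcodes, tolerating one sequencing error. Barcodes here are invented.

SAMPLE_BARCODES = {"ATCACG": "sample_A", "CGATGT": "sample_B"}

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(index_read, max_mismatches=1):
    """Assign an index read to a sample; None if no barcode is close enough."""
    for barcode, sample in SAMPLE_BARCODES.items():
        if hamming(index_read, barcode) <= max_mismatches:
            return sample
    return None
```

Allowing a single mismatch recovers reads whose index was miscalled in one cycle while keeping barcodes unambiguous, which is why real barcode sets are designed with large pairwise distances.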

Figure 1. Overview of DNA sequencing using the Illumina platform. In next-generation DNA sequencing, DNA is first fragmented into smaller input-sized fragments by enzymes or by sonication. The ends of these fragments are repaired and specific adapters are ligated to the ends of the fragments, allowing hybridization to a flow cell to occur. A bridge amplification step is performed to create a cluster of fragments with the same sequence. One strand of DNA is removed and fluorescently labeled nucleotides are passed over each cluster. An image of the flow cell is recorded for the first cycle and a computer determines which nucleotide was incorporated at each cluster’s coordinates. The fluorescent label is cleaved and a second round of fluorescently labeled nucleotides is passed over each cluster. Again, the incorporated nucleotide is recorded (cycle 2), and successive cycles yield the sequence of each fragment (a read). These reads are then aligned to a reference genome. By assembling reads (merging short reads together), it is possible to reconstruct the unfragmented original sequence.

Both Ion Torrent (http://www.iontorrent.com/) and 454 (http://www.454.com/) use polymerase chain reactions to amplify DNA within an emulsified droplet (Figure 2). Sequencing information is correlated with the detection of either light (in 454) or hydrogen ions (in Ion Torrent) during each nucleotide incorporation event. If multiple incorporation events occur in one cycle, this is interpreted as a stretch of a particular nucleotide in the sequence (a homopolymer). All HTS technologies have difficulties sequencing homopolymers, but sequencing homopolymers in Ion Torrent and 454 is more problematic because the nucleotides used lack a blocking moiety, resulting in entire homopolymers being incorporated during 1 cycle. Difficulties in interpreting homopolymers arise because these homopolymer signals are nonlinear and have a Poisson distribution. As a result, homopolymers of just 2 or 3 nucleotides are sometimes contracted or expanded.
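The flow-based base calling just described can be illustrated with a toy model. Under the (idealized) assumption that the signal in each flow is proportional to the number of incorporated nucleotides, the homopolymer length is estimated by rounding the signal; real signals are noisy and nonlinear, which is exactly why homopolymers get contracted or expanded. The flow order and signal values below are invented.

```python
# Toy model of flow-based base calling (Ion Torrent/454 style): in each flow a
# single nucleotide species is offered, and the measured signal is roughly
# proportional to the number of incorporations, so an entire homopolymer is
# read out in one flow. Signals here are idealized; real ones are noisy.

FLOW_ORDER = "TACG"  # hypothetical repeating flow order

def call_from_flows(signals, flow_order=FLOW_ORDER):
    """Convert per-flow signal intensities into a base sequence by rounding
    each signal to the nearest whole number of incorporations."""
    seq = []
    for i, signal in enumerate(signals):
        n = round(signal)  # estimated homopolymer length for this flow
        seq.append(flow_order[i % len(flow_order)] * n)
    return "".join(seq)
```

A noisy signal of 2.9 is confidently called as a 3-mer, but a true 3-mer measured at 2.4 would be contracted to a 2-mer, mirroring the error mode described in the text.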

Figure 2. Ion Torrent and 454 sequencing. Both Ion Torrent and 454 immobilize DNA fragments onto beads. In both platforms, template molecules are first immobilized on a bead, which is emulsified so that subsequent amplification can occur clonally within the droplet. After clonal amplification, enrichment for DNA-positive beads is performed using additional beads that can bind and isolate the available end of the library molecule, thus removing DNA-negative beads. Enriched beads are deposited at the bottom of a well and sequencing is performed by flowing 1 base at a time over the templates. In Ion Torrent, an incorporation event is measured by a pH change from the release of protons during incorporation, whereas 454 uses a cascade of reactions triggered by the pyrophosphate released from each incorporation reaction, leading to a photon being released by the enzyme luciferase. Therefore, the detection of light (in 454) or hydrogen ions (in Ion Torrent) when adenine is passed over each chamber is interpreted as thymine being the next nucleotide in the DNA sequence. Amplification of DNA fragments occurs in an emulsion, and each bead is placed into a well just large enough for a single bead. Nucleotides are sequentially passed over each well, where nucleotide incorporation may occur. If nucleotide incorporation occurs in the 454 platform, a series of enzymatic reactions results in light being detected. In Ion Torrent platforms, nucleotide incorporation results in the release of hydrogen ions, which are detected at each well. In Ion Torrent, if homopolymer repeats of the same nucleotide are present (GGG), multiple hydrogen ions will be released, generating a higher electric signal. This is subsequently interpreted as multiple identical nucleotides being present in the sequence.

The Pacific Biosciences Real Time Sequencer (RS) (www.pacificbiosciences.com/) requires that each circular library molecule be bound to a polymerase enzyme as the input for sequencing on their single-molecule real-time sequencing cells. The library/polymerase complex is diffused over the single-molecule real-time cell’s zero-mode waveguides, occasionally allowing a template to occupy the lumen of a zero-mode waveguide. The RS uses video imaging of fluorescent nucleotides pausing at the bottom of the zero-mode waveguides to record an incorporation event. 21 Because the pausing can range from 1 to 3 seconds per incorporation event and nucleotides can diffuse freely into the zero-mode waveguides, insertions are the most common error type. Because of the long read length, it is possible to get multiple passes of sequencing around the same circular library molecule to generate what is called circular consensus sequencing. 22 The RS has seemingly random errors, whereas errors in the other sequencing technologies tend to be more systematic. These random errors, combined with multiple passes over the same circular fragment, generate a relatively low number of high-quality reads, allowing the RS to be used as a cheaper variant validation tool than Sanger sequencing. For true single-molecule sequencing, no amplification should be performed on the sample, avoiding amplification bias. Native DNA contains modified bases, such as 5-methyl cytosine, that can be measured directly on the basis of signature pausing signals with Pacific Biosciences sequencing. For now, the lack of amplification may be the biggest drawback of single-molecule sequencing because some samples simply contain too little material. This highlights the importance of creating new tools that can manipulate smaller reactions and use less input material.
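Why random errors cancel out under circular consensus can be shown with a minimal sketch: aligned passes over the same circular template are combined by per-position majority vote, so an error present in only one pass is outvoted. The pass sequences below are invented, and real consensus algorithms work on alignments rather than pre-aligned equal-length strings.

```python
# Minimal sketch of circular consensus sequencing: multiple noisy passes over
# the same circular template are combined by per-position majority vote, so
# seemingly random errors in individual passes cancel out.

from collections import Counter

def consensus(passes):
    """Majority-vote consensus of equal-length, pre-aligned sequencing passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*passes)
    )
```

With three passes, any single-pass error is outvoted 2 to 1; systematic errors (the same mistake in every pass) would survive, which is why this scheme suits the RS's random error profile in particular.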

Each sequencing instrument is limited in the number of cycles it can record, as well as the accuracy of the recording. In addition, there is a limit on the number of template molecules that can be read on each sequencing cell. Table 1 summarizes some of the main sequencing platforms on the market today in terms of their relative costs, sequencing yields, quality scores, and sequencing times.

Table 1. Comparison of the Current High-Throughput Sequencing Platforms

A cross comparison of Illumina (HiSeq2500, MiSeq), Ion Torrent (PGM 318, Proton I), Pacific Biosciences Real Time Sequencer (PacBio [RS]), and Roche 454 (GS FLX+, GS Junior) high-throughput platforms is presented. The instruments are compared on the specifications provided by the vendors, including the cost, speed, accuracy, primary error type, and size of the data set that can be expected from each instrument. From those specifications, a cost per megabase (MB) index was calculated. Because each instrument needs to be maintained over time, the instrument maintenance cost is also provided by the vendors. GB indicates gigabase; and PacBio RS, Pacific Biosciences Real Time Sequencer.

Common Applications in High-Throughput Sequencing

DNA Sequencing

The order of the DNA sequence and its variation dictates human developmental processes, uniquely identifies each person, and encodes our susceptibility to diseases. 23–25 Using high throughput DNA sequencing (DNA-seq) technologies, it is possible to identify genetic variants that play a role in human health. In whole-genome sequencing, sequencing information of both the exons and the introns is obtained, 15,26,27 which may provide critical information on enhancer regions, promoters, and cis/trans regulatory elements that reside in the intronic regions, along with structural variants, such as copy number variants, inversions, and translocations affecting the exonic regions. To give a confidence level for the accuracy of the sequencing information, the term depth is used to define the average number of times each nucleotide in the genome is observed. 28 For example, if each nucleotide of the genome is observed an average of 10 times during sequencing, a depth of 10× is obtained. Read depth is important for interpreting structural variations because, for a given interval, an increase in the number of reads at a given read depth may indicate an increase in copy number, whereas a decrease in the number of reads may indicate a deletion 29 (Figure 3). In general, as the read depth increases, so does confidence in the sequencing information. An average read depth of 30× is recommended to produce an adequate coverage level (the number of times a nucleotide is reported within an assembled sequence) for whole-genome analysis, and at a 50× read depth, ≈94.9% of single-nucleotide variants (SNVs) can be observed. 30–32 Read depth should be matched to the experimental goals. To observe small changes (ie, point mutations) associated with complex diseases, a high read depth and sequencing of multiple individuals are required. However, observing large structural changes in comparison with a reference genome can be achieved with a very low read depth.
In most cases, increasing read depth is more costly, and alternative, targeted, more cost-effective methods (ie, customized hybridization chips) may be preferred over HTS. In addition, given that in each sequencer run only a limited number of sequence fragments can be read, it may be more cost effective to analyze only the sequence of the genetic material that is transcribed into mRNA (exons), depending on the experimental needs. In this regard, whole exome sequencing uses sequence capture methods to enrich a subset of genomic DNA 33–35 using commercially available capture arrays (Roche NimbleGen, Agilent SureSelect, and Illumina 62 MB). These arrays use a set of binding oligonucleotides complementary to the human exome, bound to magnetic beads. Magnetic bead isolation not only enriches exome DNA, but may also introduce sequencing bias because some capture methods may not uniformly capture target DNA. However, isolating the exome provides relatively inexpensive 100× coverage of the genome-coding regions and may be particularly useful for identifying rare genomic variants from a population of cells.
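The read-depth definition above amounts to a one-line calculation: average depth is total sequenced bases divided by target size. The run parameters below are illustrative, not vendor specifications.

```python
# Back-of-the-envelope read-depth calculation, as defined in the text: average
# depth = total sequenced bases / size of the sequenced target.

def average_depth(num_reads, read_length, target_size):
    """Average number of times each nucleotide in the target is observed."""
    return num_reads * read_length / target_size

# A hypothetical run: one billion 100-bp reads over a 3-Gb human genome
# yields roughly the 30x depth recommended for whole-genome analysis.
depth = average_depth(num_reads=1_000_000_000, read_length=100,
                      target_size=3_000_000_000)
```

The same formula also shows why exome sequencing is cheaper per unit of depth: shrinking `target_size` from 3 Gb to the ≈45-Mb exome raises the depth ≈65-fold for the same sequencing output.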

Figure 3. Detection of single-nucleotide variants (SNVs) and copy number variants (CNVs). Mapping HTS reads against control annotations (an unaffected family member or a reference annotation) is used to identify changes ranging from single nucleotides up to large structural DNA changes. Discrepancies between the reference annotation and the mapped sequence can be called to annotate SNVs. More reads mapped (ie, a higher read depth) results in a higher confidence level for the called SNV in question. If a disproportionate number of reads is mapped to a gene for a given read depth, this region of DNA may be interpreted as a CNV.

Bioinformatics analysis of DNA sequencing has come a long way since the original chromosome walking technique used at the beginning of the Human Genome Project. Shotgun sequencing was developed to use high throughput short-read technologies to assemble large genomes de novo. Large contiguous regions (contigs) are assembled from shorter ones by using overlapping regions to link contigs, or reads, and the overhangs are used to extend the contig. Once a reference genome has been assembled for a species (or sometimes an individual), alignment against the reference is possible. For example, one could align reads to the reference genome (using programs, such as Burrows-Wheeler Alignment, 36 Short Oligonucleotide Analysis Package 3, 37 and BLAT-like Fast Accurate Search Tool 38 ) or call SNVs (using Genome Analysis Toolkit, 39 Mapping and Assembly with Qualities, 40 or Sequence Alignment/Map tools 41 ) and compare whole genomes through consensus, with some flexibility allowed for variants/differences from a reference. Difficulties in DNA sequencing analysis lie in obtaining coverage in regions of extreme GC/AT content, discerning sequencing and amplification errors from actual variants (especially in the case of heterozygous variants and rare variants from a population of cells), and using short reads to assemble large repeats and large structural variants, such as inversions. In addition, errors in mapping short reads can also occur given ambiguous, highly repetitive genomic regions and highly homologous gene families. To overcome these difficulties, newer software algorithms, longer read lengths, and higher coverage will be needed.
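The overlap-based contig extension described above can be sketched in a few lines. This is a deliberately simplified, exact-match version (real assemblers use overlap or de Bruijn graphs and tolerate sequencing errors); the reads and minimum-overlap threshold are invented for illustration.

```python
# Toy illustration of overlap-based contig extension in shotgun assembly: two
# reads are merged when a suffix of one exactly matches a prefix of the other,
# and the overhang extends the contig. Real assemblers use graph structures
# and tolerate errors; this sketch only shows the principle.

def merge(read_a, read_b, min_overlap=3):
    """Merge read_b onto read_a using the longest exact suffix/prefix overlap;
    return None if no overlap of at least min_overlap exists."""
    for k in range(min(len(read_a), len(read_b)), min_overlap - 1, -1):
        if read_a.endswith(read_b[:k]):
            return read_a + read_b[k:]
    return None
```

This also illustrates why large repeats defeat short reads: when a repeat is longer than the read, many reads share the same overlap and the merge becomes ambiguous.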

HTS has the potential to play an important role in cardiovascular research because many diseases have an unknown underlying genetic component. For example, arrhythmogenic right ventricular dysplasia (ARVD)/cardiomyopathy is caused by mutations associated with desmosomes. However, nondesmosomal mutations in transforming growth factor β 3, ryanodine receptor 2, and transmembrane protein 43 have also been implicated in the disease phenotype of ARVD. 42 Although mutations in multiple genes have been identified to cause ARVD/cardiomyopathy, only in 50% to 60% of ARVD/cardiomyopathy patients could an underlying genetic mutation be found (reviewed by Murray 43 ). In addition, some clinical presentations of ARVD/cardiomyopathy are very similar to Brugada syndrome (predominantly exhibited by men, associated with familial inheritance, and exhibits idiopathic ventricular fibrillation). 44 Histopathologic or advanced imaging modalities are required to distinguish between these 2 diseases. 43 Whole-genome and exome sequencing will lead to the discovery of previously unknown mutations that cause cardiovascular diseases and aid in the distinction between diseases that share very similar clinical presentations.

High throughput DNA sequencing will be instrumental in the screening and diagnostics of heart diseases related to larger structural genomic changes, such as Down syndrome, 45 DiGeorge syndrome, 46 4q-syndrome, 47 and 8p-syndrome, 48 as well as complex diseases related to copy number variants 49 and single-nucleotide changes (single-nucleotide polymorphisms [SNPs], SNVs, and mutations). SNVs are variable regions of the DNA in which single-nucleotide differences have been identified in the genetic code, whereas a SNP is a variant that appears with a >1% minor allele frequency in the population. 15,50 These observed polymorphisms may help predict the susceptibility of a patient cohort to develop heart disease. This is exemplified in the study by Matkovich et al, 51 where pooled sequencing data from 4 cardiac signaling genes identified a greater representation of specific SNPs within the cardiovascular heat shock protein gene (heat shock protein family, member 7 [HSPB7]) from patients with heart failure. Although 1 SNP was found to be within an intron of HSPB7, no differences were observed in the splicing or mRNA levels of this gene. Sequencing the adjacent renal chloride channel, voltage-sensitive Ka (CLCNKA) gene, however, identified a SNP in an exon of this gene, which demonstrated linkage disequilibrium with the intronic SNP in HSPB7. Further functional characterization demonstrated an ≈50% loss-of-function of the variant chloride channel. 52 In summary, HTS performed in these studies led to the identification of a common genetic risk factor for heart failure.

Given that SNV analysis can determine which genetic regions may influence a patient’s susceptibility to develop heart disease, SNV analysis could also be used to determine which drug therapy is best suited for a particular patient. For instance, warfarin is an anticoagulant drug often prescribed to prevent thrombosis, and common SNPs in the cytochrome P450, family 2, subfamily C, polypeptide 9 (CYP2C9) and vitamin K epoxide reductase complex, subunit 1 (VKORC1) genes have been suggested to successfully predict a patient’s response to the anticoagulant effects of warfarin. 53,54 Further clinical studies will be required to warrant the use of SNP data to predict warfarin response. In addition, SNP analysis is being used to identify which SNPs are cardioprotective versus cardiotoxic with respect to the effects of doxorubicin. 55–57 Future SNP analysis studies will be important for optimizing patient-specific treatment with existing cardiovascular drugs and for determining the effectiveness and safety of drugs under development. 58

Chromatin Immunoprecipitation Sequencing

Gene expression can be influenced by epigenetic modifications, which can be assessed by chromatin immunoprecipitation sequencing (ChIP-seq). DNA in the nucleus is divided into actively transcribed regions called euchromatin and transcriptionally silenced regions called heterochromatin. 59 These regions represent loosely or tightly compacted DNA, and these different states are influenced by histone protein modifications. 60,61 Histone acetylation and methylation are 2 such modifications, and depending on the histone modification, genes may be actively transcribed or repressed. For example, the H3K27Me3 modification represses gene expression, 61 whereas the H3K4Me3 modification enhances gene activity. 62 By performing chromatin immunoprecipitations with antibodies toward various histone modification states and sequencing the resulting immunoprecipitated DNA, it is possible to assess which regions of DNA may be actively transcribed or transcriptionally silent.

In ChIP-seq, formaldehyde is first used to covalently bond DNA to the proteins with which it is interacting (Figure 4). The DNA–protein complex is fragmented, and immunoglobulins specific for the protein of interest are used to pull down the fragments of DNA to which that protein is attached. 63 From here, the target DNA is isolated and a sequencing library is made using a standard library preparation method. Sequencing of a ChIP-seq library generates reads that align near the genomic regions associated with the target protein. Controls include a negative control antibody library and an input DNA library. Given that some antibodies are better than others at pulling down the target protein–DNA interactions, the use of a ChIP-certified antibody greatly improves the signal-to-noise ratio in downstream analysis. In addition, under- or over-crosslinking with formaldehyde, as well as under- or over-shearing of the DNA, can affect downstream ChIP-seq analysis. Bioinformatic analysis of ChIP-seq data involves mapping reads to a reference genome and using peak detection software (currently >31 open-source programs) 11 to identify regions with enriched mapping frequency. Therefore, peak detection after immunoprecipitation of the H3K27Me3 and H3K4Me3 histone modification states can provide information on which regions of the DNA are in an active or open chromatin state versus transcriptionally silent.
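The peak-detection step can be illustrated with a deliberately simplistic sliding-window sketch: windows where ChIP coverage is enriched over the matched input control are flagged. The coverage values, window size, and fold threshold are invented; real peak callers model the background statistically rather than using a fixed fold cutoff.

```python
# Simplistic sketch of ChIP-seq peak detection: slide a window over per-base
# read coverage and flag windows where ChIP coverage is enriched over the
# input-control coverage. Thresholds are invented for illustration only.

def call_peaks(chip, control, window=5, fold=3.0):
    """Return start positions of windows where summed ChIP coverage is at
    least `fold` times the matched input-control coverage."""
    peaks = []
    for start in range(0, len(chip) - window + 1):
        chip_sum = sum(chip[start:start + window])
        ctrl_sum = sum(control[start:start + window]) or 1  # avoid div by 0
        if chip_sum / ctrl_sum >= fold:
            peaks.append(start)
    return peaks
```

The input-control library mentioned in the text is what makes `control` meaningful here: without it, regions that simply fragment or map more easily would be mistaken for binding sites.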

Figure 4. Chromatin immunoprecipitation for chromatin immunoprecipitation sequencing (ChIP-seq). To determine sequences of DNA that may be open for transcription, ChIP-seq uses the immunoprecipitation of chromatin bound to different histone modification states (transcriptionally silent [H3K27me3] and transcriptionally active [H3K4Me3] markers). Chromatin is first crosslinked to the histones using formaldehyde, and the chromatin is sheared into smaller fragments. Immunoglobulins specific for each histone modification state are incubated in separate reaction tubes, and magnetic or Sepharose beads known to bind immunoglobulins are used to isolate the bound immunoglobulin/chromatin complexes. Chromatin is separated from the histones by reverse crosslinking (high salt conditions) and protein digestion (proteinase K). DNA libraries are then made from the isolated DNA, and these DNA fragments are sequenced using one of the high throughput sequencing platforms.

An additional important application of ChIP-seq is to determine which genes are bound by transcription factors and which enhancer regions are active in the heart. In a study by Blow et al, 64 ChIP-seq of the transcriptional coactivator protein p300 on embryonic day 11.5 heart tissue was used to elucidate which enhancer regions are active in the developing mouse heart. ChIP-seq can also be used to determine how aberrations in transcription factor binding can disrupt gene expression and lead to cardiovascular diseases. Given that NK2 homeobox 5 (NKX2-5) mutations have been shown to cause hypoplastic left heart syndrome, atrial septal defects, and patent foramen ovale, 65 immunoprecipitation of control and mutant NKX2-5 could be used to identify how NKX2-5 mutations affect transcription.

Gene and protein interactions can also be identified using chromosome conformation capture sequencing. Chromosome conformation capture sequencing is important for identifying functional associations among distal chromosomal regions (such as enhancers). 66,67 These DNA interactions can be determined by first crosslinking DNA/protein complexes using formaldehyde and then using restriction enzymes to digest the DNA into smaller fragments, leaving the crosslinked DNA fragments connected. The DNA is then intramolecularly ligated and reverse crosslinked with heat. From here, adaptors can be ligated on to generate a sequencing library. In a study by Korostowski et al, 68 chromosome conformation capture sequencing was used to demonstrate how changes in chromosomal interactions occur with the promoter of the potassium voltage-gated channel, KQT-like subfamily, member 1 (Kcnq1) gene, and interactions with the Kcnq1 promoter were demonstrated to influence the transition from monoallelic to biallelic expression of this gene during development of the heart. Future studies using chromosome conformation capture sequencing will be important not only for determining which distal chromosomal regions interact but also for explaining how these interactions occur temporally or in a tissue-specific manner. By delineating which regions of DNA interact spatially, a deeper and more complete understanding of the mechanisms causing cardiovascular disease can be achieved.

Methylation Sequencing

Methylation of DNA is another epigenetic modification that can influence gene expression, and the methylation status of DNA can be determined by methylation sequencing. 5-Methyl cytosine is the most common modified base in humans, and methylation of cytosine generally occurs where a cytosine neighbors a guanine nucleotide, at sites called cytosine-guanine dinucleotides (CpGs). 24 Areas of the genome high in CpG concentration have increased methyl transferase activity and may be referred to as CpG islands. Methylation at CpG islands decreases the activity of promoters and generally decreases gene expression. DNA methylation at CpG islands has been shown to strongly suppress promoter activity, seems to occur as a function of age, causing loss-of-function phenotypes, and may be a target in many disorders, including heart disease. 69 One method to sequence methylated regions of DNA (methylation sequencing) involves first isolating DNA, fragmenting the DNA, and then separating this sample into 2 reactions. One reaction is treated with bisulfite and the other portion is left untreated. Bisulfite treatment changes cytosine nucleotides to uracil nucleotides, leaving methylated-cytosine nucleotides unchanged because they are resistant to bisulfite treatment. 70

Methylation sequencing data analysis involves sequencing the bisulfite-treated DNA and comparing it with the fraction that did not undergo bisulfite treatment; the differences identify the regions that did not convert to uracil. Because a uracil is read as a thymine during sequencing, any cytosine still read as cytosine is reported as a methylation site. Incomplete bisulfite conversion can be problematic because unconverted regions will be falsely detected as methylated sites. One method to determine whether bisulfite conversion went to completion is the addition of spike-in control DNA in which the methylation status is known. Methylation-rich regions can also be immunoprecipitated using an antimethylated cytosine antibody. In a study by Movassagh et al, 71 immunoprecipitation of methylated regions showed that genome-wide DNA methylation patterns were quite similar between control and end-stage cardiomyopathy hearts. However, differences in methylation could be observed when analyzing the methylation pattern at the single-gene level. Identification of the gene promoters that are hypermethylated versus hypomethylated may therefore be useful in predicting which genes will become active during various stages of heart disease.
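The bisulfite logic above reduces to a simple per-base comparison, sketched here under idealized assumptions (complete conversion, perfectly aligned read, invented sequences): a reference cytosine still read as C is called methylated, while a C read as T was unmethylated and converted.

```python
# Hedged sketch of single-read bisulfite methylation calling: comparing a
# bisulfite-converted read against the reference, a C that stays a C is
# called methylated, while a C read as T was unmethylated (converted to
# uracil and sequenced as thymine). Assumes complete bisulfite conversion.

def call_methylation(reference, bisulfite_read):
    """Return a list of (position, methylated?) for each cytosine in the
    reference, inferred from the aligned bisulfite-converted read."""
    calls = []
    for i, (ref_base, read_base) in enumerate(zip(reference, bisulfite_read)):
        if ref_base == "C":
            calls.append((i, read_base == "C"))  # C->C methylated, C->T not
    return calls
```

The untreated control and spike-in DNA mentioned in the text guard against the failure mode this sketch ignores: with incomplete conversion, an unmethylated C would still read as C and be miscalled as methylated.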

Transcriptome Sequencing (RNA-seq)

RNA-seq is particularly useful in assessing the current state of a cell or tissue and the possible effects of disease states or treatment conditions on the transcriptome. RNA-seq also provides information on the differences between the transcriptome and the exome that result from RNA editing. Although microarrays have revolutionized the study of transcriptomics and proved useful in determining gene expression profiles, RNA-seq by comparison is more sensitive, provides absolute quantification, is not affected by on-chip sequence biases, and gives additional information on gene expression levels and splice junction variants. 72,73

In RNA-seq, RNA is commonly first converted to more stable cDNA by reverse transcription, combined with a selection process to isolate the RNA of interest from the abundant rRNA. The input RNA quality is very important in RNA-seq preparation because RNAse enzymes are ubiquitous and extremely stable, and fragmentation can also occur simply when a divalent cation is present. Library preparation and sequencing of cDNA follow the same procedure as DNA-seq. However, numerous variations of RNA-seq library preparation have been developed, each with its benefits and limitations in terms of relative costs and input requirements. The main differences in these various library preparations are the methods of purifying and isolating the RNA of interest (mRNA, miRNA, full-length transcripts, etc.). RNA-seq libraries can be made using polyadenylated tail selection, not-so-random primers (for reverse transcription), and ribosomal depletion. 74–76 Isolating polyadenylated mRNA and then reverse transcribing it is the conventional method of preparing an RNA-seq sample, but it favors the 3′ end of transcripts, does not work well with low-quality or degraded samples, and does not provide any information about noncoding RNA. A commercially available kit (Clontech SMARTer) can generate full-length cDNA from high-quality, low-input RNA samples by using the 3′ poly-A tail as the priming site for first-strand cDNA synthesis and by enzymatically adding a specific primer hybridization site onto the 5′ end after first-strand synthesis for the second strand. Two other general methods have been commercially developed to selectively remove rRNA. For example, a method for selectively amplifying nonribosomal RNA offered in a sequencing preparation kit (NuGEN Ovation) uses a designed set of reverse transcription primers that contains all variants of random oligonucleotides (random primers), excluding the ones that would amplify ribosomal RNA (not-so-random primers).
Ribosomal depletion instead immobilizes ribosomal RNA to remove it before reverse transcription. Either of these methods can recover additional RNA signals, such as those from degraded and noncoding RNA, that would otherwise be lost with poly-A selection. For de novo transcript discovery and annotation, strand-specific RNA-seq is used to determine which strand of RNA was the original template in reverse transcription. Preprocessing RNA to select for polyadenylated mRNA, or selectively removing ribosomal RNA, allows a greater effective sequencing depth to be achieved. Depending on the experimental design, a greater sequencing depth may be required when complex genomes are being studied or when information on low-abundance transcripts or splice variants is required.
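The "not-so-random primer" idea above can be sketched computationally: enumerate every possible short primer and exclude those that would anneal to ribosomal RNA. The sketch below is illustrative only; the `not_so_random_primers` function, the hexamer length, and the rRNA fragment are all invented placeholders, not the commercial kit's actual design.

```python
from itertools import product


def not_so_random_primers(rrna_seqs, k=6):
    """Return all k-mer primers except those that would prime on rRNA.

    A primer anneals to the template it is complementary to, so we exclude
    k-mers whose reverse complement appears in any rRNA sequence.
    """
    comp = str.maketrans("ACGT", "TGCA")
    banned = set()
    for seq in rrna_seqs:
        for i in range(len(seq) - k + 1):
            template_kmer = seq[i:i + k]
            # the priming k-mer is the reverse complement of the template k-mer
            banned.add(template_kmer.translate(comp)[::-1])
    return [''.join(p) for p in product("ACGT", repeat=k)
            if ''.join(p) not in banned]


# toy rRNA fragment (placeholder, not a real ribosomal sequence)
primers = not_so_random_primers(["ACGTACGTGGCC"], k=6)
```

In practice the excluded set is built from the organism's real rRNA sequences, so the surviving primers reverse-transcribe everything except the ribosomal fraction.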

In general, bioinformatic analysis consists of aligning the sequence reads to a reference genome, assembling the reads into transcripts, and detecting differences in transcript expression between or among groups. The Tuxedo Suite, consisting of Bowtie, 77 Tophat, 78 and Cufflinks, 79 can be used as open-source software packages to perform these operations, and multiple updates to these packages have increased the speed and accuracy of RNA-seq analysis. Additional splice variants and alternative exon usage can be identified using software packages, such as mixture-of-isoforms 80 and differential exon usage in RNA-seq, 81 which can quantify reads at the level of individual exons. Although many of these software packages provide a probabilistic framework to identify changes in transcript splicing patterns, false discovery cutoffs are necessary to identify true splicing events.
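The quantification step can be illustrated with FPKM (fragments per kilobase of transcript per million mapped fragments), the length- and depth-normalized expression measure reported by Cufflinks. A minimal sketch follows; the gene names, counts, and transcript lengths are invented for illustration:

```python
def fpkm(counts, lengths_bp, total_fragments):
    """FPKM = fragments / (transcript length in kb * mapped fragments in millions).

    Normalizing by transcript length and by library size makes expression
    values comparable across genes and across samples.
    """
    return {gene: counts[gene] / ((lengths_bp[gene] / 1e3)
                                  * (total_fragments / 1e6))
            for gene in counts}


# hypothetical fragment counts for two cardiac genes
counts = {"Myh7": 5000, "Tbx20": 200}
lengths = {"Myh7": 6000, "Tbx20": 2000}  # transcript lengths in bp
values = fpkm(counts, lengths, total_fragments=20_000_000)
```

Here Myh7 yields 5000 / (6 kb × 20 M) ≈ 41.7 FPKM and Tbx20 yields 200 / (2 kb × 20 M) = 5.0 FPKM, so the raw 25-fold count difference corresponds to roughly an 8-fold difference in normalized expression.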

Transcriptome changes assessed by RNA-seq play a valuable role in cardiovascular medicine because they can track how cardiovascular diseases change over time. Lee et al 82 used RNA-seq to study how murine hearts change during heart failure, whereas Song et al 83 used RNA-seq to decipher the transcriptome differences between physiological hypertrophy and pathological hypertrophy. In addition, Hu et al 84 studied the mRNA and microRNA transcriptome changes that occur during pressure-overload hypertrophy in mouse hearts by HTS. By identifying which mRNAs and microRNAs changed during hypertrophy, and which microRNA–mRNA interactions occurred, using immunoprecipitation of the argonaute 2 RNA-induced silencing complex followed by sequencing, they demonstrated that small changes in microRNA expression can lead to global mRNA changes during heart stress. Combining RNA-seq with ChIP-seq information has also provided significant advances in the study of how transcription factor binding can influence changes in gene expression. To demonstrate the potential of this approach, RNA-seq was performed on hearts from Tbx20 knockout mice that had rapidly developed heart failure. By combining the transcriptome changes that occur from the loss of Tbx20 with the putative Tbx20 binding sites previously identified with ChIP-seq, 85 a comprehensive analysis of how the loss of Tbx20 leads to heart failure was achieved. 86 In addition, combining ChIP-seq with RNA-seq was also used to successfully identify genes and chromatin marks involved in the progression of cardiomyocyte differentiation from human induced pluripotent stem cells. 87

Ribosome Sequencing (Ribo-seq)

The RNA content of the cell does not automatically lead to the production of functional proteins. Although analysis of total RNA can give an overview of the RNA currently present in the cell, selecting the RNA that is bound to ribosomes offers a better indication of which RNA fragments are actively being translated. Studying the sequence of RNA bound to ribosomes is called ribosome footprinting, and sequencing these short RNA fragments is called Ribo-seq. 88,89 Ribo-seq is particularly useful when studying transient, tightly controlled translational events, such as those occurring during mitosis. 90 To perform Ribo-seq, cycloheximide treatment is first used to block the elongation phase of eukaryotic translation. 91 Cells are then lysed, and fragmentation of RNA is performed (RNase I treatment). Ribosomes containing short fragments of RNA are separated by ultracentrifugation, and the short RNA fragments are released from the ribosomes using proteases (proteinase K). These isolated short RNA fragments are sequenced to indicate which RNAs were being actively translated. Although this technique can provide valuable information on how the cell’s translational machinery operates, it is also more technically challenging, and consequently only a few studies to date have used Ribo-seq to investigate the processes of cardiovascular disease.
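A first computational step after sequencing these fragments is a length filter: a ribosome protects roughly 28 to 30 nt of mRNA, so reads far outside that window are likely rRNA contamination or degradation products rather than true footprints. A minimal sketch, with an invented helper name and toy reads:

```python
def filter_footprints(reads, min_len=26, max_len=32):
    """Keep only reads whose length is consistent with a ribosome-protected
    fragment (~28-30 nt); other lengths are likely contaminants such as
    rRNA fragments or degraded mRNA."""
    return [r for r in reads if min_len <= len(r) <= max_len]


# toy reads of various lengths (sequence content is irrelevant here)
reads = ["A" * 20, "A" * 29, "A" * 30, "A" * 45]
footprints = filter_footprints(reads)
```

Only after this filter are the surviving reads mapped back to the transcriptome to infer which messages were ribosome-occupied.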

Future of Sequencing Technology

Some observers have compared the pace of advancement in DNA sequencing with that seen in the computer industry, which has reduced costs and processing times exponentially since its inception. However, genetic sequencing is a less mature field and faces many more technical challenges ahead. For instance, accuracy is paramount for sequencing to become more widely applicable in the clinic, a goal that may be attainable with better algorithms that correct for reading errors together with advanced molecular biology techniques and applications. One application that may prove useful in cardiovascular medicine is cell-free DNA-seq and RNA-seq. Given that circulating whole blood contains cell-free nucleic acids, sequencing these nucleic acids may indicate the current state of the cardiovascular system. RNA is short lived in the presence of RNases, which makes RNA present in the blood a good temporal measure of what is occurring in the body at the time of extraction. Therefore, DNA-seq and RNA-seq from cell-free nucleic acids could prove to be a relatively noninvasive measure of cardiovascular health.

Recent advances have led to more precise control of picoliter-scale volumes and chemical reactions. The next milestone may be to sequence unamplified, unmodified native nucleic acids. Single-molecule technologies exist, such as the Pacific Biosciences RS, but they still require adaptor ligation to modify the sample before sequencing can take place. The future of sequencing technologies will likely involve methods of directly sequencing single molecules of DNA or RNA in native form from low-input starting material (eg, a few cells containing ≈50 pg of DNA/cell), without sacrificing accuracy or increasing cost. 92 Automated microfluidic sample preparation methods are being developed that can isolate a single cell’s genetic material and process it into a sequencing library all in one closed system.

Nanopore Sequencing Technologies

Nanopore sequencing technologies comprise a relatively new set of techniques being developed by companies such as Oxford Nanopore (http://www.nanoporetech.com/) and NABsys (http://www.nabsys.com/), which are working on massively parallel sequencers not based on sequencing by synthesis. Oxford Nanopore, for example, is developing a nanopore technology that may someday be capable of sequencing unmodified miRNA and mRNA molecules. The concept is to electrophorese molecules through a pore in a membrane and measure the electric current through the pore as the molecules pass through 93 (Figure 5). By characterizing the current across a pore over time, nanopore technology may be able to determine exactly what has traversed the pore and in which order. These companies hope to develop products that can determine an entire genome, sense and antisense, from small amounts of unmodified input in a short amount of time. 94 This technology may also someday be used to measure RNA and proteins directly; a major challenge will be to control the flow of DNA through the pore and to decipher the resulting signal. 95,96 To tackle this problem, Oxford Nanopore is using an approach in which an exonuclease releases one nucleotide at a time, whereas NABsys has adopted an approach using a series of oligonucleotide probes hybridized to the denatured sample that can subsequently be seen crossing the pore and positioned along the template molecule.
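The decoding problem described above can be caricatured in a few lines: assign each current sample to the nearest characteristic blockade level and collapse runs of identical calls into an ordered sequence of events. This is a deliberately simplified sketch; the `decode_current_trace` helper, the per-base current levels, and the trace values are all invented, and real basecallers must handle noise, dwell-time variation, and multi-base pore occupancy.

```python
def decode_current_trace(samples, levels):
    """Assign each current sample to the nearest reference level, then
    collapse consecutive identical calls into one event per molecule
    passing the pore. `levels` maps a base to its characteristic
    blockade current (values are illustrative, not measured)."""
    calls = []
    for s in samples:
        base = min(levels, key=lambda b: abs(levels[b] - s))
        if not calls or calls[-1] != base:
            calls.append(base)
    return "".join(calls)


# hypothetical blockade currents (pA) for each base
levels = {"A": 50.0, "C": 40.0, "G": 30.0, "T": 20.0}
# toy current trace: two dwell samples per base, roughly A, C, C, T, G
trace = [49.8, 50.1, 39.5, 40.2, 40.1, 20.3, 30.2]
seq = decode_current_trace(trace, levels)
```

The collapsing step is why controlling translocation speed matters: if a base dwells too briefly or two identical bases pass without an intervening level change, events merge and the read loses information.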

Figure 5. Nanopore technology. Third-generation sequencing is expected to measure the change in ion flow (current) within a membrane as small molecules are passed through a small pore inside the membrane. Different current profiles will, therefore, indicate which nucleotide passed through and in which order.

Conclusion

Advances in HTS technologies are enabling a more accurate and comprehensive representation of cardiac development and disease processes. Although more researchers are using HTS to study cardiovascular medicine, the full potential of current HTS platforms in cardiovascular medicine has yet to be realized. HTS will be essential in identifying biomarkers of disease, staging disease progression, and linking genotypes to phenotypic outcomes. Given the rapid pace of development in sequencing technology during the past decade, future sequencing technologies promise to further our understanding of the roles that the genome, transcriptome, and proteome play in the cell by identifying cellular mechanisms. This may also lead to deeper and more comprehensive insights into disease mechanisms at a subcellular level, possibly connecting causal effects to gene expression levels. As platforms increase in read length and sequencing depth while decreasing in sequencing time and cost, improvements in HTS applications will provide a more complete molecular picture of the functionality of biological processes. Ultimately, improved understanding of these biological processes may lead to dramatically safer and more effective therapies for cardiovascular diseases.