Why did high A+T content create problems for the Plasmodium falciparum genome project?

The main paper for the Plasmodium falciparum genome project (Gardner et al., 2002) repeatedly mentions that the unusually high A+T content (~80%) of the genome caused problems. For example, the authors imply that it prevented them from using a clone-by-clone approach:

Also, high-quality large insert libraries of (A + T)-rich P. falciparum DNA have never been constructed in Escherichia coli, which ruled out a clone-by-clone sequencing strategy.

And that it made gene annotation difficult:

The origin of many candidate organelle-derived genes could not be conclusively determined, in part due to the problems inherent in analysing genes of very high (A + T) content.

What is the biological significance of high A+T content, and why would it cause problems in genome sequencing?

Gardner, M.J., Hall, N., Fung, E., White, O., Berriman, M., Hyman, R.W., Carlton, J.M., Pain, A., Nelson, K.E., Bowman, S., Paulsen, I.T., James, K., Eisen, J.A., Rutherford, K., et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 419 (6906), 498-511.

The sequencing technologies developed over the last 20 years work best on DNA with a roughly balanced A+T/G+C composition. Both highly AT-rich and highly GC-rich regions are hard for them to process. Each technology has its own usable range; Illumina, to name one, prefers sequences in the middle of the range. If you sequence an AT-rich genome with the standard Illumina protocol, you will get an incomplete genome whose fragments do not faithfully reflect the original. Other technologies claim to be completely unbiased with respect to nucleotide content. Pacific Biosciences is one of them, and people who have analyzed the data produced by their machines seem to agree with that claim. Oxford Nanopore Technologies claims to have almost no bias, but as of today (2012-06-13) no external analysis has confirmed that.

Beyond sequencing problems, the software used to assemble and annotate the sequences may also be prone to errors in AT-rich and GC-rich regions. But many of those problems stem from the incompleteness of the sequencing.

I can't comment on how A+T richness complicates the sequencing process itself, but I can comment on complications that arise when annotating the sequence. Ab initio gene predictors are often based on hidden Markov models that are very sensitive to base composition in the genome (di-nucleotides, tri-nucleotides, etc). These gene finders typically perform very poorly if they are run on a genome whose base composition differs greatly from that of the genome on which they were trained. This could explain some of the difficulty the authors had with analyzing genes in the genome.
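To make the base-composition point concrete, here is a minimal Python sketch that computes A+T content and overlapping k-mer frequencies, the kind of statistics such models are sensitive to. The function names and the two toy sequences are made up for illustration; this is not how any particular gene finder is implemented.

```python
from collections import Counter


def at_content(seq: str) -> float:
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each k-mer, counted over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Two invented toy sequences (hypothetical, not real genomic data).
at_rich = "ATATTTAAATATAAATTTATATAAAT"
balanced = "ATGCGTACCGATTGCAAGCTTGACGT"

print("A+T content:", at_content(at_rich), at_content(balanced))
print("dinucleotides (AT-rich):", kmer_frequencies(at_rich, 2))
print("dinucleotides (balanced):", kmer_frequencies(balanced, 2))
```

A model whose parameters were estimated from something like the second sequence would assign very different probabilities to the dinucleotides that dominate the first.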

Often sequencing involves a step of amplification of genomic material. The standard way to perform this is with PCR, but PCR is biased and does not amplify very AT-rich regions well. With multiple rounds of PCR, even low-abundance regions that are not as AT-rich might come to dominate the sample and hide the AT-rich sequences.

This is not only a problem for de novo sequencing, but for many sequencing-based techniques (RNA-seq, ChIP-seq, your-favorite-seq… ). Alternative methods have been employed in Plasmodium, but they are not as standard (yet?).

See, for example, H2A.Z Demarcates Intergenic Regions of the Plasmodium falciparum Epigenome That Are Dynamically Marked by H3K9ac and H3K4me3 at

In the past, before massively parallel sequencing, they made a library of cloned sequences and transformed these into E. coli. High AT sequences are difficult to maintain in E. coli (perhaps due to similarity to promoters?).

A lot has already been said in previous answers, so I am just going to briefly add a few potential issues with strong AT/CG bias:

1) Potential for polymerase slippage due to homopolymers: this introduces errors in general because you may have unwanted indels in the reads as well as purely incorrect bases being incorporated. This is a problem that can happen even with PCR (although there are a lot of choices now if you want to spend). So in general higher error rates and higher read failure.

2) Difficulty of the machine to separate the signals of individual nucleotides for Sanger (it gets all blurred) or calibration errors with next gen sequencing. So higher read failure (bad quality).

3) Assuming everything is now fine, still lower complexity regions can be VERY hard to map, let alone assemble a complete genome from scratch.
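As a rough way to see points 1 and 3 in a sequence of your own, the sketch below finds homopolymer runs and computes a simple per-base Shannon entropy as a low-complexity indicator. The threshold of 5 bases and the function names are assumptions chosen for illustration, not values taken from any sequencing platform.

```python
import math
import re
from collections import Counter


def homopolymer_runs(seq: str, min_len: int = 5) -> list[tuple[int, str, int]]:
    """Return (start, base, length) for every run of a single base >= min_len."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


def shannon_entropy(seq: str) -> float:
    """Per-base entropy in bits; 2.0 means all four bases are equally common."""
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Invented example read (hypothetical).
read = "GCAAAAAAGCTTTTTC"
print(homopolymer_runs(read))
print(shannon_entropy(read), shannon_entropy("ATATATATATATATAT"))
```

Windows with many long runs or low entropy are the ones most likely to cause slippage errors and ambiguous mapping.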

Hope this helps!

Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray

The worldwide persistence of drug-resistant Plasmodium falciparum, the most lethal variety of human malaria, is a global health concern. The P. falciparum sequencing project has brought new opportunities for identifying molecular targets for antimalarial drug and vaccine development.


We developed a software package, ArrayOligoSelector, to design an open reading frame (ORF)-specific DNA microarray using the publicly available P. falciparum genome sequence. Each gene was represented by one or more long 70 mer oligonucleotides selected on the basis of uniqueness within the genome, exclusion of low-complexity sequence, balanced base composition and proximity to the 3' end. A first-generation microarray representing approximately 6,000 ORFs of the P. falciparum genome was constructed. Array performance was evaluated through the use of control oligonucleotide sets with increasing levels of introduced mutations, as well as traditional northern blotting. Using this array, we extensively characterized the gene-expression profile of the intraerythrocytic trophozoite and schizont stages of P. falciparum. The results revealed extensive transcriptional regulation of genes specialized for processes specific to these two stages.
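The selection criteria listed above (uniqueness, no low-complexity sequence, balanced base composition, proximity to the 3' end) can be sketched as a simple filter. This is only an illustration under stated assumptions: it is not ArrayOligoSelector's actual algorithm, uniqueness is approximated by exact string matching, and the function name, G+C range, and run-length limit are all hypothetical.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_oligo(orf, genome, length=70, gc_range=(0.2, 0.5), max_run=8):
    """Return (start, oligo) for the window nearest the 3' end passing all filters.

    Filters (all assumed for illustration): G+C fraction within gc_range,
    no single-base run longer than max_run, and exactly one exact occurrence
    of the oligo in genome. Returns None if no window qualifies.
    """
    long_run = re.compile(r"(.)\1{%d,}" % max_run)  # run of max_run + 1 or more
    for start in range(len(orf) - length, -1, -1):  # walk inward from the 3' end
        oligo = orf[start:start + length]
        if not gc_range[0] <= gc_fraction(oligo) <= gc_range[1]:
            continue
        if long_run.search(oligo):
            continue
        first = genome.find(oligo)
        if first == -1 or first != genome.rfind(oligo):  # absent or repeated
            continue
        return start, oligo
    return None


# Toy data (invented): short windows so the behaviour is easy to follow.
orf = "ATGGCATTACGATTAGCCATAAAAAAAA"
genome = "CCCC" + orf + "GGGG"
print(pick_oligo(orf, genome, length=10, gc_range=(0.2, 0.6), max_run=4))
```

In a genome as A+T-rich as P. falciparum, the composition and low-complexity filters reject many windows, which is one reason the chosen oligo may sit some distance from the 3' end.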


DNA microarrays based on long oligonucleotides are powerful tools for the functional annotation and exploration of the P. falciparum genome. Expression profiling of trophozoites and schizonts revealed genes associated with stage-specific processes and may serve as the basis for future drug targets and vaccine development.


Plasmodium vivax is the most widely distributed human malaria and is responsible for at least 70 million clinical cases each year and large socio-economic burdens for countries such as Brazil and India, where it is the most prevalent species [1]. Unfortunately, due to the problem of maintaining this parasite in continuous in vitro culture, the fact that vivax malaria is not as life threatening as falciparum malaria, the low parasitemias associated with natural human infections and the difficulty of adapting field isolates to growth in monkeys, research on P. vivax remains largely neglected. Moreover, the strict species-specificity of the naturally acquired antimalarial protective immune responses makes it unlikely that a vaccine against Plasmodium falciparum will be active against P. vivax. Together, these data call for a comprehensive research effort to study P. vivax.

A genomics approach was used to accelerate gene discovery in P. vivax by constructing a library in yeast artificial chromosomes using parasites obtained directly from a human patient [2]. Indeed, sequencing of a 155,771 bp telomeric YAC from this library revealed the existence of a multi-gene family termed vir (P. vivax variant genes). vir genes are most likely involved in immune evasion and represent at least 15% of the total gene content of the parasite, assuming a vir gene copy number of at least 600 copies per haploid genome [3]. Further sequencing of a 199,866 bp internal YAC clone from this same library identified 41 genes in conserved synteny with a region of chromosome 3 in P. falciparum, but found the YAC sequence to lack orthologs of the P. falciparum genes that code for cytoadherence phenotypes within the same region [4].

Large-scale sequence analysis of two mung-bean nuclease-digested genomic DNA libraries, the Pv MBN library from the Belem strain [5] and the Pv MBN library #30 from the Salvador I strain [6], has also accelerated gene discovery in P. vivax. Indeed, comparative in silico analyses of GSS sequences from these two libraries with GSS and EST sequences from libraries of P. falciparum and Plasmodium berghei increased by at least 10-fold the number of predicted P. vivax genes. Technical problems with extractions of poly(A) mRNA from P. vivax, however, have hampered the construction of cDNA libraries of the parasite destined for high-throughput sequencing [6]. Data on ESTs of P. vivax are, therefore, needed to validate these gene predictions and to create a gene index of this malaria parasite. Most important, data on ESTs of P. vivax will be key to assisting in the annotation of the genome of the El Salvador I strain presently sequenced to fivefold coverage by TIGR [7].

The construction of a P. vivax cDNA library obtained with parasite material collected directly from 10 different human patients in the Brazilian Amazon was recently reported [8]. This paper presents a survey of ESTs from this library, which includes similarity analyses, annotations and assignment of gene ontology terminology.

The malaria parasite Plasmodium falciparum faces drastic osmotic changes during kidney passages and is engaged in the massive biosynthesis of glycerolipids during its development in the blood-stage. We identified a single aquaglyceroporin (PfAQP) in the nearly finished genome of P. falciparum with highest similarity to the Escherichia coli glycerol facilitator (50.4%), but both canonical Asn-Pro-Ala (NPA) motifs in the pore region are changed to Asn-Leu-Ala (NLA) and Asn-Pro-Ser (NPS), respectively. Expression in Xenopus oocytes renders them highly permeable for both water and glycerol. Sugar alcohols up to five carbons and urea pass the pore. Mutation analyses of the NLA/NPS motifs showed their structural importance, but the symmetrical pore properties were maintained. PfAQP is expressed in blood-stage parasites throughout the development from rings via trophozoites to schizonts and is localized to the parasite but not to the erythrocyte cytoplasm or membrane. Its unique bi-functionality indicates functions in the protection from osmotic stress and efficiently provides access to the serum glycerol pool for the use in ATP generation and primarily in the phospholipid synthesis.

Published, JBC Papers in Press, November 29, 2001, DOI 10.1074/jbc.M110683200

This work was supported by the Deutsche Forschungsgemeinschaft and the Fonds der Chemischen Industrie. The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EBI Data Bank with accession number(s).

The Tree of Life

A few days ago I got an email from a colleague who I had not seen in many years. It was from Malcolm Gardner who worked at TIGR when I was there and is now at Seattle Biomed.

His email was related to the 2002 publication of the complete genome sequence of Plasmodium falciparum - the causative agent of most human malaria cases - for which he was the lead author. Someone had emailed Malcolm asking if he could provide details about the settings used in the blast searches that were part of the evolutionary analyses of the paper. The paper is freely available at Nature - at least for now - every once in a while the Nature Publishing Group seems to put it behind a paywall despite their promises not to.

Malcolm was contacting me because I had run / coordinated much of the evolutionary analysis reported in that paper. I note - as one of the only evolution focused people at TIGR it was pretty common for people to come to me and ask if I could help them with their genome. I pretty much always said yes since, well, I loved doing that kind of thing and it was really exciting in the early days of genome sequencing to be the first person to ask some evolution related question about the data.

Malcolm included the email he had received (which did not have a lot of detail) and he and I wrote back and forth trying to figure out exactly what this person wanted. And then I said, well, maybe the person should get in touch with me directly so I can figure out what they really want/need. It seemed unusual that someone was asking about something like that from a 10 year old paper, but, whatever.

As I was communicating with this person, I started digging through my files and my brain trying to remember exactly what had been done for this paper more than 10 years ago. I remember Malcolm and others from the Plasmodium community organizing some "jamborees" looking at the annotation of the genome. At one of those jamborees I met with some of the folks from the Sanger Center (which was one of the big players in the P. falciparum genome sequencing) with Malcolm and - after some discussion I ended up doing three main things relating to the paper, which I describe below.

Thing 1: Conserved eukaryote genes

One of my analyses was to use the genome to look for genes conserved in eukaryotes but not present in bacteria or archaea. I did this to try and find genes that could be considered likely to have been invented on the evolutionary branch leading up to the common ancestor of eukaryotes.

As an aside, at about the same time I was asked to write a News and Views for Nature about the publication of the Schizosaccharomyces pombe genome. In the N&V I wrote, "Genome sequencing: Brouhaha over the other yeast", I noted how the authors had used the genome to do some interesting analysis of conserved eukaryotic genes. With the help of the Nature staff I had also made a figure which demonstrated (sort of) what they were trying to do in their analysis - which was to find genes that originated on the branch leading up to the common ancestor of the eukaryotes for which genomes were available at the time. As another aside - the S. pombe genome paper and my News and Views article are freely available.

Figure 1: The tree of life, with the branches labelled according to Wood et al.'s analysis of genes that might be specific to eukaryotes versus prokaryotes, and to multicellular versus single-celled organisms. Bacteria and archaea are prokaryotes (they do not have nuclei). From Nature 415, 845-848 (21 February 2002) | doi:10.1038/nature725. The eukaryotic portion of the tree is based on Baldauf et al. 2000.

Anyway, I did a similar analysis to what was in the S. pombe genome paper and I found a reasonable number and helped write a section for the paper on this.

The list of genes is available as supplemental material on the Nature web site. Alas it is in MS Word format which is not the most useful thing. But more on that issue at the end of this post.

Thing 2: Searching for lineage specific duplications

Another aspect of comparative genomic analysis that I used to do for most genomes at TIGR was to look for lineage specific duplications (i.e., genes that have undergone duplications in the lineage of the species being studied to the exclusion of the lineages for which other genomes are available). The quick and dirty way we used to do this was to simply look for genes that had a better blast match to another gene from their own genome than to genes in any other genome. The list of genes we identified this way is also provided as a Word document in Supplemental materials.

Thing 3: Searching for organelle derived genes in the nuclear genome of P. falciparum

Thing 4: Analysis of DNA repair genes

Arnab Pain from the Sanger Center and I analyzed genes predicted to be involved in DNA repair and recombination processes and wrote a section for the paper:

Alas, I cannot for the life of me find what other parameters I used for the blastp searches. I am 99.9999% sure I used default settings but alas, I don't know what default settings for blast were in that era. And I am not even sure which version of blastp was installed on the TIGR computer systems then. I certainly need to do a better job of making sure everything I do is truly reproducible.


This all brings me to the actual real part of this story. Reproducibility. It is a big deal. Anyone should be able to reproduce what was done in a study. And alas, it is difficult to do that when not all the methods are fully described. And one should also provide intermediate results so that people do not have to redo everything you did in a study but can just reproduce part of it. It would be good to have, for example, released all the phylogenetic trees from the analysis of organellar genes in Plasmodium. Alas, I do not seem to have all of these files as they were stored in a directory at TIGR dedicated to this genome project and as I am no longer at TIGR I do not have ready access to that material. It is probably still lounging around somewhere on the JCVI computer systems (TIGR, alas, no longer officially exists; it was swallowed by the J. Craig Venter Institute). But I will keep digging and I will post them to some place like FigShare if/when I find them.

Perhaps more importantly, I will be working with my lab to make sure that in the future we store/record/make available EVERYTHING that would allow people to reproduce, re-analyze, re-jigger, re-whatever anything from our papers.

The key lesson - plan in advance for how you are going to share results, methods, data, etc.

Variability of coding sequences

After the observation that the clinical manifestation of iatrogenically induced malaria, employed in the treatment of Treponema pallidum infections, varied depending on the isolates obtained from different continents and resulted in species-specific and isolate-specific immunity, the concept of plasmodial diversity arose for the first time. Allozymes and antigenic proteins, in particular variants of the parasite's glucose phosphate isomerase (GPI) and LDH (21), were then the first plasmodial variants to be recognized. Variants of these enzymes were detected in individual malaria patients, proving the existence of mixed infections with differing P. falciparum clones in single individuals. Allelic variation and geographical differences of several genes were then demonstrated in a study on P. falciparum isolates from Asia, Africa and South America (30), and variant proteins could be distinguished with monoclonal antibodies.

LSA-1, a prime candidate for a subunit vaccine, is a 200-kDa protein expressed during the liver stage of P. falciparum and accumulated in the parasitophorous vacuole. Two epitopes of LSA-1 have been shown to induce cytotoxic T-cell responses in carriers of particular HLA-B variants. The gene coding for LSA-1 has a large central repeat region and non-repetitive sequences encoding the N and C terminal ends of the protein. Sequence data of the coding gene are limited; it is known, however, that non-synonymous mutations occur at the N-terminal and C-terminal ends.


Design and assembly of the synthetic gene

The design of the oligos used for synthesis of the 2.1 kb pfsub-1 gene necessitated great attention to detail, owing to the requirement for a large number to be mixed in one PCR. The nucleotide sequence of the gene was designed according to the P. pastoris codon usage preference (Bennetzen and Hall, 1982; Sreekrishna et al., 1993). In addition, the panel of oligos was rigorously screened and matched in order to meet the following criteria: (i) a decrease in the overall A+T content with the elimination of potential transcription termination signals; (ii) elimination of palindromic sequences conducive to stable intramolecular hairpins; (iii) minimization of tandem or inverted repeats (<10 bp in length) which are likely to give rise to non-specific priming; and (iv) optimization of the 20 nucleotide overlap between each 40-mer primer, to give a melting temperature in the range 58–62°C, in order to allow subsequent use of the primers for DNA sequencing. A Kozak consensus translation initiation sequence was incorporated in the extreme 5′ oligo for efficient expression of the gene in P. pastoris and an additional five histidine codons were introduced just prior to the stop codon in the extreme 3′ oligo. A number of unique restriction sites were introduced at strategic positions throughout the synthetic gene to facilitate subsequent gene manipulation and mutagenesis. Oligo design was performed with the aid of the Unix codon optimization program CODOP (see Materials and methods). This program translates a given DNA sequence into a protein sequence and then, using a user-defined codon usage table, back-translates the protein sequence with an improved codon usage. The program rejects codons with abundances below a cut-off value, then assigns a high-abundance codon to each residue in the protein sequence, using high abundance codons in proportion to their use in the codon usage table. Both strands of the sequence are then divided into overlapping oligos of 40 bases in length, melting temperatures are calculated for all the overlaps and restriction sites generated along the sequence are displayed. The resulting panel of oligos was then analysed using the Genetics Computer Group software package (GCG Version 8-Unix) for the presence of undesirable repeats, inverted repeats, stemloop structures and regions of complementarity which could potentially lead to non-specific intermolecular hybridization. In most cases these sequences were readily eliminated whilst maintaining the codon preference. Non-optimum codons were resorted to only if required to create unique restriction sites or at repetitive sequences. Systematic, reiterative use of these two programs resulted in the final selection of 104 unique oligos for gene synthesis. Table I shows a comparison of the codon composition of the synthetic gene with that of the wild-type P. falciparum gene. Codons not present in highly expressed yeast genes have been drastically decreased in frequency and a number of very rare codons eliminated. For example, 31 ATA (Ile) codons and 49 AAA (Lys) codons present in the native gene have been completely removed. The overall A+T composition has been reduced from 72% in the native gene to 53% in the synthetic product. The final, codon-optimized sequence of the synthetic pfsub-1 sequence and the relative positions of the 104 oligos is shown in Figure 1 together with the predicted amino acid sequence.
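The recoding and tiling steps described above can be illustrated with a simplified Python sketch. It is not CODOP: for brevity it always picks the single most-used codon above the cut-off rather than assigning codons in proportion to usage, it ignores melting temperatures and restriction sites, and the usage table, cut-off, and function names are invented for illustration.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TO_AA = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def recode(dna, usage, cutoff=0.15):
    """Replace each codon with the most-used synonymous codon at or above cutoff."""
    out = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        aa = CODON_TO_AA[codon]
        allowed = {c: f for c, f in usage.items()
                   if CODON_TO_AA[c] == aa and f >= cutoff}
        out.append(max(allowed, key=allowed.get) if allowed else codon)
    return "".join(out)


def tile_oligos(seq, length=40, overlap=20):
    """Split seq into overlapping oligos, alternating top and bottom strand."""
    step = length - overlap
    oligos = []
    for i, start in enumerate(range(0, len(seq) - overlap, step)):
        frag = seq[start:start + length]
        if i % 2:  # odd-numbered oligos come from the complementary strand
            frag = frag.translate(COMPLEMENT)[::-1]
        oligos.append(frag)
    return oligos


def at_content(seq):
    """Fraction of bases that are A or T."""
    return (seq.count("A") + seq.count("T")) / len(seq)


# Hypothetical usage table (made-up fractions, not real P. pastoris values).
usage = {"ATG": 1.0, "AAA": 0.3, "AAG": 0.7, "ATA": 0.1, "ATT": 0.5,
         "ATC": 0.4, "TAA": 0.5, "TAG": 0.2, "TGA": 0.3}
native = "ATGAAAATAAAATAA"  # toy ORF: Met-Lys-Ile-Lys-stop
optimized = recode(native, usage)
print(optimized, at_content(native), at_content(optimized))
print(tile_oligos("AAACCCGGGT", length=4, overlap=2))
```

With 40-mers and 20-nucleotide overlaps, this tiling of a 2,100-nucleotide sequence gives 104 oligos, which agrees with the number quoted above, although the authors' actual layout may differ in detail.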

The initial assembly reaction (Figure 2) involved the construction of the full-length gene from a stoichiometric mixture of the 104 oligos. An aliquot of this assembly reaction mixture was then used as a template for the amplification process, in which only the two outermost primers of the assembly were added, at a concentration of 1 μM each. Optimum yields of the PCR products using Pfu DNA polymerase were obtained with 3 mM MgSO4 in the PCR. Analysis of the two PCRs on 1% agarose gels revealed the presence of the 2.1 kb expected product (Figure 3A). In an alternative approach, the 5′ and 3′ 'halves' of the gene were synthesized separately, in the form of two DNA fragments of 1.1 and 1 kb, respectively (Figure 3B). The PCR conditions for the assembly reaction remained unchanged, although the number of the oligos in each assembly decreased to 56 and 50, respectively, for each reaction. The synthetic DNA products were blunt-end ligated into pMosBlue for cloning and sequencing. For sequencing reactions, primers hybridizing to sites 400 bp apart within the gene were chosen from the panel used in the synthesis, allowing coverage of both strands of the entire pfsub-1 sequence with consistent overlaps. Complete sequence analysis of three of the 2.1 kb clones and five of each of the smaller clones identified an average of only 3.5 nucleotide substitution errors per kb in the 2.1 kb PCR product and an average of only 1.5 errors per kb in each of the two 1.1 and 1 kb products. These mutations were distributed randomly, suggesting that the oligos were not the source of the errors. Since Pfu polymerase, which exhibits a 3′ → 5′ proofreading exonuclease activity, was used for all amplification steps, we assume that these errors were most likely introduced during the assembly process. The reduced mutation frequency observed in the smaller products was probably a direct result of the reduction in the number of oligos mixed together during the assembly reaction.

Protein expression

Expression from the synthetic pfsub-1 gene was initially assessed in P. pastoris. The PfSUB-1 signal sequence was replaced by the pre-pro domain of the S. cerevisiae α-mating factor by cloning the gene into SnaBI/EcoRI-digested pPIC9K and the linearized vector used to transform P. pastoris. Transformants containing multiple chromosomal insertions of the recombinant vector were selected and, following preliminary inductions to select the highest producing clones, a single clone was induced in a 4 l fermenter. Since N-glycosylation of blood-stage P. falciparum proteins is rare (Gowda et al., 1997), induction was performed in the presence of tunicamycin. No recombinant protein was secreted by the clone. Examination of total cell extracts by Western blotting showed that the induced recombinant product accumulated intracellularly in an insoluble form. Taking advantage of the C-terminal hexahistidine tag, the recombinant protein was purified under denaturing conditions from extracts of the induced clone by nickel chelate chromatography (Holzinger et al., 1996) (Figure 4). From these purification data, the expression level of the recombinant PfSUB-1 was estimated at 0.2–0.5 g/l. N-terminal amino acid sequencing of the purified protein showed that the α-factor N-terminal secretory signal sequence had been removed whereas the α-factor pro domain was still present, suggesting that the protein had undergone translocation into the yeast ER but not been further processed.

The baculovirus system was next considered as an alternative expression system which might better support proper folding and post-translational processing of PfSUB-1. The codon usage of Autographa californica nuclear polyhedrosis virus (AcMNPV) is less stringent than that of P. pastoris (Ranjan and Hasnain, 1994), so it was considered that our synthetic gene should also be well expressed in the baculovirus system. Infection of High Five™ insect cells in the presence of tunicamycin at 0.5 μg/ml resulted in secretion of readily detectable levels of apparently correctly processed PfSUB-1 (Figure 5). Preliminary purification runs with the secreted product indicated that expression levels of the recombinant protein were of the order of 2–5 mg/l (not shown).

Conclusions and perspectives

It is now clear that members from two protein families are involved in the host cell selection processes that define the red blood cell tropism of the malaria parasite merozoite. The diversity of these proteins within the genomes of different Plasmodium species allows us to propose a hypothetical model where the combinations of ligands and the control of their expression could account for the distinct invasive behaviour of merozoites in the various species. However, neither conclusive evidence for this model nor the fine details of the interactions that underpin the selection of red blood cells for invasion are as yet at hand. There is little doubt that the ability to genetically modify the parasite has led to quantum advances in our perception of these invasive mechanisms, and that ongoing systematic analyses of parasites with disrupted RBL and/or EBL genes will yield further knowledge. This approach is unlikely to provide full insight into the nature of red blood cell selection, unless it is associated with studies where the red blood cell receptors are also defined. The red blood cell variants that are found in nature constitute a rich source of material for these studies, but these are likely to be insufficient. For those parasites that can be maintained routinely in vitro (P. falciparum and P. knowlesi) the challenge will be to devise ways to reproducibly obtain homogeneous populations of red blood cells with defined receptor characteristics. Advances in the genetic manipulation of erythroblasts and in the generation of large numbers of erythrocytes from these progenitor cells might provide a solution.

The demonstration that switches to alternative invasion pathways, in which different ligands and receptors are implicated, are easily obtained for P. falciparum under laboratory conditions, and evidence that this also occurs in parasites circulating in endemic residents, is of concern for malaria vaccines based on RBL or EBL proteins. The fact that parasites that do not express EBA175, the leading candidate for such vaccines, are still able to multiply (Duraisingh et al., 2003b) might translate into the selection of escape variants if this vaccine is deployed. However, pessimism must be tempered as it is not actually known whether these variant parasite lines that grow under laboratory conditions would be viable in vivo. It has nonetheless become necessary to consider inclusion of two or more RBL and EBL ligands in any vaccine intended to prevent interactions between the merozoite and the red blood cell.

Finally, one of the central assumptions concerning the RBL and EBL proteins is that they are solely involved in the invasion of red blood cells. This has primarily arisen from the fact that these proteins were first found associated with merozoites. Moreover, for many of these proteins binding to red blood cells could be demonstrated, and for some a red blood cell receptor was identified. Finally, antibodies raised against a number of these proteins can inhibit or alter the tropism and invasive profiles of merozoites. There are two relatively recent studies that question the validity of this assumption. First, the expression of distinct subsets of Py235 genes (RBL family) was demonstrated (both at the transcriptional and protein levels) in the sporozoite and in the hepatic parasite (Preiser et al., 2002). Second, EBA175 (EBL family), an important P. falciparum anti-red blood cell invasion vaccine candidate, was also shown to be expressed on the surface of sporozoites and in the infected hepatocyte (Grüner et al., 2001). Whereas expression of these proteins might be expected in hepatic merozoites that are destined to invade red blood cells, their role in the biology of the sporozoite, a parasite form that interacts with mosquito salivary glands and with the cells in the skin where it is deposited by the infective bite before invading hepatocytes, is yet to be explored. Systematic studies to determine which of the other members of the RBL and EBL families are expressed during the pre-erythrocytic stages of Plasmodium are currently underway. This might result in the identification of some ligands that specifically interact with a single host cell type, and others that might be part of the set of parasite proteins implicated in all invasive events.

In conclusion, elucidation of the molecular mechanisms underlying host cell tropism and invasion in Plasmodium parasites presents researchers with a formidable challenge, both technically and intellectually. The resources that would be required to achieve this goal are justified by the central role these processes play in the survival of the parasite and by the possibility that the knowledge to be gained might yield novel and efficient strategies to control the infection.


Quantifying the diversity of major surface antigens underlying immune evasion of HIV-1, HIV-2, and Influenza A has been central to characterizing the transmission dynamics of these important human pathogens. In addition, documentation of variation data has provided a basis for the development of candidate vaccine targets [1,2]. Surprisingly, in-depth molecular epidemiological sampling and population genomic analyses of the var genes encoding the major blood stage surface antigen of the malaria parasite, Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), have not been done. This is largely due to the inherent difficulties in the population genomic analysis of highly diverse multigene families. Because of the importance of these genes to the biology of P. falciparum, we set out to develop and evaluate a rapid, molecular epidemiological population genomic framework to investigate var gene diversity in natural parasite populations.

To achieve chronic infection, malaria parasites evade the host immune response by switching PfEMP1 isoforms through differential expression of members of the var multigene family [3–5]. PfEMP1 is expressed on the surface of blood stage parasites known as trophozoites [6] and the transmission (early gametocyte) stages [7]. Parasite adhesion occurs in the deep vasculature of host tissues by binding of PfEMP1 to host endothelial cell receptors. Some PfEMP1-adhesion interactions are proposed to lead to severe disease manifestations such as cerebral and placental malaria (reviewed in [8]). Variant-specific anti-PfEMP1 antibodies are believed to contribute to the regulation of parasite density in a manner that decreases the incidence of clinical disease [9–13]. This immunity may reduce the duration of infection in a variant-specific manner to drive the dynamics of multiple infections in semi-immune children [14] and induced infections in humans [15]. Immunity to PfEMP1 can thereby influence transmission by reducing persistence. It may also reduce transmission by regulating the density of asexual blood stages with potential to become transmission stages and by directly targeting early gametocytes to prevent the maturation of transmission stages [16]. Consequently, diversity of PfEMP1 or var genes can promote transmission success through immune evasion.

Describing the diversity of var genes presents a more complex problem than assessing the diversity of the single copy major surface antigen genes of HIV and Influenza A. Individual P. falciparum genomes have repertoires of var genes that can recombine with other repertoires during the obligatory sexual phase of the life cycle in the mosquito [17–19]. There is also circumstantial evidence for ectopic recombination among var genes within the same genome, possibly during both meiosis and mitosis [20–22]. Therefore, there is enormous potential to generate diversity, even among closely related genomes. Var genes are large (at least 5 kb) and complex [23–25], encoding variable numbers and classes of the adhesion domains, Duffy binding like (DBL: α, β, δ, ɛ, and γ classes) and cysteine interdomain rich (CIDR: α, β, and γ classes) [26]. Both size and complexity preclude population genomic analysis of the full var gene sequences. The DBLα encoding domain was used previously as a marker of var genes in investigations of diversity [21,22,27–30] and expression [31–33]. The small size (∼1 kb) and ubiquitous presence of DBLα among var genes [24�,34] make this domain a suitable population genomic marker.

Previous analyses of var gene diversity have examined DBLα domain sequences from the var gene repertoires of allopatric (distantly related) [22,27,30] or just a few sympatric (closely related) [28,29] isolates. These studies have established that any pair of DBLα sequences from the same genome is on average as diverse as any pair from two different genomes, with amino acid identity as low as 45% [3,22,27,28]. This has made it impossible to identify var gene orthologs among genomes. Limited overlap (shared DBLα sequences) among var repertoires from sympatric isolates has also been reported [27�], suggesting that many genomes must be sampled to see the extent of diversity of these genes in natural populations. Given the importance of var genes to transmission, a systematic sequencing effort and population genomic analysis are needed to examine var gene diversity in sufficient depth to estimate levels of antigenic diversity within natural populations.

A high-throughput population genomic framework was developed to address sampling issues specific to the molecular epidemiological analysis of diverse multigene families. This allowed the random sampling of the var gene repertoires of culture-adapted and field isolates of P. falciparum by sequencing DBLα domains as population genomic markers (Figure 1). Var genes were sampled from a “global” collection of isolates, including clones 3D7 and HB3 used in genome sequencing projects ([34]; D. Wirth, personal communication). DBLα sequences from these isolates were used to validate the framework. The “global” DBLα sequences were combined with available data from previous studies and compared with those obtained from a local population in Papua New Guinea (PNG). The results show immense levels of diversity among the var genes with strong evidence of geographic structuring of variation. We demonstrate patterns of similarity among sequences that suggest the widespread action of recombination in creating and maintaining diversity.

The cumulative diversity of DBLα was determined by comparing each new DBLα repertoire to the previous repertoire(s) using a 96% DNA sequence identity cut-off. We plotted the number of accumulating “types” (distinct sequences) against the number of “sequences” compared. For example, the first point on the plot is the number of types found in isolates A and B together (y-axis) against the total number of sequences compared (x-axis). The next point is the number of types found in isolates A, B, and C against the total number of sequences compared. The plot shown here is the average curve of 1,000 permutations of the order in which isolates were compared (i.e., isolates A, B, then C or A, C, then B or B, C, then A).
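
The accumulation procedure described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' pipeline: sequences are assumed to be pre-aligned to equal length so that identity is a simple per-position match fraction, types are assigned greedily against one representative per type, and the mean curve averages both coordinates at each step. All function names (`identity`, `add_to_types`, `accumulation_curve`, `mean_curve`) are hypothetical.

```python
import random
from statistics import mean


def identity(a, b):
    """Fraction of matching positions (assumes equal-length, pre-aligned sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def add_to_types(representatives, seq, cutoff=0.96):
    """Greedy typing (an assumption): seq founds a new type unless it
    reaches the identity cutoff against an existing representative."""
    if not any(identity(seq, rep) >= cutoff for rep in representatives):
        representatives.append(seq)


def accumulation_curve(repertoires, order, cutoff=0.96):
    """Return [(sequences compared, types found)] as isolates are added in order."""
    representatives, n_seqs, points = [], 0, []
    for isolate in order:
        for seq in repertoires[isolate]:
            add_to_types(representatives, seq, cutoff)
            n_seqs += 1
        points.append((n_seqs, len(representatives)))
    return points[1:]  # the first plotted point already combines two isolates


def mean_curve(repertoires, n_permutations=1000, cutoff=0.96, seed=0):
    """Average the curve over random orderings of the isolates."""
    rng = random.Random(seed)
    isolates = list(repertoires)
    curves = []
    for _ in range(n_permutations):
        order = isolates[:]
        rng.shuffle(order)
        curves.append(accumulation_curve(repertoires, order, cutoff))
    # Average x and y separately at each step across all permutations.
    return [(mean(p[0] for p in step), mean(p[1] for p in step))
            for step in zip(*curves)]


# Toy repertoires: s2 differs from s1 at 1 of 50 positions (98% identity).
s1, s2, s3, s4 = "A" * 50, "A" * 49 + "T", "C" * 50, "G" * 50
toy = {"A": [s1, s3], "B": [s2, s4], "C": [s1, s4]}
print(accumulation_curve(toy, ["A", "B", "C"]))
print(mean_curve(toy, n_permutations=100))
```

With real DBLα data the identity function would need to come from a pairwise alignment, and the choice of clustering method can change the number of types, so the greedy step here should be read as a placeholder.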

Author information


Department of Immunology and Infectious Disease, Harvard School of Public Health, 665 Huntington Avenue, Boston, 02115, Massachusetts, USA

Sarah K. Volkman & Dyann F. Wirth

Broad Institute, 7 Cambridge Center, Cambridge, 02142, Massachusetts, USA

Sarah K. Volkman, Daniel E. Neafsey, Stephen F. Schaffner, Daniel J. Park & Dyann F. Wirth

School for Nursing and Health Sciences, Simmons College, 300 The Fenway, Boston, 02115, Massachusetts, USA

Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, 02138, Massachusetts, USA