How to build a phylogenetic tree without an outgroup?


I have whole-genome aligned sequences from four beetle populations of the same species. I wish to construct a phylogenetic tree with the four. However, I am unable to find a suitable outgroup for the species, so I cannot use an outgroup to root the tree. Is there a particular method that one can use to construct a tree without an outgroup? I found that the MEGA software does this well. What is the theory behind constructing a tree without an outgroup? What are the implications of constructing a tree without an outgroup? Are the distances between trees measured relative to one another?

Most classic phylogeny reconstruction algorithms root the tree a posteriori, based on the outgroup chosen by the user. The tree is actually inferred and internally represented without a root.

Therefore, if you use a program that asks you for an outgroup, you can most likely just choose an arbitrary one and later "de-root" the resulting tree.
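To make the point about rooting concrete, here is a minimal Python sketch (not taken from MEGA or any other package). It assumes trees are written as nested tuples and uses hypothetical population labels pop_A to pop_D. It shows that different rootings of one tree share the same set of splits, which is what "de-rooting" preserves.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial splits (bipartitions) of the leaves.

    Each split is recorded as the side that does NOT contain a fixed
    reference leaf, so the result does not depend on where the root sits.
    """
    all_leaves = leaves(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


# Three rootings of the same unrooted tree AB|CD (hypothetical labels).
rooted_on_middle = (("pop_A", "pop_B"), ("pop_C", "pop_D"))
rooted_on_a = ("pop_A", ("pop_B", ("pop_C", "pop_D")))
rooted_on_d = ((("pop_A", "pop_B"), "pop_C"), "pop_D")
# A genuinely different topology, AC|BD.
other_topology = (("pop_A", "pop_C"), ("pop_B", "pop_D"))

print(splits(rooted_on_middle) == splits(rooted_on_a) == splits(rooted_on_d))
print(splits(rooted_on_middle) == splits(other_topology))
```

The first comparison prints True and the second False: moving the root changes how the tree is drawn, not which groupings the data support.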

Student construction of phylogenetic trees in an introductory biology course

Phylogenetic trees have become increasingly essential across biology disciplines. Consequently, learning about phylogenetic trees has become an important component of biology education and an area of interest for biology education research. Construction tasks, in which students generate phylogenetic trees from some type of data, are often used for instruction. However, the impact of these exercises on student learning is uncertain, in part due to our fragmented knowledge of what students construct during the tasks. The goal of this project was to develop a more robust method for describing student-generated phylogenetic trees, which will support future investigations that attempt to link construction tasks with student learning.


Through iterative examination of data from an introductory biology course, we developed a method for describing student-generated phylogenetic trees in terms of style, conventionality, and accuracy. Students used the diagonal style more often than the bracket style for construction tasks. The majority of phylogenetic trees were constructed conventionally, and variable orientation of branches was the most common unconventional feature. In addition, the majority of phylogenetic trees were generated correctly (no errors) or adequately (minor errors only) in terms of accuracy. Suggesting extant taxa are descended from other extant taxa was the most common major error, while empty branches and extra nodes were very common minor errors.


The method we developed to describe student-constructed phylogenetic trees uncovered several trends that warrant further investigation. For example, while diagonal and bracket phylogenetic trees contain equivalent information, student preference for using the diagonal style could impact comprehension. In addition, despite a lack of explicit instruction, students generated phylogenetic trees that were largely conventional and accurate. Surprisingly, accuracy and conventionality were also dependent on each other. Our method for describing phylogenetic trees constructed by students is based on data from one introductory biology course at one institution, and the results are likely limited. We encourage researchers to use our method as a baseline for developing a more generalizable tool, which will support future investigations that attempt to link construction tasks with student learning.

Materials and Methods

Taxon and Character Sampling

To test the relationships of the thalattosuchian crocodylomorphs, I performed a phylogenetic analysis of 394 morphological characters scored for eight outgroup and 78 ingroup taxa, including 24 thalattosuchian species (online Appendix 1, available as Supplementary Material on Dryad). This new data set is a modified version of that presented in Wilberg (2015), with the addition of 10 new characters and the modification of many others (online Appendix 2, available as Supplementary Material on Dryad). To minimize errors in character coding, I focused ingroup sampling on specimens I could observe firsthand or those with detailed published descriptions. I made an effort to sample widely from all major crocodylomorph groups. Taxon sampling within Thalattosuchia focused on capturing the broad range of morphologies present in the group across their entire temporal duration. Outgroup sampling was increased from previous analyses with the intent of better characterizing the distribution of character states in noncrocodyliforms. The basal suchian Gracilisuchus was used to root the tree based on its position in the broad scale analysis of Archosauria by Nesbitt (2011). The rauisuchid (sensu Nesbitt 2011) Postosuchus kirkpatricki was included for two primary reasons. First, Rauisuchidae has frequently been recovered as the sister group to Crocodylomorpha, just outside the phylogenetically unstable “Sphenosuchia” (e.g., Benton and Clark 1988; Parrish 1993; Juul 1994; Nesbitt 2011). Second, Postosuchus kirkpatricki is well known from multiple specimens representing nearly the complete skeleton, allowing the scoring of most characters. Six “sphenosuchian” taxa were also sampled. Three of these have been recovered as the sister taxon to Crocodyliformes in previous analyses (Junggarsuchus sloani, Clark et al. 2004; Kayentasuchus walkeri, Nesbitt 2011; Almadasuchus figarii, Pol et al. 2013). The inclusion of these taxa will provide a more stringent test of the potential placement of Thalattosuchia as the sister group to Crocodyliformes. To assess the sensitivity of the topology to outgroup sampling, the analysis was also run in three permutations: excluding the basal suchian Gracilisuchus (rooting on Postosuchus); excluding the noncrocodylomorph taxa Gracilisuchus and Postosuchus (rooting on Hesperosuchus agilis); and excluding all noncrocodyliforms and rooting on the protosuchian Orthosuchus stormbergi, as in some published analyses (e.g., Sereno and Larsson 2009).

As with any paleontological phylogenetic analysis, the study data set contains relatively high amounts of missing data (40.75% missing or inapplicable). Much of the missing data is concentrated in the postcranial characters, as numerous crocodylomorph taxa are known primarily from cranial material. Three taxa (Zaraasuchus shepardi, Eoneustes gaudryi, and Steneosaurus brevidens) are highly incomplete (80–82%), whereas median incompleteness per taxon is ∼36%. However, while missing data has been shown to reduce phylogenetic accuracy (e.g., Wiens 2003; Prevosti and Chemisquy 2010 and references therein), the quantity of missing data does not directly correlate with the information content of a taxon. A highly incomplete taxon may still increase resolution if it contains informative synapomorphic information (Kearney and Clark 2003; Wiens 2003).
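The incompleteness figures quoted above are simple proportions of unscored cells. As a rough illustration only, the Python sketch below computes them for a toy matrix. The taxon names and scores are invented, and treating both "?" (missing) and "-" (inapplicable) as incomplete is an assumption about the coding convention.

```python
from statistics import median

# Assumed coding: "?" = missing, "-" = inapplicable.
MISSING = {"?", "-"}


def incompleteness(scores):
    """Fraction of characters that are missing or inapplicable for one taxon."""
    return sum(ch in MISSING for ch in scores) / len(scores)


# Toy matrix: one string of character scores per (hypothetical) taxon.
matrix = {
    "taxon_1": "0110?1-0",
    "taxon_2": "01??????",
    "taxon_3": "01101100",
}

per_taxon = {name: incompleteness(scores) for name, scores in matrix.items()}
total_cells = sum(len(scores) for scores in matrix.values())
missing_cells = sum(ch in MISSING for scores in matrix.values() for ch in scores)
overall = missing_cells / total_cells
median_incompleteness = median(per_taxon.values())

print(per_taxon, overall, median_incompleteness)
```

A per-taxon percentage says nothing about which characters are scored, which is why a very incomplete taxon can still be informative.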

Parsimony Analysis

The phylogenetic data set was analyzed in TNT v1.1 (Goloboff et al. 2008) using equally weighted parsimony. Minimal length trees were found using a heuristic search with 1000 replicates of Wagner trees using random addition sequences followed by tree bisection and reconnection (TBR) branch swapping. The shortest trees obtained from these replicates were subjected to a final round of TBR branch swapping to ensure all minimum length trees were discovered. Zero length branches were collapsed if they lacked support under any of the minimal length trees (Rule 1 of Coddington and Scharff 1994). Two separate analyses were run. In the first, to test the effect of potentially nested sets of homologies present in some multistate characters, 36 characters were treated as ordered (online Appendix 2, available as Supplementary Material on Dryad). In the second, multistate characters were treated as unordered to avoid making a priori assumptions about the process of evolution (though whether treating such characters as unordered involves better justified assumptions has been questioned; e.g., Lipscomb 1992; Slowinski 1993).
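For readers unfamiliar with how a parsimony score is obtained, the sketch below counts the minimum number of changes for one unordered character on a fixed binary tree using Fitch's algorithm. It is only an illustration: the taxa and scores are hypothetical, and it does not reproduce TNT's tree search or the handling of ordered characters.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one character.

    `tree` is a binary nested tuple of taxon names; `states` maps each
    taxon name to its observed state for a single unordered character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, states)
    right_states, right_cost = fitch(right, states)
    shared = left_states & right_states
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_cost + right_cost
    # No state in common: one change must occur on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1


# Hypothetical character scored for four taxa.
states = {"A": 1, "B": 1, "C": 0, "D": 0}
tree_1 = ((("A", "B"), "C"), "D")
tree_2 = ((("A", "C"), "B"), "D")

length_1 = fitch(tree_1, states)[1]
length_2 = fitch(tree_2, states)[1]
print(length_1, length_2)
```

Here tree_1 needs one change and tree_2 needs two, so for this single character tree_1 is the more parsimonious. A full analysis sums such lengths over all characters and searches across many candidate trees.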

Nodal Support

Nodal support was assessed using jackknife resampling as applied to character data (Farris et al. 1996). Jackknife support was calculated in TNT using 1000 replicates with the probability of independent character removal set at 0.37 (∼e⁻¹, as recommended in Farris et al. 1996). A heuristic search was employed with each replicate consisting of 10 random addition sequences, saving 10 trees per replicate. The resulting topologies were summarized using GC frequencies (difference between the frequency of recovering a given group and the most frequent contradictory group; Goloboff et al. 2003). GC frequencies are preferred over absolute frequencies (the standard method of counting frequencies in bootstrap and jackknife analyses) because they account for the evidence in support of a clade as well as the amount of evidence falsifying that clade.
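The two ideas in this paragraph, independent character removal and GC frequencies, can be sketched in a few lines of Python. This is an illustrative reading of the definitions given above, not TNT's implementation. The clade sets per replicate are invented, and clades are treated as rooted groups for simplicity.

```python
import math
import random


def jackknife_sample(n_characters, removal_probability=math.exp(-1), rng=random):
    """Return the indices of characters kept in one jackknife replicate."""
    return [i for i in range(n_characters) if rng.random() >= removal_probability]


def conflicts(group_a, group_b):
    """Two groups contradict each other if they overlap without nesting."""
    return bool(group_a & group_b) and not (group_a <= group_b or group_b <= group_a)


def gc_frequency(group, replicate_clades):
    """Frequency of `group` minus that of its most frequent contradictory group."""
    n_replicates = len(replicate_clades)
    group_count = sum(group in clades for clades in replicate_clades)
    rival_counts = {}
    for clades in replicate_clades:
        for clade in clades:
            if conflicts(group, clade):
                rival_counts[clade] = rival_counts.get(clade, 0) + 1
    strongest_rival = max(rival_counts.values(), default=0)
    return (group_count - strongest_rival) / n_replicates


# Hypothetical summary of 10 replicates: the clades recovered in each one.
ab, ac, bc = frozenset("AB"), frozenset("AC"), frozenset("BC")
replicates = [{ab}] * 7 + [{ac}] * 2 + [{bc}]

kept = jackknife_sample(394, rng=random.Random(0))
print(len(kept), gc_frequency(ab, replicates))
```

In this toy example the group A+B is recovered in 7 of 10 replicates and its strongest rival (A+C) in 2, so its GC value is 0.5 rather than the absolute frequency of 0.7.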

Comparative Matrices

To assess the effect of outgroup sampling on tree topology, two previously published crocodylomorph taxon-character matrices (Turner and Buckley 2008; Sereno and Larsson 2009) were investigated. The analysis of Turner and Buckley (2008) consists of 75 taxa and 290 characters and includes Gracilisuchus stipanicicorum, Terrestrisuchus gracilis, and Dibothrosuchus elaphros as outgroup taxa (rooted on Gracilisuchus). The analysis of Sereno and Larsson (2009) includes 43 taxa and 252 characters (rooted on the protosuchian Orthosuchus stormbergi). Both matrices were unaltered with the exception of the addition of new outgroup taxa. In the case of Turner and Buckley (2008), the single terminal taxon Postosuchus kirkpatricki was added. For comparative purposes, both Postosuchus and Gracilisuchus were added to the data set of Sereno and Larsson (2009). These data sets were analyzed using unweighted parsimony in TNT v. 1.1 and the same search parameters described above. Both analyses incorporated additive characters, and these were retained as such. Gracilisuchus was set as the root for both matrices. All phylogenetic data sets are available as Supplementary Material on Dryad.

How to build a phylogenetic tree in Geneious Prime

Phylogenetic trees are used to infer evolutionary relationships among sequences. Geneious can build phylogenetic trees using distance, maximum likelihood or Bayesian methods. This guide describes the basic steps to build a tree and manipulate the tree viewer in Geneious.

Before you embark on building your tree, you should familiarize yourself with the principles of tree-building and the strengths and weaknesses of each method. The review below is a good place to start.

1. Align your sequences

Before you can build a phylogenetic tree, you need to align your sequences. To do this, select all your sequences and choose Align/Assemble - Multiple Alignment. This link provides a guide to the available algorithms.

Once you are happy with your alignment, select it and click Tree to open the tree building options.

2. Choose your tree builder and parameters

At the top of the tree building options you’ll see the available tree building algorithms. This includes the built-in Geneious Tree Builder (and Consensus Tree builder), and any plugins you have installed.

The Geneious Tree Builder produces distance trees using either Neighbor-Joining or UPGMA methods. In addition, the following plugins are available for producing maximum likelihood, parsimony or Bayesian trees:

RAxML - Maximum likelihood, optimized for large datasets

FastTree - Approximate maximum likelihood, for extremely large datasets

PAUP* - Parsimony or maximum likelihood (requires your own copy of PAUP*, either version 4.0b10, or 4.0a149 and above from here)

More information on the maximum likelihood tree-builders is available at this link.

Each tree builder has a different interface for specifying the evolutionary model and other parameters. We suggest you consult the user manual for each tree builder to familiarize yourself with the available options. You may also wish to use a program like Modeltest outside of Geneious to determine the best model for your data prior to building the tree.
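To give a feel for what a distance method does, here is a compact Python sketch of UPGMA, one of the two methods the built-in Geneious Tree Builder offers. It is a textbook-style illustration, not Geneious code. The taxon names and distances are made up, and the function returns only the topology as nested tuples.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    `dist` maps frozenset({a, b}) to the distance between taxa a and b
    and must contain every pair of names.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    while len(sizes) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        a, b = sorted(pair, key=str)
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for other in sizes:
            # Size-weighted average of the distances to the two old clusters.
            d[frozenset({merged, other})] = (
                size_a * d.pop(frozenset({a, other}))
                + size_b * d.pop(frozenset({b, other}))
            ) / (size_a + size_b)
        del d[pair]
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances for four sequences.
names = ["A", "B", "C", "D"]
dist = {
    frozenset({"A", "B"}): 2.0,
    frozenset({"C", "D"}): 4.0,
    frozenset({"A", "C"}): 10.0,
    frozenset({"A", "D"}): 10.0,
    frozenset({"B", "C"}): 10.0,
    frozenset({"B", "D"}): 10.0,
}

tree = upgma(names, dist)
print(tree)
```

UPGMA always returns a rooted tree because it assumes a constant rate of change along all lineages, whereas Neighbor-Joining makes no such assumption and yields an unrooted tree. That difference matters when deciding how much weight to give the root position.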

3. Run your tree

Click OK to start your tree building. The time it takes to build the tree will vary depending on the algorithm you have chosen, the size of your alignment and the parameters (such as number of bootstrap replicates) you have chosen. Distance trees normally complete fairly quickly (within minutes), but maximum likelihood and Bayesian trees may take hours or even days to run.

4. View your tree

When your tree has finished running, a new tree document will be created and it will automatically open in the viewer. By default, Geneious displays trees in rectangular (rooted) layout, even if the tree is unrooted. Options for circular or radial formats can be found under the General tab, along with the Zoom controls.

If you wish to root your tree, click on the node of the taxon you wish to specify as the outgroup and click Root. To flip the position of taxa vertically, without changing the topology, use the Swap Siblings option.

The controls at the top of the viewer also contain options for coloring and setting the font sizes on your tree. To color an entire clade, select the node at the base of the clade and select Color Nodes.

You need bootstrap support using a model-based tree building algorithm, via maximum likelihood (a few people use Bayes). The file format is relaxed phylip format (please submit a separate question if you have difficulties here; it's a bit tricky).

I use standard RAxML here, specifically raxmlHPC (easily downloadable and compiles on Linux and OSX). The codes are quite complicated and I've given them below.

A robust maximum likelihood tree is,

This tree will bootstrap for 500 replications, however to start I would use 100 replications.

Make a consensus tree of the bootstraps,

You require bootstrap support >80% and please repeat this with and without 5-2/5-3 (it still looks long)
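The idea of summarizing bootstrap replicates and applying the >80% threshold can be sketched as follows. This is not RAxML code or output. The replicate trees are invented, written as nested tuples, and treated as rooted for simplicity.

```python
from collections import Counter


def collect_clades(tree, found):
    """Add the leaf set of every internal node of `tree` to `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    members = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(members)
    return members


def clade_support(replicates):
    """Return the fraction of replicate trees in which each clade appears."""
    counts = Counter()
    for tree in replicates:
        found = set()
        collect_clades(tree, found)
        counts.update(found)
    return {clade: n / len(replicates) for clade, n in counts.items()}


# Ten hypothetical bootstrap trees: nine agree, one differs.
common_tree = ((("A", "B"), "C"), "D")
odd_tree = ((("A", "C"), "B"), "D")
replicates = [common_tree] * 9 + [odd_tree]

support = clade_support(replicates)
well_supported = {clade for clade, value in support.items() if value > 0.8}
print(sorted(sorted(clade) for clade in well_supported))
```

Only the groupings that pass the threshold would be reported as supported. The A+C grouping, found in a single replicate, is dropped.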

The cluster you have access to is fine for the calculation. It will take around 24 hours to complete one bootstrap calculation for one data set, and obviously you need to parallelise your calculation across the 22 contigs.

Viewing the tree, FigTree (for Mac OSX) is easy.

Rooting can be complicated because I don't really know your bacteria.

The recombination issue is more complicated, but I would construct 22 trees from your contigs and assess them for congruence. Panmixia is a concern, which means too much recombination.

How to Read a Phylogenetic Tree

It has been over 50 years since Willi Hennig proposed a new method for determining genealogical relationships among species, which he called phylogenetic systematics. Many people, however, still approach the method warily, worried that they will have to grapple with an overwhelming number of new terms and concepts. In fact, reading and understanding phylogenetic trees is really not difficult at all. You only need to learn three new words, autapomorphy, synapomorphy, and plesiomorphy. All of the other concepts (e.g., ancestors, monophyletic groups, paraphyletic groups) are familiar ones that were already part of Darwinian evolution before Hennig arrived on the scene.

Dan Brooks and I teach a biodiversity course (EEB 265) to second year students at the University of Toronto. The entire course is structured around a phylogenetic framework. We begin with the big, albeit simplified, tree of the Metazoa, then work our way from sponges to snakes, focusing on the characters that bind groups together and the characters that make each group unique. If we are doing our job correctly, our students should be able to answer the following questions: What is this animal (how do you know)? What does it do? What makes it special? What aspects of its biology make it vulnerable to anthropogenic intervention? Since all of the students had already taken a lab in first year biology covering the fundamentals of phylogenetics, we assumed that we wouldn’t need to review phylogenetic methodology in our biodiversity course. It didn’t take long for us to realize that our assumption was naïve: by the time many of the students had arrived in EEB 265, they had already hit the delete button next to “phylogenetics” in their brain. It is always humbling to (re)discover that not everyone shares your views about the things in life that are interesting and important!

Back to the drawing board. One of the major problems with teaching a course about metazoan diversity is that you simply don’t have enough time to cover all of the groups. The last thing we wanted to do was to sacrifice biology-based lectures for a discussion about theory. So, the challenge was simple: design a lecture that would, in 50 minutes, teach students how to understand what a phylogenetic tree was telling them. It wasn’t our intention to teach students how to make trees, just how to read them. This paper is based on that lecture.

The word “phylogeny” is a combination of two Greek words, phyle (tribe; in particular, the largest political subdivision in the ancient Athenian state; another word we get from this is “phylum”) and geneia (origin; another word we get from this is “gene”). It was coined by the developmental biologist Ernst Haeckel in 1866 and then championed by Darwin in his famous work, On the Origin of Species (beginning with the 5th edition in 1869). Both biologists tied the idea of “phylogeny” (the origin of groups) to evolution. Phylogenetic trees are thus simply diagrams that depict the origin and evolution of groups of organisms.

Although you might not know it, we are all familiar with the idea of phylogenetic trees. People have been making such trees for decades, substituting the word, “family” for “phylogenetic” (Fig. 1). Just as individual people in a family over generations are connected by bonds of “blood” (the process of reproduction that produces offspring), individual species are connected by evolutionary ties (biological processes like natural selection and geological processes such as continental drift or a river changing course that produce species). In this sense, speciation (the production of new species) = reproduction (the production of new individuals). In other words, we are all, from members of the same family to members of the same species, connected by genes.

Family tree for an interesting group of people. In phylogenetic terms, family trees (genealogies of people) = phylogenetic trees (genealogies of species)

Family trees tend to be drawn as if they were hanging upside down, like a cluster of grapes. Phylogenetic trees are depicted somewhat differently. Imagine that you are holding the family tree for the big cats shown in Fig. 2a. Now, flip it sideways (rotate 90° counterclockwise) and you have the image shown in 2b. Rotate this image yet another 90° counterclockwise, smooth it out, and you have the image shown in Fig. 2c (this tree shape was the one used by Darwin in On the Origin of Species). The important thing to remember is that all three depictions are saying exactly the same thing about the relationships among species of big cats. How you choose to draw your phylogenetic trees depends, in part, on personal preference—some people find it easier to read 2b, others prefer 2c.

a–c So many ways to draw a family/phylogenetic tree for the genus Panthera

Phylogenetic trees are reconstructed by a method called “phylogenetic systematics” (Fig. 3). This method clusters groups of organisms together based upon shared, unique characters called synapomorphies. For example, you share the presence of a backbone with cats, but not with butterflies. The presence of a backbone thus allows us to hypothesize that human beings are more closely related to cats than they are to butterflies (Fig. 4a): cats and people both have a backbone; butterflies are spineless. Not all characters are synapomorphies. Some traits, called plesiomorphies, are shared by all the members of a group. Returning to our tree, we see that cats, people, and butterflies all have DNA (Fig. 4b). The presence of DNA allows us to hypothesize that these three species are all part of the same group, but it does not tell us anything about how those species are related to one another. Think of it this way: my last name tells me that I am part of the McLennan clan. If I meet someone called Jessie McLennan, I know we are related somehow, but I haven’t any idea whether she is a long lost cousin or someone from a more distant branch of the family tree. The final term you need to know is autapomorphy: a trait that is found in only one member of the group. For example, butterflies can be distinguished from cats and people because they have an exoskeleton made out of chitin (a tough, waterproof derivative of glucose). Autapomorphies help us identify a particular species in a group but, like plesiomorphies, they tell us nothing about relationships within the group. Overall these three types of characters can be likened to the story of Goldilocks: plesiomorphies are too hot (too widespread), autapomorphies are too cold (too restricted), and synapomorphies are just right (for determining phylogenetic relationships).
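The three character types can be told apart simply by counting who has the trait. The Python sketch below applies the simplified definitions from the paragraph above to the cat, human, and butterfly example. The function name and return labels are illustrative choices, not standard terminology from any library.

```python
def classify_character(carriers, group):
    """Classify a trait by how many members of `group` possess it."""
    carriers = set(carriers) & set(group)
    if not carriers:
        return "absent"
    if len(carriers) == len(group):
        return "plesiomorphy"  # too widespread to resolve relationships
    if len(carriers) == 1:
        return "autapomorphy"  # too restricted to resolve relationships
    return "synapomorphy"  # shared by some members only: informative


group = ["cat", "human", "butterfly"]
traits = {
    "backbone": ["cat", "human"],
    "DNA": ["cat", "human", "butterfly"],
    "chitin exoskeleton": ["butterfly"],
}

labels = {trait: classify_character(who, group) for trait, who in traits.items()}
print(labels)
```

Note that the label depends on the group being examined: a backbone is a synapomorphy here, but it would count as a plesiomorphy in a group made up only of vertebrates.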

The basis of phylogenetic systematics

Identifying types of characters on a phylogenetic tree. a a synapomorphy b a plesiomorphy c an autapomorphy

Enough of characters for the moment; back to the trees themselves. Why do the branches on a tree have names (e.g., lion, tiger, etc.), while the lines joining different branches together do not (Fig. 5)? This is because these lines represent ancestors. An ancestor is a species that has undergone a speciation event to produce descendant species. The ancestor usually “disappears” in the process of speciation. Does this mean that the ancestor goes extinct?

Finding ancestors on a phylogenetic tree

In order to answer this, we must do some time traveling carrying a digital device that records everything we see (Fig. 6). Imagine you travel back 10,000,000 years, then stop, intrigued by an interesting species of lizard with red spots all over its back (species A). After a while, you decide to move forward in time five million years or so then stop again. You search around and discover two new lizard species, one with blue spots on its back (species B), and the other with red stripes (species C), but species A is nowhere to be seen. Did it go extinct? You look back over your digital recording of those five million years and discover that species A split into two groups, which became different in some ways from one another through time. In evolutionary terms, species A is an ancestor (ancestor 1) and species B and C are its descendants. Fast forward to today (with more digital material to watch) and you find three species of lizard: your old friend the blue spotted lizard (species B) and two new lizards (descendants of species C, the red striped lizard), one with blue stripes (species D) and the other with a solid black back (species E). Today, then, there are only three species of lizard alive. You no longer see either of the ancestors (the red spotted and red striped lizards), but we still show them on the phylogenetic tree.

Traveling back in time to discover ancestors

The answer to our original question “did the ancestor go extinct?” is thus No! In many cases, the ancestor is subdivided and the biological (genetic) information encompassed within the ancestor is passed on to the descendant species. Over time, the descendants change and become different in some ways from each other and from the ancestor, while retaining some things in common (for example, all of our lizard species have a backbone). This is evolution.

So what really counts as extinction? Extinction is the loss of biological information—the physical loss of a species. For example, consider a simplified phylogenetic tree of the dinosaurs (Fig. 7). All of the groups on dotted branches are extinct—none of the species in those groups exist on this planet anymore (Jurassic Park notwithstanding), which means that all of the information that was unique to each of those groups has been lost. The only group that managed to avoid extinction was Aves (or birds)—avian species are the last remaining dinosaurs.

Actual extinctions. Groups depicted with dotted lines are extinct so all of the genetic, morphological, physiological, ecological, and behavioral traits that are unique to each group have been lost to the biosphere

OK, let’s take what we have learned about ancestors and clustering groups based on shared, unique characters (synapomorphies) and use that to decipher the information contained within a phylogenetic tree. Here is a tree depicting the relationships among living members of the Amniota, a large group of vertebrates that includes most of the animals with which you are familiar (Fig. 8). You already know that the names of species, or groups of species, are written across the tips of the branches on the tree. The next thing you need to know is that characters are depicted at their point of origin on a phylogenetic tree. So, on this tree you can see that (1) the amniotic egg originated in ancestor 1 and was passed on to all of its descendants (mammals, ancestor 2, turtles, ancestor 3, ancestor 4, crocodiles, birds, ancestor 5, tuataras, and lizards plus snakes). In evolutionary terms, the amniotic egg is a unique trait that is shared only by ancestor 1 and all of its descendants; (2) a special type of skin protein (β keratin) originated in ancestor 2 and was passed on to all of its descendants (turtles, ancestor 3, ancestor 4, crocodiles, birds, ancestor 5, tuataras, and lizards plus snakes). β keratin is a unique trait shared by the group called “Reptilia”; and (3) a breakable tail originated in ancestor 5 and was passed on to all of its descendants (tuataras, lizards plus snakes). A breakable tail is a unique trait shared by members of the group tuataras + lizards + snakes.
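Reading a character off a tree amounts to finding where it originated and then collecting every tip below that point. The sketch below encodes the amniote tree as described in the text. The placement of ancestors 4 and 5 directly under ancestor 3 is inferred from the order in which the descendants are listed, so treat that part as an assumption.

```python
# Each ancestor maps to its two immediate descendants (assumed topology).
AMNIOTA = {
    "ancestor 1": ["mammals", "ancestor 2"],
    "ancestor 2": ["turtles", "ancestor 3"],
    "ancestor 3": ["ancestor 4", "ancestor 5"],
    "ancestor 4": ["crocodiles", "birds"],
    "ancestor 5": ["tuataras", "lizards plus snakes"],
}

# Point of origin of each character, as stated in the text.
ORIGINS = {
    "amniotic egg": "ancestor 1",
    "beta keratin": "ancestor 2",
    "breakable tail": "ancestor 5",
}


def tips_below(node):
    """Return the living groups (tips) descended from `node`."""
    if node not in AMNIOTA:
        return [node]
    return [tip for child in AMNIOTA[node] for tip in tips_below(child)]


def carriers(trait):
    """Return the living groups expected to have inherited `trait`."""
    return tips_below(ORIGINS[trait])


for trait in ORIGINS:
    print(trait, "->", carriers(trait))
```

The output lists all six living groups for the amniotic egg, every group except mammals for β keratin, and only tuataras and lizards plus snakes for the breakable tail.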

How to read characters on a phylogenetic tree

In fact, every organism is a complex mosaic of thousands of traits. If you don’t believe this, sit down and list all of the traits that make you, you. In addition to the obvious things like eye color and hair color, don’t forget the fact that you have RNA, DNA, individual cells, an anterior and posterior end, a skull, jaws, bone, arms and legs, come from an amniotic egg, have three bones in your middle ear, were suckled on milk produced in mammary glands, have an opposable thumb, and no tail. In other words, when you look at a phylogenetic tree, you will see that all of the branches have at least one, and more likely many, characters on them (the slash marks on Fig. 9a). Because of this, it is often difficult to actually label all of the traits on a tree because it’s visually distracting. A shorthand method has been developed to deal with this problem: draw the tree showing the relationships among the groups (Fig. 9b) and list the synapomorphies for each branch elsewhere in a table. On the other hand, if you are interested in one or more particular traits, you can highlight them on the phylogenetic tree without showing all the other characters. For example, if you wanted to discuss the evolution of mammals, you could show the amniote tree and highlight just the synapomorphies for the mammals (e.g., three middle ear bones: Fig. 9c). Remember, this is just shorthand!

a–c Representing characters on a phylogenetic tree

There is one last thing about characters that is important to understand: characters are not static things. They evolve through time. In other words, a “synapomorphy” may not “look the same” in all species that have it. So, for example, consider the stapes, one of the three bones in your middle ear that are responsible for transferring sound waves from the eardrum to the membrane of the inner ear. This small bone has a long, complicated, and fascinating evolutionary history. To understand that history, we must travel back many hundreds of millions of years to the origin of the Deuterostomes, a large group that includes the Echinodermata (starfish and their relatives), Hemichordata (worm-like, marine creatures), and Chordata (amphioxus + tunicates + Craniata [organisms with skulls]). The ancestor of this large group had numerous slits in its pharynx (called visceral arches) that were involved with filter feeding. Time passed and cartilaginous rods providing support for the arches appeared, were subdivided and modified. The upper section of the second visceral arch rod is the focus of our tale (Fig. 10). As we move forward still further in time, this character undergoes various structural and positional modifications: in essence, it becomes larger, more robust, and involved in supporting the jaws (at which point it is called the hyomandibula), changes from cartilage to bone, then begins a gradual reduction in size, disengages from the jaw/cheek area, and moves into the middle ear (at which point it is called the stapes). Overall, then, the sequence upper portion of the 2nd visceral arch → hyomandibula → stapes traces a single structure whose shape and function have both been modified over hundreds of millions of years. So although the presence of a “cartilaginous rod in the 2nd visceral arch found in the throat region” may be a synapomorphy for the Craniata, you won’t find that exact structure in any four-footed animals. 
Instead, what you will find is the modification of that cartilaginous rod, the stapes. The continued evolution of a particular character past its point of origin is called an evolutionary transformation series.

Synapomorphies are not static; they may continue to evolve. Changes in the character “upper portion of the second visceral arch” [hyomandibula, stapes] are traced on the phylogenetic tree for the Chordata (animals with notochords). Both the story and the phylogenetic tree have been substantially simplified to emphasize the idea of character origin and modification rather than the finer details of character evolution. Names in italics refer to extinct species known from fossils. Line drawings and photographs of various structures and species can be found easily on the web

The next thing that students of phylogenetics have to know is how to recognize different kinds of groups of organisms. There are two general types of groups, one “good” and the other “bad”.

Let’s begin with “the good,” a monophyletic group (Fig. 11). The word “monophyletic” is a combination of two Greek words, monos (single) and phyle (tribe). It was coined by our old friend Ernst Haeckel, who, as you remember, also invented the word phylogeny. A monophyletic group includes an ancestor and all of its descendants. It is identified by the presence of shared, unique characters (synapomorphies). Each phylogenetic tree contains as many monophyletic groups as there are ancestors. For example, looking at the tree in Fig. 11, we can identify five monophyletic groups, only two of which are shown on Fig. 12 (I’ll leave it up to you to discover the other three).

Identifying monophyletic groups

Two of the five monophyletic groups on the hypothetical tree

Now onto “the bad.” The word “paraphyletic” is, once again, a combination of two Greek words, para (near) and phyle (tribe), so the implication is that the whole tribe is not present (Fig. 13). Paraphyletic groups include an ancestor but not all of its descendants. On this hypothetical tree, species C has been eliminated from the group, even though it is a descendant of ancestor 1 just like the rest of the species. Paraphyletic groups are problematic because they mislead us about how characters evolve and how species are related to one another. For example, let’s consider the big tree for the Amniota and highlight the “old” Reptilia, one of the most famous paraphyletic groups (Fig. 14). Even today people still speak about three distinct classes, the reptiles, the birds, and the mammals. When you look at this figure, what is wrong about the class Reptilia, the way it is drawn?

Identifying paraphyletic groups

The most famous paraphyletic group, the reptiles

Right! In Fig. 15, ancestor 2 is the ancestor of all the reptiles but, as highlighted in this figure, the Reptilia does not include all of ancestor 2’s descendants: ancestor 4 and the birds have been removed from the group. The only way to make the Reptilia a monophyletic group is to redefine the term to include crocodiles, turtles, tuataras, lizards, snakes, and birds. In the past, birds were not considered to be reptiles because they are warm-blooded (in fact, they were often grouped with mammals because of that trait). But phylogenetic studies have demonstrated that birds are indeed reptiles because they share many morphological, behavioral, and molecular characters with other reptilian species in general (synapomorphies originating in ancestor 2, e.g., β keratin), and they share many characters with crocodiles in particular (synapomorphies originating in ancestor 4, e.g., holes in the skull just in front of the eyes).

How to make the Reptilia monophyletic
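The reptile example can be tested in a few lines. This is a minimal sketch under stated assumptions: the amniote tree below is a deliberately simplified, assumed topology (not a copy of Fig. 15), and the helper names tips and is_monophyletic are hypothetical.

```python
# Simplified, assumed amniote tree as nested tuples; not a copy of Fig. 15.
amniotes = ("mammals",
            ("turtles",
             (("tuataras", ("lizards", "snakes")),
              ("crocodiles", "birds"))))


def tips(node):
    """Set of tip names at or below a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= tips(child)
    return found


def is_monophyletic(node, group):
    """True if some node in the tree has exactly `group` as its set of tips."""
    group = set(group)
    if tips(node) == group:
        return True
    if isinstance(node, str):
        return False
    return any(is_monophyletic(child, group) for child in node)


old_reptilia = {"turtles", "tuataras", "lizards", "snakes", "crocodiles"}
new_reptilia = old_reptilia | {"birds"}
print(is_monophyletic(amniotes, old_reptilia))  # False: birds are left out
print(is_monophyletic(amniotes, new_reptilia))  # True: an ancestor plus all descendants
```

On this assumed topology the "old" Reptilia fails the test because no ancestor has exactly those tips below it, while the redefined group that includes birds passes.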

Why is it important to have monophyletic groups? Say you wanted to figure out how red hair appeared in your family. What would be your chances of tracking down your original red-haired ancestor if no records were kept about the union between your great-great-great-great grandfather Sven and his Irish bride Maggie? Missing information creates problems for any research, be it genealogical or evolutionary, and paraphyletic groups are missing information. In evolutionary terms, monophyletic groups are “real” biological units; that is, they are the product of descent with modification (an ancestor and all of its descendants) and as such can be used to study the evolutionary processes that produced them. Paraphyletic groups, on the other hand, are the product of “human error” arising from incomplete or flawed information (e.g., poor descriptions of characters). Using such groups to study evolutionary processes will direct us along misleading and confusing pathways.

Why do we use phylogenetic trees? There are many ways to answer this question (and many papers/books written about it), but the most general answer is that trees summarize valuable information about the evolution of organisms that allows us to understand them better. For example, here’s the family tree for the Hominoidea, the group that includes us and all of our closest relatives (Fig. 16). When you look at the distribution of characters on this tree you can see that a number of traits we associate only with human beings, such as hunting, infanticide, tool making, self-awareness, and language, originated long before Homo sapiens. In other words, human beings are not as unique as you might think. If we want to understand how and why those traits evolved, we must study their expression and function in ourselves and in our relatives. So much information from just one phylogenetic tree!

Phylogenetic Trees Tutorial

Investigate the evolutionary origins of HIV

Note: To complete the tutorial with the referenced data please download the tutorial above and install in Geneious Prime.

In this tutorial, you will use Geneious Prime to investigate the evolutionary origins of human immunodeficiency viruses (HIVs) using molecular phylogenetic tools. You will learn how to align sequences and build a phylogenetic tree, as well as how to view and manipulate the tree to answer questions on the origins of HIV-1.

Introduction: Human and Simian Immunodeficiency Viruses

HIVs, the causes of acquired immune deficiency syndrome (AIDS), are closely related to simian (monkey and ape) immunodeficiency viruses (SIVs). These and other similar viruses are retroviruses. Retroviruses are characterised by their RNA genomes, which once inside a host cell, are reverse transcribed into DNA and then integrated into the host cell’s genome. The integrated viral genome is known as a provirus. You will be working with proviral DNA sequences.

The origins of HIVs were mysterious when these viruses were first discovered in the early 1980s. There are two types of HIVs. HIV type 1 (HIV-1) is more widespread and causes more severe disease than HIV type 2 (HIV-2). HIV-1 is also far more diverse than HIV-2. HIV-1 is classified into three major groups: M, N, and O. The viruses causing the AIDS pandemic (widespread epidemic) belong to Group M. Group M is subdivided into several subtypes. You will be analysing sequences from HIV-1 Group M Subtypes A, B, C, D, F, G, H, J, K. The HIV-1 viruses infecting people in North America, Europe and Australia are mostly from Group M Subtype B. All groups and subtypes of HIV-1 and HIV-2 are found in Africa.

Both HIV-1 and HIV-2 are closely related to SIVs found in a variety of African primate species. This led researchers early on to hypothesise that HIVs had jumped to humans from one or more African primate species. It was suggested that close contact between humans and monkeys that were kept as pets or hunted for food had allowed the SIVs to jump hosts.

More information on HIV can be found on this Wikipedia page.

In this tutorial you will use molecular phylogenetics to determine the evolutionary relationships of HIVs and SIVs, and so determine from which African primates HIVs originated. In Exercise 1 you will build an alignment of the HIV and SIV sequences, then in Exercise 2 you will learn to build a basic phylogenetic tree. Exercises 3 and 4 provide questions and answers to further your understanding on interpreting phylogenetic trees.

SIV sequences and primate taxa

The sequences in this tutorial come from various African primate species known to be infected with different SIVs. There are also three non-African species, all from Asia, that have been infected with SIVs in captivity: the pig-tailed macaque, the rhesus macaque and the stump-tailed macaque. The SIVs from all of these primate species are referred to by the three-letter code given with each picture. For example, the SIV from the sooty mangabey is called SIVSMM and the sequence in the alignment or tree is labelled SIV-SMM.

Mona monkey
Cercopithecus mona mona [denti]

de Brazza’s monkey
Cercopithecus neglectus

Tantalus monkey
Chlorocebus tantalus

Sykes’ monkey
Cercopithecus albogularis

Greater spot-nosed monkey
Cercopithecus nictitans

Green monkey
Chlorocebus sabaeus

Mustached guenon
Cercopithecus cephus

Vervet monkey
Chlorocebus pygerythrus

Grivet monkey
Chlorocebus aethiops

L’Hoest’s monkey
Cercopithecus lhoesti

Sooty mangabey
Cercocebus atys

Red-capped mangabey
Cercocebus torquatus

Sun-tailed monkey
Cercopithecus solatus

Mandrill
Mandrillus sphinx

Drill
Mandrillus leucophaeus

Pig-tailed macaque
Macaca nemestrina

Stump-tailed macaque
Macaca arctoides

Rhesus macaque
Macaca mulatta

Common chimpanzee
Pan troglodytes

Exercise 1: Multiple alignment of HIV and SIV sequences

Before a phylogeny can be constructed, the sequences must be aligned. The objective of sequence alignment is to maximize the similarity between sequences, inserting gaps in sequences where necessary to improve the overall alignment.

Multiple alignment algorithms use a scoring system where sequence matches and mismatches for each site are assigned a value, and gaps are penalized. The insertion of gaps in an alignment can increase the similarity of the surrounding bases, so the overall alignment score is a trade-off between the increased match/mismatches scores and the cost of opening and extending a gap.
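The trade-off between matches and gap penalties can be illustrated with a small scoring function. This is a sketch only: the function name alignment_score and the score values (match, mismatch, gap_open, gap_extend) are arbitrary assumptions chosen for the example and are not the values used by MUSCLE or any other aligner in Geneious.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # are we inside a run of gaps in row a / row b?
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of gaps only carries no pairwise information
        if x == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# The same two sequences, aligned without and with an internal gap.
print(alignment_score("ACGTACGTACGT", "ACGTCGTACGT-"))  # -8: gap pushed to the end
print(alignment_score("ACGTACGTACGT", "ACGT-CGTACGT"))  # 6: one internal gap
```

With these assumed values, paying the gap-opening cost once in the right place restores eleven matching columns, so the gapped alignment scores higher than the one that pushes the gap to the end.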

In this exercise you will construct an alignment of 62 env sequences of HIV-1, HIV-2, and various SIVs. The SIV sequences come from various African and non-African primate species.

The env gene is found in all retroviruses. It codes for two viral envelope glycoproteins that are positioned on the virion surface and interact with host cell-surface receptors.

Click on ‘HIV_sequences’ to view the sequences.

The sequences are labelled in the following format: virus type, followed by the common name of the primate species (for the SIV sequences) or the group or subtype (for HIV-1 and HIV-2 sequences), and finally the accession number.

To align these sequences, go to Align/Assemble -> Multiple Align. Geneious has 3 different alignment programs built in (Geneious aligner, MUSCLE, and Clustal Omega), plus a plugin for the MAFFT aligner is available. For further information on these aligners please see this article. We will use the MUSCLE aligner for this example, as it is suitable for a medium sized dataset.

Select MUSCLE alignment from the alignment options. We will use the default parameters, so click on the settings cog in the bottom left of the window and choose Reset to defaults (if it is greyed out, the default parameters are already set). Click the More Options button to view the parameters if you wish. Click OK to start the alignment – it may take several minutes to complete.

Once the alignment has completed, click on it to view it and zoom in to see the bases. Note that there are many large gaps, which is characteristic of an alignment of a rapidly evolving gene in divergent species.

Exercise 2: Build a Phylogeny of HIVs and SIVs

In this exercise you will construct a phylogeny using the Neighbour-Joining tree building method and the Tamura-Nei model. Models of evolution describe expected frequencies of each nucleotide and the rate of change between nucleotides. The Tamura-Nei model assumes each base has a different equilibrium frequency and allows transitions and transversions to occur at different rates. It allows the two types of transitions (A ↔ G and C ↔ T) to have different rates. This is useful when analysing HIV sequences because HIV exhibits hyper G-to-A mutation caused by a host enzyme (APOBEC3G). You will use the Neighbour-Joining method because these sequences do not, in general, evolve in a clock-like manner.
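To make the three substitution classes concrete, the sketch below counts purine transitions (A ↔ G), pyrimidine transitions (C ↔ T) and transversions between two aligned sequences. It only does the counting; it is not the Tamura-Nei distance formula and not what Geneious runs internally. The function name and the two toy sequences are assumptions made for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def substitution_counts(seq1, seq2):
    """Count the three substitution classes between two aligned sequences,
    skipping columns that contain a gap or an ambiguous base."""
    counts = {"sites": 0, "ts_purine": 0, "ts_pyrimidine": 0, "transversion": 0}
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in BASES or y not in BASES:
            continue
        counts["sites"] += 1
        if x == y:
            continue
        if x in PURINES and y in PURINES:
            counts["ts_purine"] += 1       # A <-> G
        elif x in PYRIMIDINES and y in PYRIMIDINES:
            counts["ts_pyrimidine"] += 1   # C <-> T
        else:
            counts["transversion"] += 1    # purine <-> pyrimidine
    return counts


print(substitution_counts("AGCTAGCT-A", "GGTTACCTAA"))
```

A model that lets A ↔ G and C ↔ T changes have their own rates effectively treats these three counts separately instead of lumping all differences together.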

Select the alignment you created in Exercise 1.

To construct a Neighbour-Joining tree using the Tamura-Nei model, with bootstrapping, click the Tree button and select the Geneious Tree Builder. Check that the default parameters are initially set by clicking Reset to Defaults.

For the genetic distance model select Tamura-Nei and for the tree build method select Neighbor-Joining. Set the outgroup to “SIV-MON Mona monkey AY340701”. This sequence will be used to root the tree.

To calculate support values for the tree use bootstrapping. To do this, tick the box next to Resample tree and select Bootstrap in the dropdown box next to resampling method. Set number of replicates to 100 and the support threshold to 0.

The tree building options should now look similar to this:

Click OK to build the tree.

Once the tree builder completes, the tree document will appear in the document table in Geneious and should open automatically.

Viewing and Manipulating Phylogenetic Trees

A phylogenetic tree is a branching diagram of evolutionary relationships. It contains information about the order of evolutionary divergences within, and hence about the relationships among, a group of organisms. It can also contain information about the amount of evolutionary change which occurred between any two branching events. The lines on the tree are called branches and the intersections of these lines are called nodes. A node represents a branching event in the tree. The branching pattern of a tree is called its topology. The topology shows how organisms are related to one another.

Depending on the size of your screen and the size of the tree, it may not be physically possible to display all of the sequence names on the tree, so Geneious will only display some of the sequence names. To zoom in on the tree, use the Zoom slider under “General” in the panel on the right hand side of the tree view. To expand the distance between the branches of the tree, use the Expansion slider. As the amount of space between the branches increases, more sequence names will be displayed on the tree.

As this tree was created using an alignment in Geneious, the alignment is attached to the tree. Click on the “Alignment View” tab to view the alignment.

The sequences in the alignment are sorted according to the topology of the tree. On the left hand side of the sequence names, you can see the tree topology (this may not be visible if you are working with large trees). Select the “SIV-MON Mona monkey AY340701” sequence in the alignment then return to the “Tree View”. This sequence is now selected in the tree as well.

The sequences used to build this alignment and tree have additional meta-data associated with them (this is the data found in the “Properties” field in the “Info” tab in the individual sequence documents). This information can be displayed on the tips of the trees. To display the organism on the tips of the tree, select “Organism” from the box next to “Display” under “Show Tip Labels”.

To display the organism and host organism, hold Ctrl (on Windows) or Cmd (on Macs) and select “Organism” and “Host Organism”. Now the host organism and organism are displayed on the tips of the tree, separated by a comma. To display the sequence names on the tree, select “Names”.

Just as a sentence can be printed using different fonts, or colors of ink, without any change in meaning, so too can trees be represented in different shapes and orientations. The information encoded in the tree remains unchanged, even as the appearance changes. For example, the appearance of the tree can be changed by rotating groups of branches. To rotate the branches, select an internal node in the tree and click the Swap Siblings button at the top of the window. This will rotate the branches in that subtree; however, the degree of relatedness is not altered by rotating branches in a tree. Simply having two names close together in a tree does not imply any close relationship.

Try this with the tree you have created. Select the node in the tree containing the Grivet monkey and the four Vervet monkeys and click the Swap Siblings button.

The order of these samples will change in the tree, but the relationship between the sample from the Grivet monkey and those from the four Vervet monkeys has not changed.

Rooted Trees

Trees may be unrooted or rooted. To view the HIV tree as an unrooted tree, click one of the unrooted views under the “General” options in the panel on the right hand side of the tree view.

Unrooted trees show which taxa are close to one another, but they do not tell us the direction of evolution: we cannot tell which node is the ancestor and which are the descendent nodes on the tree. To establish ancestor-descendent relationships we need to identify a suitable outgroup and then root the tree on the branch separating the outgroup from the remainder of the tree (the ingroup). We can specify the root before building the tree to produce a rooted tree, or we can specify the root after the tree is built to change an unrooted tree to a rooted tree.

When you built the tree of HIV and SIV sequences you specified an outgroup (“SIV-MON Mona monkey AY340701”) so Geneious has produced a rooted tree. To view the tree as a rooted tree, click the rooted view under the “General” options in the panel on the right hand side of the tree view.

Rooted phylogenetic trees may be oriented horizontally, as above, or vertically. Here the time axis is implicit, running from left to right. The node at the left end of the tree is the root node, which represents the oldest point on the tree. As we move from the root node, we can identify nodes which are ancestral to their descendent clades. Working in from the tips of the tree enables us to identify close and distant relatives. The degree of relatedness of any two organisms is given by how far back on a rooted tree you must go to find their common ancestor. If, in tracing back to the common ancestor of A and B, you pass the common ancestor of A and C, then you can say that A and C are more closely related than A and B.
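The "how far back must you go" rule can be sketched in code by locating the most recent common ancestor (MRCA) of two tips. The four-tip tree below is hypothetical, and the helper names path_to and mrca_depth are assumptions for this illustration. Depth is counted here in branching events from the root, so it reflects topology only, not time or branch length.

```python
# Hypothetical rooted tree as nested tuples; tips are strings.
tree = ((("A", "C"), "B"), "D")


def path_to(node, tip, path=()):
    """Return the child indices leading from the root to `tip`, or None if absent."""
    if isinstance(node, str):
        return path if node == tip else None
    for index, child in enumerate(node):
        found = path_to(child, tip, path + (index,))
        if found is not None:
            return found
    return None


def mrca_depth(tree, x, y):
    """Number of branching events between the root and the MRCA of x and y."""
    depth = 0
    for i, j in zip(path_to(tree, x), path_to(tree, y)):
        if i != j:
            break
        depth += 1
    return depth


print(mrca_depth(tree, "A", "C"))  # 2
print(mrca_depth(tree, "A", "B"))  # 1
print(mrca_depth(tree, "A", "D"))  # 0: their common ancestor is the root itself
```

In this invented tree the MRCA of A and C lies further from the root than the MRCA of A and B, which is the sense in which A and C are more closely related than A and B.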

On a rooted tree, each node and all of its descendent nodes form a clade. This is what we would commonly refer to as a “branch” on a real tree – the physical branch and all the little branches and leaves attached to it. Because an unrooted tree lacks the time axis described above, it is inappropriate to discuss clades in that context.

Phylograms and cladograms

The lengths of the branches of a tree may be arbitrary (e.g., in a cladogram) or can represent the amount of evolutionary change (as in a phylogram).

In a phylogram, the lengths of the branches are proportional to the amount of change which occurred between those branching events. As the tree you built was estimated using a distance (1 – similarity) measure (i.e. NJ), the proximity of nodes represents their overall degree of similarity.

To display the lengths of the branches of the tree, in the panel on the right hand side of the tree view, select “Substitutions per site” from the dropdown box next to “Display” under “Show Branch Labels”.

On your tree, find “SIV-MAC Rhesus macaque M33262” and “SIV-MNE Pig-tailed macaque U79412” and look at the length of the branches separating these two taxa. Now find “SIV-RCM Red-capped mangabey AF382829” and “SIV-RCM Red-capped mangabey AF349680” and look at the length of these branches. The length of the branches separating the SIV-MAC and SIV-MNE sequences is shorter than the length of the branches separating the two SIV-RCM sequences. From this you can conclude that SIV-MAC is more similar to SIV-MNE, than the two SIV-RCM sequences are to each other.
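Comparing "the length of the branches separating two taxa" amounts to summing branch lengths along the path between two tips. The sketch below shows that sum on a tiny tree. All labels and branch lengths are invented for illustration and are not values from the tutorial data; the parent-map representation and the function names are also assumptions.

```python
# child -> (parent, branch length in substitutions per site).
# Every name and number here is made up for illustration.
parent = {
    "mac": ("n1", 0.02),
    "mne": ("n1", 0.03),
    "rcm_a": ("n2", 0.10),
    "rcm_b": ("n2", 0.12),
    "n1": ("root", 0.20),
    "n2": ("root", 0.15),
}


def distances_to_ancestors(tip):
    """Map every node on the path from `tip` to the root to its distance from `tip`."""
    distance, node, result = 0.0, tip, {tip: 0.0}
    while node in parent:
        node, length = parent[node]
        distance += length
        result[node] = distance
    return result


def path_length(x, y):
    """Sum of branch lengths on the path between tips x and y."""
    dx, dy = distances_to_ancestors(x), distances_to_ancestors(y)
    # The shared ancestor giving the smallest total is the most recent one.
    return min(dx[n] + dy[n] for n in dx if n in dy)


print(round(path_length("mac", "mne"), 3))      # 0.05
print(round(path_length("rcm_a", "rcm_b"), 3))  # 0.22
```

With these invented lengths the first pair is separated by a shorter path than the second pair, which is the kind of comparison the exercise asks you to make by eye.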

If an optimality method (e.g., MP or ML) was used to estimate the tree then the proximity of two nodes reflects the number of evolutionary changes in character states estimated to have occurred between them. If the total branch length from the root of a tree to organism A at one tip is much greater than from the root to organism B at another tip, then you can say that evolution has been faster in the A lineage than in the B lineage for the characters on which the tree was based.

To transform the tree to a cladogram, tick the Transform branches box in the “Formatting” options. In the dropdown box next to Transform, select Cladogram.

Notice how the branch lengths of the tree change and all of the tips of the tree are aligned on the right hand side of the tree view. With this transformation the lengths of the branches are meaningless. If you now look at “SIV-MAC Rhesus macaque M33262” and “SIV-MNE Pig-tailed macaque U79412” and then look at “SIV-RCM Red-capped mangabey AF349680” and “SIV-RCM Red-capped mangabey AF382829” you can see that the branches separating SIV-MAC from SIV-MNE are the same length as the branches separating the two SIV-RCM sequences. With the transformed branches you cannot draw any conclusions about how similar the sequences are to each other.

To convert the tree back to a phylogram, untick the option Transform branches. To hide the branch lengths, untick the box next to “Show Branch Labels”.

Displaying support values

In addition to the information conveyed by the topology of the tree and the branch lengths of the tree, further information can also be written on the nodes and/or branches of the tree. The information that is available to display will depend on the tree building method and the options used. Often, support values are displayed on the tree.

Tree building methods produce the tree which best explains the information in the alignment; however, it is unlikely this tree will explain all of the variation in the alignment. Not all of the sites in the alignment will support this tree and not all of the clades in the tree will necessarily be strongly supported by the alignment. For example, with rapid speciation events, there may be insufficient information in the alignment to determine the branching pattern of a group of species, and some of the clades in the tree may have only marginally more support than alternative possible clades.

If you look at the tree you have built it is difficult to tell which clades are strongly supported and which are not. For example, does the clade containing “SIV-RCM Red-capped mangabey AF382829” and “SIV-RCM Red-capped mangabey AF349680” have the same support from the alignment as the clade containing “SIV-MND Mandrill AY159322” and “SIV-MND Mandrill AF367411”?

To find out how strongly the alignment supports each of the clades in the tree, we can calculate support values. In the tree building options you selected the “Bootstrap” resampling method. The bootstrap statistic for a clade in the tree is the percentage of times that clade appeared in the set of bootstrap replicate trees. This percentage ranges from 0% (the clade did not appear in any of the bootstrap trees) to 100% (the clade appeared in all of the bootstrap trees). A bootstrap replicate tree is generated by randomly sampling sites, with replacement, from the alignment, to create a new randomised alignment and then building a tree from this sampled alignment. This process is repeated for the specified number of bootstrap replicates (in your case, this was 100).
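The two halves of the bootstrap, resampling sites and counting how often a clade reappears, can be sketched as follows. This is an illustration under stated assumptions, not the Geneious implementation: the function names are hypothetical, the toy alignment is invented, and the tree-building step between resampling and counting is omitted (the replicate clades are simply written out by hand).

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; the alignment length is unchanged."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def clade_support(clade, replicate_clades):
    """Percentage of replicate trees (each given as a set of clades) containing `clade`."""
    hits = sum(1 for clades in replicate_clades if frozenset(clade) in clades)
    return 100.0 * hits / len(replicate_clades)


# Toy alignment, invented for this example.
alignment = {"seq1": "ACGTACGT", "seq2": "ACGTTCGT", "seq3": "GCGTACGA"}
replicate = bootstrap_alignment(alignment, random.Random(42))
print(replicate)

# Clades found in four imaginary replicate trees (tree building itself is omitted).
replicate_clades = [
    {frozenset({"seq1", "seq2"})},
    {frozenset({"seq1", "seq2"})},
    {frozenset({"seq1", "seq3"})},
    {frozenset({"seq1", "seq2"})},
]
print(clade_support({"seq1", "seq2"}, replicate_clades))  # 75.0
```

In a real analysis each resampled alignment would be passed to the tree builder, and the clades of the resulting trees would take the place of the hand-written list.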

To show the bootstrap values on the tree, tick the box next to Show Branch Labels and select Consensus Support (%) from the dropdown box next to “Display”.

The bootstrap value for a clade will appear to the left of the most recent common ancestral node for that clade.

Now that the bootstrap values are displayed on the tree, you can see that there is strong support (100%) for the clade containing the SIV-RCM sequences. However, the clade containing the two mandrill sequences has less support (55%). Note that due to the nature of the bootstrapping process, the support values on your tree may be slightly different.

Sometimes it is useful to collapse nodes that have little bootstrap support so that these do not contribute to the topology of the tree. This can be done in the bootstrapping options when the tree is built by changing the Support threshold value. If this is set to 50%, nodes with bootstrap support of less than 50% will be collapsed into polytomies. The screenshot below shows an example where the nodes with 38% and 36% bootstrap support in (A) are collapsed when the support threshold is set to 50% (B).
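Collapsing weakly supported nodes into polytomies can be sketched as a small recursive function. The tree below is hypothetical: it borrows only the support values 38 and 36 from the example, and the (support, children) representation and the function name collapse are assumptions made for this illustration, not how Geneious stores trees.

```python
# Internal nodes are (support %, [children]); tips are strings.
# The topology is invented; the root is given 100 so it is never collapsed.
tree = (100, [(38, ["A", "B"]), (90, ["C", (36, ["D", "E"])])])


def collapse(node, threshold):
    """Return a copy of the tree in which internal nodes with support below
    `threshold` are merged into their parent, producing polytomies."""
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            new_children.extend(child[1])  # attach the grandchildren directly
        else:
            new_children.append(child)
    return (support, new_children)


print(collapse(tree, 50))  # (100, ['A', 'B', (90, ['C', 'D', 'E'])])
print(collapse(tree, 0) == tree)  # True: a threshold of 0 collapses nothing
```

After collapsing at 50%, the 38% and 36% nodes are gone and their tips hang directly from the nearest well-supported ancestor.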


A speculatively rooted tree for rRNA genes, showing the three life domains Bacteria, Archaea, and Eukaryota, and linking the three branches of living organisms to the LUCA (the black trunk at the bottom of the tree); cf. next graphic.

A rooted phylogenetic tree, illustrating how Eukaryota and Archaea are more closely related to each other than to Bacteria (based on Cavalier-Smith’s theory of bacterial evolution). Neomura is a clade composed of two life domains, Archaea and Eukaryota. LUCA, a variant of LUA, stands for last universal common ancestor.

A phylogenetic tree or evolutionary tree is a branching diagram or “tree” showing the inferred evolutionary relationships among various biological species or other entities—their phylogeny—based upon similarities and differences in their physical or genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor. Phylogenetic trees are central to the field of phylogenetics.

In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of the descendants, and the edge lengths in some trees may be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units, as they cannot be directly observed. Trees are useful in fields of biology such as bioinformatics, systematics, and phylogenetic comparative methods.

Unrooted trees illustrate only the relatedness of the leaf nodes and do not require the ancestral root to be known or inferred.

The idea of a “tree of life” arose from ancient notions of a ladder-like progression from lower to higher forms of life (such as in the Great Chain of Being). Early representations of “branching” phylogenetic trees include a “paleontological chart” showing the geological relationships among plants and animals in the book Elementary Geology, by Edward Hitchcock (first edition: 1840).

Charles Darwin (1859) also produced one of the first illustrations and crucially popularized the notion of an evolutionary “tree” in his seminal book The Origin of Species. Over a century later, evolutionary biologists still use tree diagrams to depict evolution because such diagrams effectively convey the concept that speciation occurs through the adaptive and semirandom splitting of lineages. Over time, species classification has become less static and more dynamic.

Rooted tree

A rooted phylogenetic tree (see two graphics at top) is a directed tree with a unique node corresponding to the (usually imputed) most recent common ancestor of all the entities at the leaves of the tree. The most common method for rooting trees is the use of an uncontroversial outgroup—close enough to allow inference from trait data or molecular sequencing, but far enough to be a clear outgroup.

Unrooted tree

An unrooted phylogenetic tree for myosin, a superfamily of proteins. [1]

Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry. They do not require the ancestral root to be known or inferred. [2] Unrooted trees can always be generated from rooted ones by simply omitting the root. By contrast, inferring the root of an unrooted tree requires some means of identifying ancestry. This is normally done by including an outgroup in the input data so that the root is necessarily between the outgroup and the rest of the taxa in the tree, or by introducing additional assumptions about the relative rates of evolution on each branch, such as an application of the molecular clock hypothesis. [3]

Bifurcating tree

Both rooted and unrooted phylogenetic trees can be either bifurcating or multifurcating, and either labeled or unlabeled. A rooted bifurcating tree has exactly two descendants arising from each interior node (that is, it forms a binary tree), and an unrooted bifurcating tree takes the form of an unrooted binary tree, a free tree with exactly three neighbors at each internal node. In contrast, a rooted multifurcating tree may have more than two children at some nodes and an unrooted multifurcating tree may have more than three neighbors at some nodes. A labeled tree has specific values assigned to its leaves, while an unlabeled tree, sometimes called a tree shape, defines a topology only. The number of possible trees for a given number of leaf nodes depends on the specific type of tree, but there are always more multifurcating than bifurcating trees, more labeled than unlabeled trees, and more rooted than unrooted trees. The last distinction is the most biologically relevant; it arises because there are many places on an unrooted tree to put the root. For labeled bifurcating trees, there are:

$$(2n-3)!! = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \quad \text{for } n \ge 2$$

total rooted trees and

$$(2n-5)!! = \frac{(2n-5)!}{2^{n-3}\,(n-3)!} \quad \text{for } n \ge 3$$

total unrooted trees, where n represents the number of leaf nodes. Among labeled bifurcating trees, the number of unrooted trees with n leaves is equal to the number of rooted trees with n − 1 leaves. [4]
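The standard counts for labeled bifurcating trees, (2n − 3)!! rooted and (2n − 5)!! unrooted, are easy to evaluate, and doing so shows how quickly tree space grows. The function names below are chosen for this sketch only.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (for odd k >= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of labeled rooted bifurcating trees with n leaves: (2n - 3)!!"""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of labeled unrooted bifurcating trees with n leaves: (2n - 5)!!"""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


for n in (4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
# 4 15 3
# 5 105 15
# 10 34459425 2027025
```

The output also illustrates the identity stated above: there are 15 unrooted trees with 5 leaves and 15 rooted trees with 4 leaves.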

Special tree types


A spindle diagram, showing the evolution of the vertebrates at class level, width of spindles indicating number of families. Spindle diagrams are often used in evolutionary taxonomy.

A highly resolved, automatically generated tree of life, based on completely sequenced genomes. [5] [6]

  • A dendrogram is a broad term for the diagrammatic representation of a phylogenetic tree.
  • A cladogram is a phylogenetic tree formed using cladistic methods. This type of tree only represents a branching pattern; i.e., its branch spans do not represent time or relative amount of character change.
  • A phylogram is a phylogenetic tree that has branch spans proportional to the amount of character change.
  • A chronogram is a phylogenetic tree that explicitly represents evolutionary time through its branch spans.
  • A spindle diagram (often called a Romerogram after the American palaeontologist Alfred Romer) is the representation of the evolution and abundance of the various taxa through time.
  • A Dahlgrenogram is a diagram representing a cross section of a phylogenetic tree.
  • A phylogenetic network is not strictly speaking a tree, but rather a more general graph, or a directed acyclic graph in the case of rooted networks. They are used to overcome some of the limitations inherent to trees.


Phylogenetic trees composed with a nontrivial number of input sequences are constructed using computational phylogenetics methods. Distance-matrix methods such as neighbor-joining or UPGMA, which calculate genetic distance from multiple sequence alignments, are simplest to implement, but do not invoke an evolutionary model. Many sequence alignment methods such as ClustalW also create trees by using the simpler algorithms (i.e. those based on distance) of tree construction. Maximum parsimony is another simple method of estimating phylogenetic trees, but implies an implicit model of evolution (i.e. parsimony). More advanced methods use the optimality criterion of maximum likelihood, often within a Bayesian Framework, and apply an explicit model of evolution to phylogenetic tree estimation. [4] Identifying the optimal tree using many of these techniques is NP-hard, [4] so heuristic search and optimization methods are used in combination with tree-scoring functions to identify a reasonably good tree that fits the data.

Tree-building methods can be assessed on the basis of several criteria: [7]

  • efficiency (how long does it take to compute the answer, how much memory does it need?)
  • power (does it make good use of the data, or is information being wasted?)
  • consistency (will it converge on the same answer repeatedly, if each time given different data for the same model problem?)
  • robustness (does it cope well with violations of the assumptions of the underlying model?)
  • falsifiability (does it alert us when it is not good to use, i.e. when assumptions are violated?)

Tree-building techniques have also gained the attention of mathematicians. Trees can also be built using T-theory. [8]

Although phylogenetic trees produced on the basis of sequenced genes or genomic data in different species can provide evolutionary insight, they have important limitations. Most importantly, they do not necessarily accurately represent the evolutionary history of the included taxa. In fact, they are literally scientific hypotheses, subject to falsification by further study (e.g., gathering of additional data, analyzing the existing data with improved methods). The data on which they are based is noisy; the analysis can be confounded by genetic recombination, [9] horizontal gene transfer, [10] hybridisation between species that were not nearest neighbors on the tree before hybridisation takes place, convergent evolution, and conserved sequences.

Also, there are problems in basing the analysis on a single type of character, such as a single gene or protein or only on morphological analysis, because such trees constructed from another unrelated data source often differ from the first, and therefore great care is needed in inferring phylogenetic relationships among species. This is most true of genetic material that is subject to lateral gene transfer and recombination, where different haplotype blocks can have different histories. In general, the output tree of a phylogenetic analysis is an estimate of the character’s phylogeny (i.e. a gene tree) and not the phylogeny of the taxa (i.e. species tree) from which these characters were sampled, though ideally, both should be very close. For this reason, serious phylogenetic studies generally use a combination of genes that come from different genomic sources (e.g., from mitochondrial or plastid vs. nuclear genomes), or genes that would be expected to evolve under different selective regimes, so that homoplasy (false homology) would be unlikely to result from natural selection.

When extinct species are included in a tree, they are terminal nodes, as it is unlikely that they are direct ancestors of any extant species. Skepticism might be applied when extinct species are included in trees that are wholly or partly based on DNA sequence data, because little useful “ancient DNA” is preserved for longer than 100,000 years, and except in the most unusual circumstances no DNA sequences long enough for use in phylogenetic analyses have yet been recovered from material over 1 million years old.

The range of useful DNA materials has expanded with advances in extraction and sequencing technologies. Development of technologies able to infer sequences from smaller fragments, or from spatial patterns of DNA degradation products, would further expand the range of DNA considered useful.

In some organisms, endosymbionts have an independent genetic history from the host.

Phylogenetic networks are used when bifurcating trees are not suitable, due to these complications which suggest a more reticulate evolutionary history of the organisms sampled.


Evolutionary trees (almost) always start with an ancestor and then divide, so you can always identify the root (if there is one) as the point where all the branches converge. Historically, it was drawn at the bottom like a real tree (as with the great Molluscan tree in OUMNH and the OneZoom Tree of Life Explorer). These days, it is usually drawn on the left as in these diagrams, but I have seen trees with the root at the top, bottom or even on the right. (The latter is usually only used when mirroring another tree.) I have posted before on how to root a phylogenetic tree, so I won't go over that again here. The rooting method should be given in the methods but, when it is missing, you can often guess from the shape of the tree and using the root-to-tip branch lengths again:
Unrooted trees are pretty obvious when shown in the "radiation" style. If the tree is rooted, it is almost certainly either midpoint rooted or outgroup rooted (see "how to root a phylogenetic tree"). Midpoint rooting can be identified by virtue of the fact that the two longest root-to-tip distances will (a) be the same length and (b) be on either side of the root. If either of these conditions is broken, it is not midpoint rooted and is probably outgroup rooted. (Note that if both conditions are met, it is still possible that the tree is outgroup rooted. Indeed, if the evolutionary rates are fairly consistent, outgroup rooting and midpoint rooting should be the same.)
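The two conditions above can be checked mechanically. The following is a minimal sketch using ape, assuming a rooted tree with branch lengths; looks_midpoint_rooted() is a hypothetical helper written for this illustration (it is not an ape function), and the two Newick strings are made-up examples. It compares the deepest tip on each side of the root rather than literally picking the two longest tips, so that ties are handled sensibly.

```r
library(ape)

# Hypothetical helper (not part of ape): TRUE when the deepest tips on two
# different sides of the root are equally far from the root.
looks_midpoint_rooted <- function(tree, tol = 1e-8) {
  stopifnot(is.rooted(tree), !is.null(tree$edge.length))
  n_tips <- Ntip(tree)
  root <- n_tips + 1L  # ape numbers the root node Ntip + 1
  depth <- node.depth.edgelength(tree)[seq_len(n_tips)]  # root-to-tip distances

  # Walk up from a tip until we reach the child of the root it descends from
  side_of <- function(node) {
    repeat {
      parent <- tree$edge[tree$edge[, 2] == node, 1]
      if (parent == root) return(node)
      node <- parent
    }
  }
  side <- vapply(seq_len(n_tips), side_of, numeric(1))

  # Longest root-to-tip distance on each side of the root, largest first
  deepest <- sort(tapply(depth, side, max), decreasing = TRUE)
  abs(deepest[[1]] - deepest[[2]]) < tol
}

# Made-up examples: the same ingroup, rooted at the midpoint and on an outgroup "O"
mid <- read.tree(text = "((A:2,B:1):1,(C:2.5,D:1):0.5);")
out <- read.tree(text = "(O:1,((A:2,B:1):1,(C:2.5,D:1):0.5):1);")

print(looks_midpoint_rooted(mid))  # TRUE: A and C are both 3 from the root, on opposite sides
print(looks_midpoint_rooted(out))  # FALSE: both long paths lie on the ingroup side
```

As noted above, a TRUE result does not rule out outgroup rooting; it only says the tree is consistent with midpoint rooting.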

Ideally, a rooted tree should have the root marked. Sometimes, however, it is left off, as in the bottom left. This can be confusing as tree-visualising programs will often display trees in the "traditional" style even when they are not rooted. This is particularly a problem when branch lengths are not shown, as it will not be at all obvious whether the tree is rooted or not. The time that I see this catch people out most is when making a Maximum Parsimony tree using the popular software, MEGA - these trees are displayed randomly rooted and without branch lengths by default.
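When the picture is ambiguous, the tree file itself can be inspected. The sketch below uses ape's is.rooted() and unroot() on two made-up Newick strings (the labels A to D are placeholders). Note the caveat: a basal bifurcation only shows that the tree is stored as rooted, not that the root position is biologically meaningful, which is exactly the trap with randomly rooted displays.

```r
library(ape)

# Made-up four-taxon trees; labels A-D are placeholders
stored_unrooted <- read.tree(text = "(A,B,(C,D));")    # three branches meet at the base
stored_rooted   <- read.tree(text = "((A,B),(C,D));")  # the base splits in two

print(is.rooted(stored_unrooted))  # FALSE
print(is.rooted(stored_rooted))    # TRUE

# unroot() removes the basal bifurcation, e.g. after an arbitrary rooting
derooted <- unroot(stored_rooted)
print(is.rooted(derooted))         # FALSE
```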

Phylogenetic Tools for Comparative Biology

Using the function drop.tip() we can easily excise a single taxon or a list of taxa from our "phylo" tree object in R. However, it is not immediately obvious how to prune the tree to include, rather than exclude, a specific list of tips. Trina Roberts (now at NESCent) shared a trick to do this with me some time ago, and I thought I'd pass it along to the readers of this blog.

First, let's start with a tree of 10 species:

> tree<-rtree(n=10)
> write.tree(tree)
[1] "(t8:0.22,((((t3:0.9,(t7:0.48,t2:0.5):0.12):0.47,t6:0.55):0.08,(t5:0.49,(t9:0.71,t10:0.13):0.15):0.7):0.87,(t1:0.72,t4:0.62):0.55):0.47)"

Now, say we want to keep the species t2, t4, t6, t8, and t10 in our pruned tree; we just put these tip names into a vector:

> species<-c("t2","t4","t6","t8","t10")

[More commonly, this vector will probably come from the row names in our data matrix, or we might read it from a text file.]

We create the pruned tree with one command:

> pruned.tree<-drop.tip(tree,tree$tip.label[-match(species, tree$tip.label)])

Now we have our pruned tree, as desired:
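As a self-contained check of the steps above, the sketch below rebuilds the tree from the Newick string printed earlier instead of calling rtree(), so the result is reproducible. Two assumptions are made: a terminating semicolon is added to the string (write.tree() normally emits one), and the rounded branch lengths shown above are taken as given. It then applies the negative-index match() approach and reports which tips survive.

```r
library(ape)

# Newick string from the text above, with a terminating ";" added (assumption)
newick <- "(t8:0.22,((((t3:0.9,(t7:0.48,t2:0.5):0.12):0.47,t6:0.55):0.08,(t5:0.49,(t9:0.71,t10:0.13):0.15):0.7):0.87,(t1:0.72,t4:0.62):0.55):0.47);"
tree <- read.tree(text = newick)

# Tips we want to keep
species <- c("t2", "t4", "t6", "t8", "t10")

# Drop every tip whose label is NOT in `species`
pruned.tree <- drop.tip(tree, tree$tip.label[-match(species, tree$tip.label)])

print(Ntip(pruned.tree))             # 5
print(sort(pruned.tree$tip.label))   # the five labels listed in `species`
```

drop.tip() also removes the internal nodes left with a single descendant, so the pruned tree stays strictly branching.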


If there are tips in the "species" vector that are not in the tree, match(species,tree$tip.label) will return one or multiple NAs, and the procedure will fail. To avoid this problem, one can just do:
> pruned.tree<-drop.tip(tree, tree$tip.label[-na.omit(match(species, tree$tip.label))])

Even less code than the -match trick:

pruned.tree<-drop.tip(tree, setdiff(tree$tip.label, species))

setdiff is very handy (as are intersect and %in%).

Dan's method will also work even if some of the labels in "species" are not in "tree."
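To make the point about missing labels concrete, here is a hedged sketch in which "t99" is a made-up label that is deliberately absent from the tree. It shows match() returning an NA for that label, the setdiff() version pruning correctly anyway, and, as one further option not mentioned above, ape's keep.tip() combined with intersect(). The Newick string is the one from the post with a terminating semicolon added as an assumption.

```r
library(ape)

newick <- "(t8:0.22,((((t3:0.9,(t7:0.48,t2:0.5):0.12):0.47,t6:0.55):0.08,(t5:0.49,(t9:0.71,t10:0.13):0.15):0.7):0.87,(t1:0.72,t4:0.62):0.55):0.47);"
tree <- read.tree(text = newick)

# "t99" is a hypothetical label that does not occur in the tree
species <- c("t2", "t4", "t6", "t8", "t10", "t99")

# match() yields NA for the absent label; this is what breaks the plain -match trick
idx <- match(species, tree$tip.label)
print(anyNA(idx))  # TRUE

# setdiff() simply ignores labels that are not in the tree
pruned_setdiff <- drop.tip(tree, setdiff(tree$tip.label, species))

# keep.tip() states the intent directly; intersect() filters out absent labels first
pruned_keep <- keep.tip(tree, intersect(species, tree$tip.label))

print(sort(pruned_setdiff$tip.label))
print(sort(pruned_keep$tip.label))
```

Silently ignoring absent labels is convenient, but it can hide typos in the species list; comparing setdiff(species, tree$tip.label) against an empty vector before pruning is a cheap sanity check.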
