Information

Sequence evolution simulation tool

Sequence evolution simulation tool


We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I'm looking for a tool to simulate sequence evolution given a specific mutation model and birth-death model. I'm aware of tools and packages like INDELible, Seq-Gen and PhyloSim, but they simulate evolution along phylogenetic trees. What I want is to give a sequence and ask the tool to simulate the evolution of this sequence through N generations. The output should be a big bunch of descendant sequences. I've been thinking about writing such a tool myself, but it's always better to look for something already coded.


There exists a bunch of population genetics forward and backward (coalescence) simulation platforms. Here is a non-exhaustive list. They all differ and you'll have to go through their manual to see what is more adapted to your needs.

List of softwares

Here is a long list of such platforms. The list might arguably be a little bit outdated today though and many of these softwares might be slow or abandoned by their authors.

Most common softwares to my experience

Some are more known than others. Personally, I already saw uses of the following platforms in publication: SimCoal, Nemo, Slim and SFS_Code.

SimBit

Of course, I must indicate my own simulation platform SimBit. To my experience, SimBit is typically faster than Nemo. It is slower than SFS_Code and SLim for very small simulations but become much much faster than SFS_Code and SLim when you need simulations with a descent amount of genetic diversity.

Most important point about these softwares

You should compare softwares based on

  • Availability of the author to advice and bring new features
    • I won't talk about my personal experience here by respect for the authors!
  • Speed / RAM usage
    • I think SimBit is typically faster (and use less RAM) than Nemo. SFS_code and SLim are very fast for simulations with low genetic diversity but very slow for simulations with high genetic diversity.
  • Flexibility
    • Nemo, SLim and SimBit are very flexible (but they do different things) but SFS_code is not really.
  • User interface
    • I find Nemo error report pretty poor. I like SimBit error report. SLim interface is very nice (using eidos) and it even comes with a GUI (I have never used the GUI myself).
  • Free of Bugs
    • Several people have found important bugs in SFS_code (personal communication and experience).

Slightly more detailed comparison between Nemo, SLim, SFS_code and SimBit

I have personally used NEMO, SFS_CODE and SLim in the past (I now use SimBit only). So I can only talk about these 4 below.

All four models are well updated and maintained by their authors and of course everyone has a access to the source code.

SFS_code and SLim are very fast for small simulations. The issue of SFS_code and SLim is that the RAM usage and run time are exponential function of the genetic diversity. This means that some simulations can quickly become totally unmanageable. But it is a great tool if you expect little genetic diversity.

SimBit is typically faster than Nemo, at least for large genomes. This difference becomes particularly obvious for large genomes. One might think "Oh I am fine to wait a few extra days to get my results" but the difference in speed between two softwares can also be a difference in whether you have to wait a week or 20 years to get your results, so do not neglect the importance of running time and RAM usage without estimating your needs.

Nemo is very flexible when it comes to the order of life-cycle events (is migration before or after selection for example) and the newest version allows for simulating age structured population. SimBit is very flexible, very fast and allow for simulating multi-species and their ecological interactions. SFS_code is less flexible.


My only concerns about NEMO are that each individual dies after reproduction and that there can't be more than 256 alleles at the same time.

SFS_code limits the number of alleles to 4. Nemo limits the number of alleles to 256 (but you can also use phenotype and have a float number at each locus). SimBit uses either bi-allelic loci, a block in which the number of mutations are counted (max: 256 mutations; this is different than 256 alleles per say) or continuous float number at each locus for the phenotype. Do you really need more alleles? Why can't you just use several loci to represent a sequence with multiple alleles? Real biology can only have 4 alleles in its smallest locus! SimBit uses a bit-per-bit system. If you need, say 50,000 alleles, then you just need to ask for a sequence of 16 bits (2 bytes) and you can get $2^{16} = 65,536$ alleles.

If you need more alleles in a single block (which is a little surprising to me a priori), you can probably just ask the author and he might accept to helping you if you convince him it is of interest to add this feature (he might also be more willing to help in exchange of authorship if you need more support).

Hope that helps. Good luck!


Messer Lab – SLiM

SLiM is an evolutionary simulation framework that combines a powerful engine for population genetic simulations with the capability of modeling arbitrarily complex evolutionary scenarios. Simulations are configured via the integrated Eidos scripting language that allows interactive control over practically every aspect of the simulated evolutionary scenarios. The underlying individual-based simulation engine is highly optimized to enable modeling of entire chromosomes in large populations. We also provide a graphical user interface on macOS and Linux for easy simulation set-up, interactive runtime control, and dynamical visualization of simulation output.

A 4–5 day SLiM Workshop is now available online. The SLiM Workshop is also offered in person from time to time see the SLiM Workshops subsection below for more information.

Downloads (version 3.6)

The macOS Installer will install the slim command-line tool, the SLiMgui graphical development environment, and both manuals. (If you are on macOS and you wish to run the old Cocoa-based SLiMguiLegacy app instead, for some reason, you can download it from http://www.benhaller.com/slim/SLiMguiLegacy.app.zip or build it in Xcode it is no longer included in the installer.) On other Un*x platforms, you should follow the instructions in the SLiM manual (chapter 2) there may be an installer for your platform, or you may need to build from sources. SLiM can also run on Windows, under the WSL see the manual for details.

The SLiM Manual includes a collection of recipes for common situations. You can download a zip archive of those recipes, if you wish they are also directly available through SLiMgui’s File menu. The Eidos Manual covers the details of the Eidos language, used to control SLiM via scripting. Reference sheets for both SLiM and Eidos make them easy to use with minimal use of the full documentation.

Note that the source code archive provided here contains neither macOS specific code, nor the Xcode project for SLiM it is intended for users on Linux (macOS users are strongly encouraged to use the macOS Installer rather than building from sources). The complete sources including macOS files can be found on GitHub you can get the sources for a tagged release, such as 3.6, or for the current development head.

We also provide a GitHub repository called SLiM-Extras with additional useful tidbits for users of SLiM, such as user-defined Eidos functions for performing some common tasks, and we welcome contributions to that repository from others.

SLiMgui

With the SLiMgui graphical modeling environment (compatible with macOS, Linux, and Windows under the WSL), you can visualize your simulation as it runs and examine its parameters in real-time, allowing for much easier simulation development.

A screenshot of SLiMgui simulating the population dynamics of a CRISPR/Cas9 “gene drive” in a six-subpopulation island model with spatial variation in selection on the driver allele. The Eidos scripting area is on the left, output is on the right. A visual representation of the population structure is shown in the subwindow, and all of the individuals in the six subpopulations can be seen at top (colored according to their relative fitness). The central black bar shows a summary of the genetic variation present in the population rare neutral mutations are visible as short yellow bars, and the tall red bar represents the driver allele approaching fixation.

Mailing lists

There are two mailing lists. Please subscribe to either or both using the links below:

    : A low-volume mailing list for announcements only. Users may not post to this list. : A higher-volume mailing list for questions, comments, bug reports, and discussion among users of SLiM.

SLiM Workshops

We run 5-day SLiM workshops that are free and open to the public (with registration). At present, however, these workshops are on hold due to the coronavirus pandemic.

The workshop materials are now online, for people who want to do it themselves visit the online workshop download page for more information.


The first SLiM workshop: Umeå, Sweden, August 2019.

News

2021 March 3: SLiM 3.6 is released! This is a major release, with many new features and several important bug fixes. This upgrade is strongly recommended for all users version 3.5 should not be used due to two bugs that could cause incorrect model results. For more information, see slim-announce for the announcement and release notes.

2020 December 8: SLiM 3.5 is released! This is a major release, with many new features and several important bug fixes. This upgrade is recommended for all users, but it breaks backward compatibility/reproducibility in several ways read the release notes carefully. For more information, see slim-announce for the announcement and release notes.

2020 May 12: SLiM 3.4 is released! This is a major release, including the first release of QtSLiM, as well as a few other new features and fixes for a couple of bugs. This upgrade is recommended for all users. For more information, see slim-announce as usual.

2020 January 30: SLiM 3.3.2 is released! This is a minor release, with several small improvements and fixes for a couple of bugs (one significant). This upgrade is recommended for all users. For more information, see slim-announce as usual.

2020 January 13–17: We had a SLiM workshop on our home turf, at Cornell! We’ve done a couple so far (Sweden, the UK, New York City) but this was our first at Cornell. We had 28 attendees, and it went very well, including a catered lunch sponsored by 3CPG. We’ve got more workshops planned (see the previous subsection) please contact us if you’d like to host a workshop at your institution!

2019 September 28: SLiM 3.3.1 is released! This version is a minor release, mostly bug fixes (including a few rare but nasty bugs!). This upgrade is recommended for all users. For more information, see the new manuals, and the release notes in the announcement on slim-announce.

2019 May 15: SLiM 3.3 is released! This version is a major release, with big new features (nucleotide-based models, mutation() callbacks, etc.) and some big, important bug fixes. This upgrade is strongly recommended for all users. For more information, see the new manuals, and the release notes in the announcement on slim-announce.

2019 January 29: SLiM 3.2.1 is released! This version is a minor release, providing some smaller feature additions and a couple of new recipes, as well as bug fixes. For more information, see the new manuals, and the release notes in the announcement on slim-announce.

2019 January 18: In recent days we have had three (!) new papers published related to SLiM. (1) “SLiM 3: Forward genetic simulations beyond the Wright–Fisher model” describes the support for non-Wright–Fisher models and continuous space in SLiM 3 (DOI). (2) “Evolutionary modeling in SLiM 3 for beginners” walks new users through making a simple model in SLiM 3, with lots of explanations of basic concepts (DOI). (3) “Tree‐sequence recording in SLiM opens new horizons for forward‐time simulation of whole genomes” discusses the new tree-sequence recording feature in SLiM 3 in detail, with several examples of its practical application to speeding up model execution, burn-in simulation, and data analysis (DOI). We hope these papers are useful for getting up to speed on all the new stuff we’ve been working on!

2018 November 6: SLiM 3.2 is released! This version greatly improves the performance of large nonWF models, and provides numerous improvements and bug fixes. For more information, see the new manuals, and the release notes in the announcement on slim-announce.

2018 September 3: SLiM 3.1 is released! This version greatly improves the performance of spatial interactions, and provides improvements to tree-sequence recording, among other improvements and bug fixes. For more information, see the new manuals, and the release notes in the announcement on slim-announce.

2018 July 1: SLiM 3.0 is released! This is our first full version upgrade since SLiM 2.0 was released in early 2016. It adds support for non-Wright-Fisher (nonWF) models and tree-sequence recording, two features that greatly increase SLiM’s power and flexibility. Many smaller improvements have been made too. For more information, see the new manuals, and the release notes in the announcement on slim-announce.

2017 December 16: SLiM 2.6 is released! This is a major release, with lots of new stuff see the new manuals, and the release notes in the announcement on slim-announce.

2017 October 27: SLiM 2.5 is released! This is a major release, with lots of new stuff and some important bug fixes see the new manuals, and the release notes in the announcement on slim-announce.

2017 September 12: SLiM 2.4.2 is released (fix for a bug that could cause incorrect output from many/most models).

2017 July 26: SLiM 2.4.1 is released (fix for a bug that could cause crashes or incorrect results in multi-subpopulation simulations).

2017 July 14: SLiM 2.4 is released! This is a major release, with speed improvements for many types of models, new support for runtime profiling in SLiMgui, and many other features. See the new SLiM and Eidos manuals for current documentation. There are also release notes in the announcement on slim-announce.

2017 April 18: SLiM 2.3 is released! This is a major release, notably adding support for continuous space and spatial interactions. See the new SLiM and Eidos manuals for current documentation. There are also release notes in the announcement on slim-announce.

2017 February 22: SLiM 2.2.1 is released (new recipes, minor features and fixes).

2016 December 8: SLiM 2.2 is released! This is a major release with big performance improvements and several new features. See the new SLiM and Eidos manuals for current documentation. There are also release notes in the announcement on slim-announce.

2016 November 8: SLiM 2.1.1 is released (mostly bug fixes, some serious).

2016 October 4: Our publication on SLiM 2 is out in Molecular Biology and Evolution: DOI.

2016 September 19: SLiM 2.1 is released! This is a major release with lots of features added. See the new SLiM and Eidos manuals for current documentation. There are also release notes in the announcement on slim-announce.

2016 May 26: SLiM 2.0.4 is released (minor bug fixes).

2016 May 12: SLiM 2.0.3 is released (improvements to code completion).

2016 May 6: SLiM 2.0.2 is released (minor feature additions).

2016 April 27: SLiM 2.0.1 is released (minor bug fix and minor feature addition).

2016 April 1: SLiM 2.0 is released! We are excited to announce SLiM 2.0, a new major release of the SLiM package. SLiM 2.0 adds scriptability with Eidos, and interactive simulation development using the SLiMgui application. We have put up a blog post with more details about the SLiM 2.0 release.

License and citation

SLiM is free open-source software, licensed under the GNU General Public License version 3. If you use SLiM in your research, please cite us.

For SLiM 3, cite:

Haller, B.C., & Messer, P.W. (2019). SLiM 3: Forward genetic simulations beyond the Wright–Fisher model. Molecular Biology and Evolution 36(3), 632–637. DOI

For models using tree-sequence recording, cite:

Haller, B.C., Galloway, J., Kelleher, J., Messer, P.W., & Ralph, P.L. (2019). Tree‐sequence recording in SLiM opens new horizons for forward‐time simulation of whole genomes. Molecular Ecology Resources 19(2), 552–566. DOI

If appropriate, cite our “protocol” paper for beginning SLiM users:

Haller, B.C., & Messer, P.W. (2019). Evolutionary modeling in SLiM 3 for beginners. Molecular Biology and Evolution 36(5), 1101–1109. DOI

For older models using SLiM 2, cite:

Haller, B.C., & Messer, P.W. (2017). SLiM 2: Flexible, interactive forward genetic simulations. Molecular Biology and Evolution 34(1), 230–240. DOI

And for SLiM 1, cite:

Messer, P.W. (2013). SLiM: Simulating evolution with selection and linkage. Genetics 194(4), 1037–1039. DOI

The SLiM icon

Graphics files for the SLiM icon are downloadable here. These images are: Copyright (c) 2016–2019 Philipp Messer, All Rights Reserved. Permission is hereby granted for re-use specifically for SLiM-related purposes that are respectful and supportive of the SLiM community. If you have any doubt as to this, please contact us for permission. Various sizes and formats are provided.

Older versions

For new projects, using the current version is strongly recommended. Old versions from SLiM 2.0 onward can be found on GitHub using the release tags. These old versions are no longer supported.

Publications

It can be useful to see how others are using SLiM sometimes you can even download the actual SLiM model used in a paper. However, updating a list of all the publications that cite SLiM became too time-consuming, so now we provide links for Google Scholar searches, for papers that cite our SLiM papers.


Installing PopG

Here are instructions for saving, unpacking, and installing PopG from different browsers, and on operating systems. We cover the Chrome, Firefox, Safari, and Internet Explorer browsers on the Windows, Mac OS X, and Linux operating systems.

  1. Click on the link.
  2. A downwards-pointing animated arrow in the lower-left portion of the browser window will move, pointing to a tab there called PopG.zip.
  3. PopG.zip will now be found in your Downloads folder.
  4. Click (or if this does not work, double-click) on the PopG.zip file. The Zip archive will be extracted into the Downloads folder. A folder PopG will be created in the Downloads folder.
  5. Move that folder to where you want it to be.
  1. Click on the link.
  2. A dialog box opens and offers to let you Save File. Choose that.
  3. The Zip archive PopG.zip will be in your Downloads folder.
  4. Double-click on it. The archive will be extracted and a folder PopG created in the Downloads folder.
  5. Move that folder to where you want it to be.
  1. Click on the link.
  2. A dialog box opens and offers to let you Save File. Choose that.
  3. The Zip file PopG.zip will be downloaded into the downloads folder and will be automatically extracted. A folder PopG will be created in the Downloads folder.
  4. Move it where you want it to be.
  1. Click on the link.
  2. The Zip file will be downloaded into the downloads folder and will be automatically extracted. A folder PopG will be created in the Downloads folder.
  3. Move it where you want it to be.

A MAC PROBLEM: On Mac OS X systems, when you attempt to extract the Zip archive, or when you attempt to run the Java executable, the system may complain that this is from an unknown developer. That is simply because I did not sign the file with my Apple Developer ID. You should be able to make the operation work by control-clicking on the icon and selecting the option to open the file, using the defaults suggested. Once it successfully gets past this, it will not bother you with this again.

The Java archive

The Java archive file PopG.jar will exist in the folder PopG once you have downloaded and installed PopG. If you have Java installed on your system, you should be able to run the Java program by finding the folder PopG and clicking or double-clicking on the icon of the file PopG.jar

The documentation page

The PopG folder also includes the present documentation web page which you are now reading. This can be read here or you can use the Save As menu item in the File menu of your browser to save a copy of it on your machine. The latest version of this page can be read on the Web using this link.

Older versions of PopG

There are also older executable versions compiled for Windows, Mac OS X, and Linux systems, plus some even older operating systems. These can be fetched from folder old at our PopG site. Most users should not use these older executables, but if you do, you should start by reading the README file in that folder. One of the versions there is version 3.4, which has compiled executables for the three major operating systems available as well as C source code. These may be useful if you do not have Java and cannot install it on your system.

Where are the Android and iOS versions?`

We would like to make versions available for tablets and even phones. Unfortunately, a version of Java that can use the graphics functions does not seem to exist on the Android operating system and the iOS operating system. We would have to rewrite the program separately for each of those. If you know of a way to run our Java executables on either of those operating systems, and get it to work, please let us know how you did that.

Making sure you have Java on your computer

If you have Java installed you can run the PopG program. Generally, Java will be already installed on Mac OS X systems and on Linux systems. If you aren't sure if you have Java installed, you can type java -version in a command window and, it Java exists, it will tell you what the version is. If you get back a blank line, you need to either download Java or append where it is to your search path. On Windows systems and on Mac OS X or Linux systems that do not have Java, you can install a recent version of Java at no cost by using this link: java.com. Recent Linux and Mac OS X systems usually have a recent-enough version of Java already installed. Mac OS X systems 10.4 (Leopard) and earlier may not have a recent-enough Java to be able to run PopG. Windows systems do not come with Java already installed, but it can be installed on them from the above web site.

Running the program

To run the PopG Java program you should simply click (or double-click) on the icon of the PopG.jar file (you can also run it from a command window by navigating to where PopG.jar is stored and typing java -jar PopG.jar). The start up screen looks like this:

There are two menus, File and Run , that control PopG. They are in the upper left of the main PopG window.

The Run menu

The Run menu contains five items: Continue w/ , Continue , New Run , Restart , and Display Whole Plot .

The first time it is picked, it looks like:

with all but New Run grayed out. Once you have done your first run, all the selections will be active.

New Run Initially only New Run is available. It brings up the following dialog:

It contains all the parameters that control a PopG run. Note that usually you do not enter a Random Number seed unless you want to do two identical runs. When you are finished editing you can click the OK box to start the run. You can also click Cancel to not start the run and Defaults to reset all the data entry boxes to their default values. Continue w/ This choice continues the run, for the same number of generations as previously entered in the New Run menu. Continue This continues the run, but presents the following dialog:

which allows you to change the number of generations run in the next continuation of the run. Restart This restarts the run with the same parameter values as before. If you want to change some of the parameter values, use New Run instead. Display Whole Plot This plots all generations on the same plot. During a run the plots will normally show only the last group of generations. This shows all generations that have been simulated so far. This is particularly handy when you have finished a simulation and want to print the results of the whole run.

The random number generator

The program uses a random number generator which automatically initializes from your system clock. Thus it should give you a different sequence of random numbers and thus a different result every time you run the program. In the menu for a new run, there is a setting for Random number seed which is set by default to (Autogenerate) , which will initialize from the system clock. You probably won't have any reason to change this, unless you are debugging PopG and want to do the same run, with the same random outcomes, twice. If you do wish to do the same exact run twice, enter a value in place of the (Autogenerate) string and PopG will use that to initialize the random number generator. Assuming you have not modified the calcPopG routine within the Java code, every time you start with that random number you will get exactly the same results.

The File menu

This contains four menu items. They are Save , Print , About and Quit .

The first time it is displayed, it looks like:

with Save and Print grayed out. Once you have done your first run, they will be active.

Save This opens a standard save file dialog and allows you to save the graph as a JPG or PNG file. The default name is PopG with the appropriate extension to match to file format. Print This opens a standard print dialog and allows you to select a printer and print the graph. About Displays the program's copyright notice. Quit This is self-explanatory: the program quits.

Compiling it yourself

Most people will not need to compile the program themselves as the Java Jar package supplied should run on most versions of Java. So you should probably skip this section. But if you wish to modify the functionality of PopG or if you have some unusual Java environment that will not run the supplied Jar file you will need a Java compiler. We repeat: If you just need to run the program, you should run the Jar file that comes in our distribution. You do not need to compile anything (though you may need to install Java).

If you do need to compile the program, you will find a src directory in the downloaded and unzipped folder PopG which you got from our site. Import the file PopGUserInterface.java from src into your favorite Java editor (we used Eclipse). You can either execute it directly from there or export a Java Jar from the editor and execute it. PopGUserInterface.java does not reference any external libraries, everything it needs is in the JavaSE-1.6 system library. If you are modifying our program, once you have finished doing that you should have no problems creating the Java Jar,

If you cannot do, tell us, since that would be a bug.

Simulating with PopG

This program simulates the evolution of random-mating populations with two alleles, arbitrary fitnesses of the three genotypes, an arbitrary mutation rate, an arbitrary rate of migration between the replicate populations, and finite population size.

The programs simulate simultaneously evolving populations with you specifying the population size, the fitnesses of the three genotypes, the mutation rates in both directions (from A to a and from a to A ), and the initial gene frequency. They also ask for a migration rate among all the populations, which will make their gene frequencies more similar to each other. Much of the time (but not always!) you will want to set this migration rate to zero. In most respects the program is self-explanatory.

Initially there are ten populations. You can set the number of simultaneously-evolving populations to any number from 0 to 1000. The population size for each population can be any number from 1 to 10000. Note that a larger population, a larger number of generations run, and a larger number of populations can lead to longer runs.

When you make a menu selection that causes the program to run, a graph of the gene frequencies of the A allele in each of the populations will be drawn in the window. Here is what the graph looks like when we run with an initial gene frequency of 0.2 and fitnesses of AA, Aa, and aa set to 1.08, 1.04, and 1, with all other parameters in their default values. (Note that if you try this run, there will be different random numbers, so your result will be a bit different).

Note that the window can be resized, and the graph should adjust to this. There is also a blue curve that shows what the gene frequencies would be in an infinite population (one with no genetic drift). If the number of populations being simulated is set to zero, this curve is all you will see. The graph can be printed using the Print option of the File menu, or saved to a Postscript file using the Save option in that menu.

Note that once the plot of the gene frequency curves reaches the right-hand side of the graph, the program prints there the number of populations that fixed for the A allele (ended up with a frequency of 1.0) and the number that lost this allele.

Suggestions

  • Try cases with no mutation, no migration, and all fitnesses 1.0 so that there is no selection. Does genetic drift in a population of size 1000 accomplish roughly the same changes in 1000 generations as genetic drift in a population of size 100 does in 100 generations? By running a largish number of populations, can you check whether the probability that an allele is fixed by genetic drift is equal to its initial frequency in the populations?
  • Try a case with no mutation or migration, with the A allele favored by natural selection (with fitness of the AA genotype set highest and fitness of the aa genotype set lowest). Start with a small frequency of A . Is it always fixed? If one starts with a single copy of the allele, how does the probability that A is fixed compare with the selection coefficient favoring it in the heterozygote (the fraction by which the Aa genotype is higher compared to the fitness of the aa genotype)? Is this fixation probability larger than the one you would get with the same initial frequency but with no selection?
  • Try overdominance ( Aa having the highest fitness). Does the gene frequency converge towards an equilibrium? Why does it vary from this equilibrium frequency? How large do the selection coefficients have to be to cause the gene frequency to stay away from fixation or loss for large amounts of time?
  • Try underdominance ( Aa having the lowest fitness). Is there a starting gene frequency that will result in some populations heading for fixation, and others heading for loss? If you add a small amount of migration, what will happen in the long run? What will happen if instead you add a small amount of mutation in both directions?
  • With migration but no selection or mutation, how much migration is needed to make the gene frequency curves be quite similar to each other? How much is needed to make them all end up at the same gene frequency in the long run? How is that migration rate affected by the population size?
  • With mutation but no migration or selection, how much mutation is needed to cause the gene frequencies to converge to a mutational equilibrium gene frequency? How does this value relate to the population size?
  • If an allele is selected against, can you set up mutation rates that will maintain it at low frequency in the population?

Credits

Version 4.0 of PopG, the first Java version, was written by Ben Zawadzki. His enormously effective programming made good use of mentorship and advice from our lab's Java wizard, Jim McGill.

The original version of PopG was written in the 1970s in FORTRAN by Joe Felsenstein. The interactive version then was written in C with much work by Hisashi Horino, Sean Lamont, Bill Alford, Mark Wells, Mike Palczewski, Doug Buxton, Elizabeth Walkup, Ben Zawadzki and Jim McGill. Hisashi and Sean wrote the C version, and the screen graphics for IBM PC and the first part of the Postscript printing system. Bill greatly improved and expanded the Postscript printing and the X windows graphics. Mark Wells did the original Macintosh version. Mike Palczewski greatly improved the Windows, Macintosh and X Windows graphical user interface, and Doug Buxton modified the program to the 3.0 version and prepared the executables for different operating systems. Elizabeth Walkup improved the X windows interaction and prepared version 3.3. Small documentation changes after version 4.0 were made by me.


BEAST: Bayesian evolutionary analysis by sampling trees

Background: The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented.

Results: BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81 packages. It provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at http://beast-mcmc.googlecode.com/ under the GNU LGPL license.

Conclusion: BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new models and statistical methods of evolutionary analysis.


Results and Discussion

Test 1: Insertions Alone or Deletions Alone

Our first test was to run insertion-only and deletion-only simulations. Indel lengths were fixed with 1, 2, 4, and 8 residues or bases. We measured the performance of each method by 1) the length of the true multiple alignment for insertions, where the number of sites inserted is equal to the alignment length minus 1,000, and 2) the number of characters remaining in the output sequence for deletions.

Figure 5 and supplementary figure S3, Supplementary Material online, show the test results. The number of internal nodes in the guide tree had an adverse effect on the performance of SIMPROT, iSGv1.0, ROSE, and MySSP (only in the case of insertions). As a side effect of their continuous modeling of indels, overestimation of deletions ( fig. 5A and B ) and underestimation of insertions ( fig. 5C and D ) are clearly shown with fewer numbers of internal nodes. These methods calculate the expected number of indel events without adjusting the sequence length when an event occurs. DAWG, iSGv2.0, and EvolveAGene3 show no or very slight effects in indel numbers. For DAWG and iSGv2.0, this is because sequence lengths are adjusted dynamically along the branch. The unaffected results by EvolveAGene3 are likely due to the fact that this method treats branch lengths as the number of mutation-event tests that occur along a branch. For our purpose, we set the branch length to 8,000 mutation-event tests (1,000-character sequence with each site undergoing eight substitutions). EvolveAGene3 also forbids overlapping insertions and deletions, effectively reducing the deletion rates with larger deletions. This effect can be seen in supplementary figure S3 (I), Supplementary Material online, where “more” characters are left after simulations with larger deletion sizes. Because of the constant insertion and deletion probabilities set in EvolveAGene3 and our removal of codon constraints in EvolveAGene3, the lengths of the alignments are often shorter than other simulation methods.

Comparison of indel simulation performance (Test 1) among seven methods. Correct simulations are expected to produce a plot with a horizontal line. Indel sizes used are (A) size-1 deletions, (B) size-4 deletions, (C) size-1 insertions, and (D) size-4 insertions. The y-axes show the number of characters left in the leaf sequence (A,B) and the true alignment length (C, D). The average values obtained from 100 simulations are plotted. The table below summarizes the standard deviations for each data point. Supplementary figure S4, Supplementary Material online, shows all test results in color.

These results show that iSGv2.0 and DAWG performed appropriately, producing consistent results regardless of the number of internal nodes. EvolveAGene3 also behaved appropriately according to its own indel model. iSGv1.0, ROSE, SIMPROT, and MySSP are all affected by the number of internal nodes, producing artificially high or low rates for deletions or insertions, respectively.

Test 2: Including Both Insertions and Deletions with Various Pins and Pdel

To further examine the effect of indel models, we simulated both insertions and deletions using the Zipfian distribution (Chang and Benner 2004) with five methods: DAWG, ROSE, SIMPROT, iSGv1.0, and iSGv2.0. We simulated three scenarios: 1) Pins = 0.01 and Pdel = 0.03, 2) Pins = 0.02 and Pdel = 0.02, and 3) Pins = 0.03 and Pdel = 0.01, where Pins and Pdel are the number of insertions and deletions per substitution, respectively. We chose to use the Zipfian distribution because it is an empirically determined length distribution for insertion and deletion events. For MySSP, which implements a length distribution that is normally distributed based on the mean indel length given by the user, we used the expected indel length of 2.082, which is based on the Zipfian distribution with a maximum indel size of 10. EvolveAGene3 was excluded from this test because changing Pins and Pdel fundamentally alters the indel creation method in EvolveAGene3. In this test, an event counter reporting the numbers of insertions and deletions that occurred during the simulation was added to each method. Because we were unable to obtain the source code for MySSP, we calculated the number of events as follows: Each gap in the root sequence in the true multiple alignment is the effect of an insertion in the descendant sequences, and likewise, each gap in the tip sequence is the result of a deletion in the ancestral sequence. To obtain the number of insertion and deletion events, we tallied the total number of gaps in the root and tip sequences, respectively, and divided that number by the mean indel size. We measured the quality of indel simulation by comparing the numbers of insertions and deletions generated. We calculated the coefficient of variation (σ 2 /μ), which is dimensionless and makes results from different simulation methods comparable. If σ 2 /μ ≈ 0, it means that the simulation method behaved similarly between the guide trees (no effect of different number of nodes). A larger σ 2 /μ suggests that the simulation method performed differently between the guide trees with different number of nodes.

Figure 6 and table 2 show the results of this test. As expected, the number of insertions and deletions generated is affected by the insertion and deletion probabilities. For iSGv1.0 and ROSE, when PinsPdel, the number of internal nodes affects both the numbers of insertions and deletions. SIMPROT, as a result of their multiple-hit correction feature, shows much more drastic effects with the internal-node numbers than iSGv1.0 and ROSE when the insertion rate is larger than the deletion rate (Pins = 0.03 and Pdel = 0.01). Such effects are not shown when the deletion rate is larger (Pins = 0.01 and Pdel = 0.03). MySSP shows increasing numbers of indels for each test, with the most drastic change occurring when Pins = 0.03 and Pdel = 0.01. This behavior can be better understood using the results of Test 1, where in the case of insertions, the sequence length grows as more internal nodes are added, but the sequence length is stable for deletions under the same conditions.

Table 2

The Effects of Internal-Node Numbers with Varying Insertion and Deletion Rates among Methods a

σ 2 /μ
MySSP SIMPROT DAWG iSGv1.0 iSGv2.0 ROSE
insdelinsdelinsdelinsdelinsdelinsdel
(A) Pins = 0.01, Pdel = 0.030.1450.4330.0030.0120.0150.0040.2160.7260.0210.0020.2260.734
(B) Pins = 0.02, Pdel = 0.020.5070.4170.3800.4350.0140.0020.0040.0010.0220.0070.0030.000
(C) Pins = 0.03, Pdel = 0.014.8711.6752.7360.7570.0060.0030.7340.2950.0050.0310.8160.265

Test 2 results with different indel probability ratios. For each method, the total numbers of insertions (dark bars) and deletions (light bars) generated are shown for simulation experiments using the guide trees with different numbers of segments (see fig. 4 ). Note that for MySSP, we used the average expected value of the Zipfian distribution to obtain the results. For all methods, the average values obtained from 100 simulations are used.

Table 2 summarizes the degree of variation (σ 2 /μ) in the numbers of insertions and deletions among Test 2 experiments. For iSGv1.0 and ROSE, we note that the variation in the number of insertion and deletion events is higher trending toward the dominant event-probability as shown in figure 6 , whereas MySSP shows high variability in all tests, although it is most pronounced when the relative insertion rate is high. SIMPROT also shows high variation when Pins = 0.03 and Pdel = 0.01 or Pins = Pdel, although it is comparable with iSGv2.0 and DAWG when Pins = 0.01 and Pdel = 0.03. When Pins = Pdel, the indel models of iSGv1.0 and ROSE appear to be affected very little by internal-node numbers. Note that when Pins = Pdel, iSGv2.0 and DAWG show slightly larger σ 2 /μ values than iSGv1.0 and ROSE. This is a consequence of the larger number of steps with indel events and sequence-length evaluations performed by these methods. Simulation as a random walk increases the variance in sequence length at each step. DAWG and iSGv2.0 take much larger numbers of steps (600 and 1,024, respectively) compared with at most eight steps taken by all other simulators. We confirmed this result by varying the number of steps taken by iSGv2 (data not shown).

Example Application for a Protein Superfamily Simulation

Figure 7 shows examples of “true alignment” output obtained from ROSE and iSGv2.0 (see supplementary fig. S5, Supplementary Material online, for the output including iSGv1.0). As shown in table 3 , iSGv2.0 correctly conserved 80.98% of the sequence positions, whereas iSGv1.0 and ROSE, both of which cannot conserve sets of characters, correctly modeled only 61.82% and 61.88% of the positions, respectively. Of the different motifs, iSGv2.0 perfectly conserved all sites of SCR1 (see fig. 7 ). This is because this motif was present in the root alignment and conserved from the beginning of the simulation run along the lipocalin lineage. For other motifs, all substitutions were accepted until the motif became effective later in the tree, after which only substitutions conforming to the position-specific constraints were accepted.

Table 3

Performance Comparison among iSGv1.0, ROSE, and iSGv2.0 for the Calycin Superfamily Simulation a

iSGv1.0ROSEiSGv2.0Input Alignment
Percent sequence identity19.7817.7214.6215.65
Percent motif positions conserved61.8261.8880.98
Number of template violating indels b 13.1819.91 d 0
Number of rejected indels c NANA0.38

We observed multiple side effects due to the restrictions imposed by the quaternary invariable and I + γ arrays of iSGv1.0 and ROSE, respectively. As seen in figure 7 and supplementary figure S5, Supplementary Material online, conserved motifs in the multiple alignment appeared as “islands” where indels were absent. Additionally, invariable sites such as the GXW region of SCR1 were conserved for the entire column of the alignment, despite the fact that it should only be conserved among kernel lipocalins and the outlier lipocalin family of odorant-binding proteins. iSGv1.0 also simulated fewer indels, which is a side effect of the number of alignment positions that contain motif positions, reducing the number of accepting positions for indels. It appears that ROSE uses the absolute number of residues in the sequence to calculate the overall probability of an indel for a branch, regardless of the number of nonaccepting sites for indels. Differences in the indel placements between iSGv1.0 and ROSE versus iSGv2.0 are also evident. As shown in figure 7 , iSGv2.0 has a much higher number of indels along the N- and C-terminal regions of the alignment. This is because these regions had only weak constraints on their sizes: The N-terminal was constrained to 10� residues and the C-terminal 10 to 30 residues. During the simulation process, iSGv2.0 determined the size of the indel, and based on both template and motif constraints searched the sequences to find regions that could accept the indel. Most of the larger indels tended to fall in the least constrained regions. Because neither iSGv1.0 nor ROSE has such constraint capabilities, indels were placed wherever they were not forbidden by the quaternary invariable and I + γ arrays. Furthermore, the superfamily fold could not be modeled by either iSGv1.0 or ROSE. They placed an average of 13.18 and 19.91 template-breaking indels, respectively ( table 3 ). iSGv2.0 upheld the template restrictions. On average, 0.35 indels per simulation run were rejected using iSGv2.0 because there were limited acceptable positions for indels due to template constraints.

The input multiple alignment (supplementary fig. S1, Supplementary Material online) had an average pairwise sequence identity of 15.65% ( table 3 ). The 20�% range of sequence identity or lower is the so-called “twilight zone” of sequence identity (Rost 1999), which is often seen among proteins belonging to highly divergent families. iSGv1.0, ROSE, and iSGv2.0 simulated data sets in this range, with the average values over 100 runs of 19.78%, 17.72%, and 14.62%, respectively. The difference in sequence identities between iSGv1.0 and ROSE versus iSGv2.0 is explained by the global conservation of invariable sites by iSGv1.0 and ROSE, even for sequences without the lineage-specific motifs. iSGv1.0 and ROSE both showed a lower “percent motif positions conserved” ( table 3 ) indicating that some positions were conserved by them even if they did not conform to the residue constraints for different protein families.


Materials

Here are templates for some of the cardboard pieces you will need for this lab.

  • Cardboard strips approximately 6.5 cm wide (represent the bones)
  • Precut these cardboard strips into designated lengths for each lab group
    • 4x 20.5cm
    • 4x 12.5cm
    • 4x 7.5cm
    • 4x 5cm
    • 4x 3.5cm
    • Circle with a diameter of 5.5cm “X”
    • Circle with a diameter of 15cm “W”
    • Square with a length of 5.5cm “Y”
    • Square with a length of 2.5cm “Z”

    Availability and requirements

    Project name:πBUSS Project home page:http://www.rega.kuleuven.be/cev/ecv/software/pibuss Operating system(s): Platform independent Programming language: Java Other requirements: Java 1.5 or higher, BEAGLE library License: GNU Lesser GPL Any restrictions to use by non-academics: None.

    Source code of the parallel BEAST/BEAGLE utility for sequence simulation is freely available as part of the BEAST Google Code repository: http://www.code.google.com/p/beast-mcmc/.

    The Broad-platform Evolutionary Analysis General Likelihood Evaluator (BEAGLE) library has it’s both source code and binary installers available from http://www.code.google.com/p/beagle-lib.

    Scripts and input files required for repeating the simulation study presented in Example application are hosted at http://www.github.com/phylogeography/DeepRoot.


    MEGA 10.0

    MEGA (Molecular Evolutionary Genetics Analysis)is an integrated tool for automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.

    :: SCREENSHOTS

    :: REQUIREMENTS

    :: MORE INFORMATION

    Mol Biol Evol. 2013 Dec30(12):2725-9. doi: 10.1093/molbev/mst197. Epub 2013 Oct 16.
    MEGA6: Molecular Evolutionary Genetics Analysis version 6.0.
    Tamura K, Stecher G, Peterson D, Filipski A, Kumar S.

    Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S (2011)
    MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods.
    Molecular Biology and Evolution (2011)doi: 10.1093/molbev/msr121 First published online: May 4, 2011

    Leave a Reply Cancel reply

    This site uses Akismet to reduce spam. Learn how your comment data is processed.


    Sequence evolution simulation tool - Biology

    NGSphy: phylogenomic simulation of next-generation sequencing data

    © 2017 Merly Escalona ([email protected]), Sara Rocha, David Posada

    NGSphy is a Python open-source tool for the genome-wide simulation of NGS data (read counts or Illumina reads) obtained from thousands of gene families evolving under a common species tree, with multiple haploid and/or diploid individuals per species, where sequencing coverage (depth) heterogeneity can vary among species, individuals and loci, including off-target and uncaptured loci.

    NGSphy simulates reads (or read counts) from alignments originated from single gene trees or gene-tree distributions (originated from species-tree distributions). It is designed to read directly from SimPhy (a simulator of gene family evolution) in the case of gene-tree distributions, but it can also be fed with gene trees directly. These trees can contain orthologs, paralogs and xenologs. Alignments are simulated using INDELible and can represent multiple haploid and/or diploid individuals per species. Then, either Illumina reads (using ART) or read counts are simulated for each individual, with the depth of coverage allowed to vary between species, individuals and loci. This flexibility allows for the simulation of both off-target (captured but not targeted) and uncaptured (targeted but not captured) loci.

    You will need a NGSphy settings file and the required files according to the input mode selected (see bellow).


    Sequence evolution simulation tool - Biology

    Trevolver is a Perl program for simulating non-reversible DNA sequence evolution on a fixed bifurcating tree using trinucleotide context. It relies on no external dependencies, facilitating maximum portability. Just download and run.

    To test the simulation with the example data, execute the program at the Unix command line or Mac Terminal as follows:

    Find some real examples below. Check out our preprint on bioRxiv.

    New mutation data, including de novo mutations detected in whole genome sequencing of father/mother-child 'trios', can be used to empirically estimate context-dependent mutation rates. The most commonly used context is the trinucleotide, where the flanking nucleotide on either side of a position are considered (one 5', one 3'). Unfortunately, a tool is lacking for simulation of non-reversible (i.e., asymmetric rates) DNA evolution on a fixed bifurcating (binary fully resolved) gene tree, such as that produced by coalescence simulations. Trevolver was made to fill this gap. The user must provide a bifurcating tree, a seed sequence, a 64 × 4 (trinucleotide × nucleotide) rate matrix, and a branch unit (see options). Trevolver outputs a Variant Call Format (VCF) SNP report for all tree tips (leaves taxa), as well as the history of mutations and motifs for each lineage (see output).

    Trevolver begins by placing the seed sequence on the root of the tree. It then proceeds to recursively traverse the tree from root to tip as an ordered binary tree structure. In other words, at each internal node, it processes the left child (and its left child, and so on) before processing the right child, where the left (first) descendant node is considered to be that which comes first using the alphabetic (not numeric) sort() function. Unique node ID's are assigned as the letter 'n' followed by an integer (e.g., n5), where the root is considered n=0.

    When a single branch (as opposed to a cluster) is encountered, Trevolver begins to evolve the sequence along the branch. If the parent node of the branch also happens to be the root, the seed (ancestral) sequence is used as a starting point. Otherwise, a new and temporary copy of the seed sequence is created, into which the most recent allele at each mutated site is imputed. This saves computer memory, as each node need only inherit its mutational history, not an entire sequence, from its predeccesor.

    After a new sequence is initiated on a branch, it begins to evolve under trinucleotide-context-dependent mutation rates specified by the user input. The mutation rate matrix in Trevolver follows the format of the forward-time simulation SLiM (Haller and Messer 2019): the 4 3 = 64 rows correspond to the initial states of the 64 alphabetically-ordered trinucleotides, while the 4 columns correspond to the possible derived states of the central nucleotide. For example, the third column of the first row should contain the AAA➞AGA mutation rate. Mutation rates for identities (e.g., AAA➞AAA) should be 0.

    When evolution begins on a branch, Trevolver first calculates the overall mutation rate of the sequence of length L by tallying the L - 2 trinucleotides in the sequence (i.e., the first and last nucleotides, lacking trinucleotide context, are ignored). The overall mutation rate of the sequence (u) is then used to calculate a random exponentially-distributed waiting time to the next mutation as gw = -(1 / u) × ln(x1) generations, where x is a random number between 0 (inclusive) and 1 (exclusive) (Yang 2014). If the expected waiting time (gw) is less than the length of the branch, a mutation occurs. The sequence is then examined one trinucleotide at a time, until the cumulative mutation rate along the sequence exceeds a second randomly chosen value between 0 (inclusive) and u (exclusive). A mutation then occurs at the central position of the first trinucleotide for which this condition holds and for which the mutation rate is greater than 0. Because of this context-dependence, it is possible for highly mutable trinucleotides to 'erode' over time, and the overall mutation rate of the sequence is an emergent (rather than pre-specified) value depending on the rate matrix. Indeed, the overall mutation rate may decrease (or increase) until an equilibrium triuncleotide composition is reached.

    Call Trevolver using the following options:

    --tree (REQUIRED): file containing a bifurcating evolutionary tree in newick format with branch lengths. NO NODE NAMES OR SUPPORT VALUES AT THIS TIME. Only the first encountered tree is used, i.e., embarassingly parallel analyses must be executed one level up.

    --seed_sequence (REQUIRED): FASTA file containing starting (seed) sequence at tree root, to be evolved. Only the first sequence encountered is used.

    --rate_matrix (REQUIRED): file containing 64 × 4 tab-delimited trinucleotide rate matrix where rows and columns are in alphabetical order. For example, values in the first row correspond to AAA➞AAA, AAA➞ACA, AAA➞AGA, and AAA➞ATA.

    --branch_unit (REQUIRED): a scaling factor, equal to 2N0 or 4N0 in most coalescent simulations (e.g., ms Hudson 2002), which can be multiplied by a branch length to obtain the length of the lineage in generations. Branch lengths will be multiplied by this value and rounded up to the nearest integer to determine number of generations. For example, given the value 144740, a branch length of 0.228826612 in the phylogenetic tree would correspond to 144740 × 0.228826612 = 33,120.364 generations.

    --random_seed (OPTIONAL): integer with which to seed the random number generator. If not supplied, one will be chosen and reported. More specifically, we implement the checksum approach given in Programming Perl:

    --tracked_motif (OPTIONAL): a motif to track after each mutation. For example, to report the number of CpG sites over the course of a run, specify CG like so: --tracked_motif=CG .

    --track_mutations (OPTIONAL): reports the mutation rate and count over time.

    --excluded_taxa (OPTIONAL): path of file containing a list of taxa names (comma-separated) to be treated as outgroups, i.e., excluded from the VCF file and the consensus sequence(s). This might be desirable if a small number of taxa represent outgroups, to which polymorphism in an ingroup is being compared.

    --outgroups (OPTIONAL): number of outgroups to be excluded for identification of the ingroup most recent common ancestor (MRCA) and for calculation of ingroup variant frequencies in the VCF file. Outgroups are considered to be the most deeply-branching terminal taxa (external nodes). For example, if --outgroups=2 is specified, the fixed tree is navigated starting at the root. At each internal node, the branch containing the fewest terminal taxa is considered to contain the outgroup(s). Once the user-specified number of outgroups is identified, the most recent common ancestor (MRCA) node of the remaining (ingroup) taxa is identified and reported. If a set of non-arbitrary outgroup taxa does not exist for the user-specified number (e.g., if --outgroups=2 is called, but the two deepest splits contain three rather than two taxa), a warning is printed and no outgroups are used.

    --burn_in (OPTIONAL): number of generations to 'burn in' sequence evolution before initiating evolution at the tree root. Note that this is measured in absolute number of generations, not in branch units. Standard practice is 10N or 20N generations, where N is the (effective) population size.

    --vcf_output (OPTIONAL): name of a VCF format output file to be generated in the working directory, unless a full path name is given. If not specified, a file will be printed in the working directory with a .vcf extension using the name of the tree file as a prefix.

    --suppress_seed_seq (OPTIONAL): suppress printing the ancestral (seed) sequence in the output. This might be desirable if the seed sequence is very large and its inclusion in the output consumes too much disk space.

    --suppress_MRCA_seq (OPTIONAL): prevents the MRCA (ingroup most recent common ancestor) sequence from being printed.

    --suppress_consensus_seq (OPTIONAL): prevents the consensus sequence (containing the REF allele, here defined as the major allele, at each site) from being printed.

    --suppress_MUTATION (OPTIONAL): prevents the //MUTATION data from being printed.

    --verbose (OPTIONAL): tell Trevolver to report EVERYTHING that happens. Not recommended except for development and debugging purposes.

    Example input and output files are available in the EXAMPLE_INPUT and EXAMPLE_OUTPUT directories at this GitHub page, where reproducible examples are numbered (e.g., output_example1.txt). When the random seed has not been specified, exact results can be reproduced by using the same random number seed reported in the output file present in EXAMPLE_OUTPUT . Note that, if your input file(s) (e.g., tree_6taxa.txt) are not in the working directory (i.e., where your Terminal is currently operating), you will need to specify the full path of the file name (e.g., /Users/ohta/Desktop/trevolver_practice/tree_6taxa.txt). Also note that, in the examples below, a is used simply to continue the previous command on the line.

    EXAMPLE 1: A SIMPLE SIMULATION

    EXAMPLE 2: A SIMPLE SIMULATION WITH 1,000-GENERATION BURN-IN

    EXAMPLE 3: TYPICAL USAGE (program decides random seed not verbose)

    EXAMPLE 4: SIMULATION WITH TWO OUTGROUPS

    EXAMPLE 5: EVOLVE A SINGLE SEQUENCE FOR 1 MILLION GENERATIONS

    To simulate the evolution of a single sequence, provide a tree with only one taxon and branch length. For example, to simulate the evolution of a single sequence named "my_creature" for 1 million generations, the tree file would contain, simply, (my_creature:1000000) . Provide a scaling factor ( --branch_unit ) of 1, and you're good to go:

    EXAMPLE 6: MANY OPTIONS USED

    EXAMPLE 7: MINIMUM OPTIONS WITH OUTPUT TO SCREEN

    This will automatically generate a VCF file named tree_7taxa.vcf in the working directory.

    Depending on the options specified, Trevolver will output the following data:

    The beginning of the output will report:

    • COMMAND : how Trevolver was called.
    • RANDOM_SEED : the random seed used.
    • INPT_SEQUENCE : the nucleotide sequence used as input.
    • SEED_SEQUENCE : the nucleotide sequence used to seed the simulation. If there is no burn-in time, this is identical to the INPT_SEQUENCE .
    • MRCA_SEQUENCE : nucleotide sequence of the ingroup MRCA (most recent common ancestor).
    • CONS_SEQUENCE : consensus nucleotide sequence of the ingroup at the end of the simulation (i.e., tree tips).
    • TREE : the fixed tree on which the simulation took place.
    • MRCA_GENERATION : the generation (from time 0 at the root) in which the most recent common ancestor (MRCA) of the ingroup lived.
    • MRCA_SUBTREE : the subtree for which the MRCA is the root.
    • OUTGROUPS : names of the outgroups, if applicable.
    • MRCA_NODE_ID : node ID of the MRCA.
    • VCF_OUTPUT_FILE : file name of the VCF output.

    Additionally, following a brief SUMMARY OF RESULTS, the following flags indicate separate sections of more detailed output:

    • //MUTATION : the following lines contain a full mutation history with three columns: taxon, site, and mutation. The mutation column data is in the format [generation]-[ancestral allele]>[derived allele] . For example, a C>T mutation which occurred in generation 1,988 would be listed as 1988-C>T .
    • //TRACKED : the following lines contain tracked mutation rates and/or motif data with five columns: lineage, generation, mutation_rate, mutation_count, and motif_count.

    The Variant Call Format (VCF) output conforms to format VCFv4.1, such as used by the 1000 Genomes Project GRCh38/hg38 release, with some notable exceptions. For convenience and reproducibility, additional metadata headers (lines beginning with ## ) are used to indicate arguments used as Trevolver input, key results, and experiment specifications, including those described in Standard Output as well as:

    • burn_in_mutations : total number of mutations during burn-in period.
    • total_mutations : total number of mutations on all branches (excluding burn-in).
    • tree_length : total branch length (generations).
    • experiment_length : tree height, i.e., root-to-tip length (generations). It is currently required that all root-to-tip lengths, measured in generations, are equal.
    • simulation_time : length of run (seconds).

    Headers that are irrelevant (e.g., non-single nucleotide variant descriptors) have been removed. The input, seed, consensus (ingroup), and MRCA (ingroup) sequences are printed in the header metadata for convenience unless --suppress_input_seq , --suppress_seed_seq , --suppress_consensus_seq , or --suppress_MRCA_seq are called. Additionally, numerous data types (many unknowable in real-life evolutionary analyses) have been added to the INFO column:

    • REF / REF_OG : the consensus (major) allele of the ingroup/outgroup(s), which may or may not match the AA (ancestral allele).
    • AA : ancestral allele of whole tree (the allele of the seed sequence).
    • MRCA : Most Recent Common Ancestor (MRCA) allele with respect to the ingroup, if --outgroups is used.
    • MUTATIONS / MUTATIONS_OG : all unique mutations that have occurred at this site, e.g., G>A . Multiple mutations are comma-separated (e.g., G>A,A>G ) in chronological order, for convenience in downstream analyses. If outgroups or excluded taxa are specified, MUTATIONS refers to only the ingroup, while MUTATIONS_OG refers to only the outgroup(s).
    • GENERATIONS / GENERATIONS_OG : time (generation) at which the unique mutation(s) occurred, comma-separated in the same order (chronological). If outgroups or excluded taxa are specified, GENERATIONS refers to only the ingroup, while GENERATIONS_OG refers to only the outgroup(s).
    • TAXA / TAXA_OG : number of taxa which share the unique mutation(s), comma-separated in the same order (chronological). If outgroups or excluded taxa are specified, TAXA refers to only the ingroup, while TAXA_OG refers to only the outgroup(s).
    • ARBITRARY_REF : flag indicating there was a tie for the major/consensus/most common allele at this site. If the highest allele frequency is shared by two alleles, the one that comes first alphabetically is reported.
    • MULTIHIT / MULTIH_OG : flags indicating a site has experienced more than one mutation, in either the same or a distinct lineage, in the ingroup/outgroup(s).
    • MULTIALLELIC / MULTIA_OG : flags indicating a site has multiple minor (non-reference) alleles (i.e., >2 alleles) in the ingroup/outgroup(s). All multiallelic sites are multihit, but the reverse is not true.
    • BACK_MUTATION / BACK_M_OG : flag indicating a site has experienced back mutation, returning to a previous state/allele, in the ingroup/outgroup(s). All sites with back mutation have experienced multiple hits, but the reverse is not true.
    • RECURRENT_MUTATION / RECURRENT_M_OG : flag indicating a site has experienced recurrent mutation, i.e., the same change occurring multiple times independently, in the ingroup/outgroup(s).
    • INVARIANT_ANCESTRAL : flag indicating a site has no polymorphism in the ingroup leaves (extant taxa), and that the fixed state matches the ancestral allele (AA). Thus, even if a mutation occurred in the history of the site, the derived allele was lost via back mutation.
    • INVARIANT_DERIVED : flag indicating a site has no polymorphism in the ingroup leaves (extant taxa), and that the fixed state matches a derived allele that resulted from mutation. This implies that all extant ingroup sequences are descended from a mutated ancestor.
    • NO_ANCESTRAL : flag indicating that no ancestral (seed) alleles remain in the extant individuals of the ingroup. Note that this is compatible with the presence of polymorphism, so long as none of the alleles match the ancestral allele.
    • REF_OG : the consensus (major) allele of outgroup(s), which may or may not match the AA.
    • REF_OG_COUNT : count of outgroup allele matching the outgroup consensus.
    • REF_OG_AF : frequency of outgroup allele matching the outgroup consensus.
    • ALLELES_OG : comma-separated list of all alleles present in the outgroup(s).
    • ALLELE_COUNTS_OG : comma-separated listed of all allele counts for alleles present in the outgroup(s), in the same order as ALLELES_OG .
    • OG_FIXED : flag indicating a site is fixed for one allele (i.e., has no variation) in the outgroup(s). This will be true by definition if there is only only outgroup. Note that frequences of outgroup alleles can be retrieved from the ALLELE_COUNTS_OG data.
    • OG_DIVERGED : flag indicating one or more outgroup alleles differs from one or more ingroup alleles. The outgroups may be diverged from the ingroup but not fixed.
    • OG_SHARE : flag indicating one or more outgroup alleles matches one or more ingroup alleles. The outgroups may share alleles with the ingroup but not be fixed.

    If you have questions about Trevolver, please click on the Issues tab at the top of this page and begin a new thread, so that others might benefit from the discussion. Common questions will be addressed in this section.

    Trevolver was written with support from a Gerstner Scholars Fellowship from the Gerstner Family Foundation at the American Museum of Natural History to C.W.N. (2016-2019), and is maintained with support from a 中央研究院 Academia Sinica Postdoctoral Research Fellowship (2019-2021). The logo image was designed by Mitch Lin (2019) copyright-free DNA helix obtained from Pixabay. Thanks to Reed A. Cartwright, Michael Dean, Dan Graur, Ming-Hsueh Lin, Lisa Mirabello, Sergios Orestis-Kolokotronis, Michael Tessler, and Meredith Yeager for discussion.

    When using this software, please refer to and cite:

    Nelson CW, Fu Y, Li W-H. Trevolver: simulating non-reversible DNA sequence evolution in trinucleotide context on a bifurcating tree. bioRxiv doi: https://doi.org/10.1101/672717

    If you have questions about Trevolver, please click on the Issues tab at the top of this page and begin a new thread, so that others might benefit from the discussion.

    Other correspondence should be addressed to Chase W. Nelson:

    • Haller BC, Messer PW. 2019. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Molecular Biology and Evolution36:632–637.
    • Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics18:337–338.
    • Yang Z. 2014. Molecular Evolution: A Statistical Approach. New York, NY: Oxford University Press.


    Watch the video: Evolution Simulator The Life Engine (September 2022).


Comments:

  1. Duer

    What words ... super, wonderful thought

  2. Zolorr

    It's a pity that I can't speak now - I'm late for the meeting. But I will return - I will definitely write what I think.

  3. Beorht

    I read a lot about this topic today.

  4. Laibrook

    I totally agree.

  5. Faujinn

    Granted, his idea brilliantly

  6. Bar

    the Excellent answer



Write a message