
Standard datasets for testing new multiple sequence alignment algorithms?


Are there any open and freely distributed standard datasets for testing new algorithms for multiple sequence alignment?


I would suggest PAcAlCI, or Prediction of Accuracy in Alignments based on Computational Intelligence; though the acronym is weird, the tool is good for testing new sequence alignment algorithms.

But before you start testing your algorithm, I suggest taking a look at these papers:

[1] Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

[2] Issues in bioinformatics benchmarking: the case study of multiple sequence alignment

PS: I haven't tested or used the tool myself.


Kalign 3: multiple sequence alignment of large datasets

Kalign is an efficient multiple sequence alignment (MSA) program capable of aligning thousands of protein or nucleotide sequences. However, current alignment problems involving large numbers of sequences are exceeding Kalign’s original design specifications. Here we present a completely re-written and updated version to meet current and future alignment challenges.

Kalign now uses a SIMD (single instruction, multiple data) accelerated version of Gene Myers' bit-parallel algorithm to estimate pairwise distances, and adopts a sequence embedding strategy and the bisecting K-means algorithm to rapidly construct guide trees for thousands of sequences. The new version maintains high alignment accuracy on both protein and nucleotide alignments and scales better than other MSA tools.
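Kalign's actual implementation is SIMD-accelerated C; purely to illustrate the recurrence it accelerates, here is a minimal Python sketch of Myers' bit-parallel edit-distance algorithm (unit-cost global distance; Python's unbounded integers stand in for machine words, and the pattern is assumed non-empty):

```python
def myers_edit_distance(pattern, text):
    """Bit-parallel edit distance (Myers 1999): one text character is
    processed per iteration using word-wide bitwise operations."""
    m = len(pattern)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    # peq[c] has bit i set iff pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m   # +1 / -1 vertical delta vectors
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask   # carry-in of 1 gives the global distance
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score

# Example: myers_edit_distance("kitten", "sitting") returns 3.
```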

The source code of Kalign and code to reproduce the results are found here: https://github.com/timolassmann/kalign.


Materials and Methods

We source 200 alignments chosen randomly from the chordate section of TAED (Roth et al. 2005). This database consists of gene families identified by all-against-all BLAST searches, which are then subject to a range of phylogenetic analyses, including a scan for adaptive evolution. Given the comprehensive nature of the database, we apply strict quality controls on our data to ensure our sets of sequences are of comparable quality to those used in genomic scans of positive selection (Kosiol et al. 2008).

First, we remove fragmentary sequences (defined as sequences <50% of the alignment length) and any families whose sequences contain nonstandard amino acids or nonterminal stop codons, because these are indicative of poor quality sequences. To ensure sufficient evolutionary information to infer trees and adaptive evolution, we remove all families with fewer than 10 sequences. Furthermore, we remove families that contain no human sequence. This latter restriction is to provide a common source of reference for all analyses of adaptive evolution. To ensure families are sufficiently conserved for testing for adaptive evolution, we remove all families where the ML tree inferred from a Muscle amino acid alignment has an internal or external branch with length >0.5.

Aligners and Algorithms

For ClustalW, Muscle, ProbCons (Do et al. 2005), ProbAlign (Roshan and Livesay 2006), MUMMALS (Pei and Grishin 2006), DIALIGN-TX (Subramanian et al. 2008), and T-Coffee, default options were used to align each of the 200 sets of amino acid sequences. The --genafpair and --maxiterate=2000 options were used for MAFFT. For Prank, the recommended +F option was used on both amino acid sequences (labeled as Prank (AA)) and the corresponding codon sequences (Prank (Cod)). BAli-Phy runs were produced for both the amino acid (model WAG+Γ) and codon (model M0) data, with a separate random seed each time. The first 1,000 generations of each run were discarded as burn-in, and multiple runs were produced until at least 30,000 post burn-in generations had been produced, which were then pooled together for analysis. The mean numbers of post burn-in generations produced were 69,000 (codon) and 65,000 (amino acid). The maximum a posteriori (MAP) estimate of the BAli-Phy alignment was taken from the pooled data as a point estimate, with a second point estimate produced by posterior decoding (PD). To investigate and account for alignment uncertainty, we retained 20 independently spaced BAli-Phy MCMC samples. Pairwise distances between MSAs were measured using the devol metric of MetAl (Blackburne and Whelan 2012), which labels all bases by position in their sequence, and all gaps by their location in the sequence and inferred indel event on a phylogenetic tree. The devol metric is then defined as the normalized symmetric difference between all base–base and base–gap pairs of two alternative MSAs. The tree used to label indels was taken from the RAxML WAG+Γ tree on the Muscle alignment. Weighted principal co-ordinate analysis (PCoA, also known as multidimensional scaling) plots were produced using the “wcmdscale” routine of the R Vegan library (Oksanen et al. 2011; R Development Core Team 2011).
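The paper uses the weighted R routine wcmdscale; for readers who want the unweighted core of the computation, a classical-MDS (PCoA) sketch in Python over a MetAl distance matrix might look like this (numpy assumed; no weighting, unlike wcmdscale):

```python
import numpy as np

def pcoa(D, n_axes=None):
    """Classical MDS / principal co-ordinate analysis of a symmetric
    matrix of pairwise MSA distances (e.g. MetAl's devol metric)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    w, v = np.linalg.eigh(B)
    order = np.argsort(w)[::-1]                # largest eigenvalues first
    w, v = w[order], v[:, order]
    # keeping all positive eigenvalues preserves the real distances
    keep = (w > 1e-9) if n_axes is None else np.arange(n_axes)
    return v[:, keep] * np.sqrt(w[keep])       # coordinates on the PCoA axes
```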

To test whether the two classes of multiple sequence alignment method (MSAM) for each family are significantly different, we take as a test statistic the distance, d, between the center of gravity (centroid) for each of the alignment classes. This approach may be considered similar to measuring the difference between the means of two classes of MSAM. To compute the centroids, we first use PCoA to project the pairwise distances between MSAs on to a Euclidean space, using as many dimensions as necessary to preserve the real distances. The centroid for any given class of MSAs is the mean of the vectors locating the MSAs on the PCoA axes, and the distance between the two centroids can be simply computed. To assess the significance of the test statistic, we take a bootstrap approach. For each of 5,000 bootstrap replicates, we randomly reassign MSAs to the two classes, maintaining their relative size. The distance between the centroids for each pseudoclass, d*, is then computed and treated as a sample from the distribution under the null hypothesis of the two classes not being different. The position of d in the distribution of d* is then used to assign a P value. Note that the two sets of 20 BAli-Phy MCMC samples are not included in this analysis.
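A sketch of this bootstrap test in Python (numpy assumed; coords are the PCoA coordinates from above, labels a boolean vector giving the class of each MSA):

```python
import numpy as np

def centroid_distance(coords, labels):
    # distance d between the centroids of the two classes of MSAs
    return np.linalg.norm(coords[labels].mean(axis=0)
                          - coords[~labels].mean(axis=0))

def bootstrap_p_value(coords, labels, n_reps=5000, seed=0):
    """P value: position of the observed d within the null distribution
    of d* obtained by randomly reassigning MSAs to two classes of the
    same relative sizes."""
    rng = np.random.default_rng(seed)
    d = centroid_distance(coords, labels)
    d_star = np.array([centroid_distance(coords, rng.permutation(labels))
                       for _ in range(n_reps)])
    return float((d_star >= d).mean())
```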

Sequence Analysis on Fixed Alignments

Trees were inferred exclusively from amino acid MSAs. Each alignment was translated to amino acids (if necessary), and RAxML (Stamatakis 2006) was used to infer a phylogenetic tree estimate under the WAG+Γ model (Whelan and Goldman 2001). The distances between trees were calculated using GeoMeTree (Kupczok et al. 2008) using the option to find exact distances, except for one case where the calculation did not finish in reasonable time and approximate distances were used. Adaptive evolution was inferred using PAML (Yang 2007). Amino acid alignments were back translated into codon alignments using Pal2Nal (Suyama et al. 2006) where necessary. The M7 and M8 models were used to test for positive selection. M7 models ω (the ratio dN/dS) by a beta distribution between 0 and 1 (allowing for no positive selection). M8 adds a category to the mixture model of ω > 1, allowing a proportion of sites in the alignment to evolve under positive selection. Parameters for M7 and M8 were each optimized three times using PAML on the codon alignment and corresponding tree, and the likelihood ratio test (LRT) statistic was compared to a χ2 distribution to determine a P value. Alignments with P values below 0.05 were assessed using the Bayes empirical Bayes (BEB) method to yield probabilities of column-wise positive selection (Yang et al. 2005). BEB is used to calculate a posterior probability that an alignment column is best represented by the M8 model of evolution, incorporating a prior on the model parameters to allow for uncertainty in the parameter estimates. To allow comparison of sites inferred to be under adaptive evolution between alternative MSAMs, the column-wise BEB scores were translated into a sitewise probability of adaptive evolution on the longest human sequence. Sites with a BEB score >0.5 were regarded as corresponding to positive selection (Yang et al. 2005). The data were plotted using R and the “gplots” library.
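The LRT step reduces to a few lines; a sketch assuming scipy and the optimized log-likelihoods reported by PAML (two degrees of freedom, corresponding to the two extra parameters of M8, is the conventional choice for this test):

```python
from scipy.stats import chi2

def m7_m8_lrt(lnL_m7, lnL_m8, df=2):
    """Likelihood ratio test for positive selection: twice the
    log-likelihood difference is compared to a chi-squared distribution."""
    stat = 2.0 * (lnL_m8 - lnL_m7)
    return stat, chi2.sf(stat, df)   # test statistic and P value
```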

Incorporating Alignment Uncertainty When Detecting Adaptation

Any given MSA is a point estimate and may include columns that have limited support given the MSAM. These low confidence columns may provide positive support for adaptive evolution and, if a point estimate is used, lead to high rates of false positives. Correcting for MSA uncertainty may therefore be important when performing the M7/M8 LRT. We take two statistical approaches for incorporating MSA uncertainty. The first is an approximation to a marginal likelihood approach, which is a classical likelihood approach (Edwards 1978) and treats the MSA as a nuisance parameter by integrating across all possible MSAs, whereas the second is a profile likelihood approach.

Marginal Likelihood

Profile Likelihood

Incorporating Alignment Uncertainty When Identifying Sites under Adaptation

Incorporating alignment uncertainty for sitewise BEB analysis is straightforward under the profile likelihood approach, where the BEB analysis from the alignment with the highest likelihood under M8 is used for our inference. No software is available for inferring BEB posterior probabilities across a set of MSAs, and it would be challenging to develop and implement such an approach. Instead, we obtain approximate bounds for the BEB posterior probabilities under the marginal likelihood approach. Nonzero posterior probabilities are assigned for cases where the LRT produces sufficient support for M8 relative to M7 (P < 0.05), with the lower (upper) bound taken as the lowest (highest) BEB posterior probability from our 20 BAli-Phy samples. We also take a mean BEB value as the mean BEB probability across all 20 alignments. For sitewise analyses, we allocate putative “true positives” (pTPs) to be those obtained from our lower bounds of the marginal likelihood approach, which is expected to be highly conservative because it requires strong support for the adaptive evolution model across a range of MSAs and then takes the most conservative BEB analysis from those MSAs.
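A sketch of the bound computation (numpy assumed; beb is an alignments × sites array of BEB posterior probabilities mapped onto the longest human sequence, and significant flags the alignments whose M7/M8 LRT gave P < 0.05):

```python
import numpy as np

def beb_bounds(beb, significant):
    """Approximate lower/upper bounds and mean of sitewise BEB posterior
    probabilities across the BAli-Phy MCMC alignment samples;
    non-significant alignments contribute zero, as in the text."""
    beb = np.asarray(beb, dtype=float)
    masked = np.where(np.asarray(significant)[:, None], beb, 0.0)
    lower = masked.min(axis=0)     # most conservative bound: defines the pTPs
    upper = masked.max(axis=0)
    mean = beb.mean(axis=0)        # mean BEB across all alignments
    return lower, upper, mean
```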


Methods

We ran two experiments in this study. The first experiment evaluated PASTA+BAli-Phy on 1000-sequence datasets in comparison to other alignment methods, and the second experiment evaluated UPP using PASTA+BAli-Phy to compute the backbone alignment and tree in comparison to other alignment methods on 10,000-sequence datasets. All datasets are available from prior publications.

Algorithms

BAli-Phy is a method that uses Gibbs sampling to alternately sample a new alignment and then a new phylogeny, each in proportion to its likelihood under the program's sequence evolution model. Unlike standard phylogenetic models, such as the Generalized Time Reversible (GTR) model [20] in which only substitutions occur, the stochastic models in BAli-Phy, RS05 and subsequently RS07, also model indels. The resulting set of sampled phylogeny-alignment pairs constitutes an estimate of the joint posterior distribution. BAli-Phy does not have a well-defined stopping rule, and will run until it is terminated. Hence, to compute a single MSA using BAli-Phy, it is necessary to define a stopping rule and a method for extracting the final alignment. In the study presented here, BAli-Phy was stopped after 24 hours of running independently on all 32 cores of a single node on the Blue Waters computing facility at UIUC [21]. Once completed, the posterior decoding (PD) alignment was computed using the alignment-max command within BAli-Phy and designated as the output alignment. The PD alignment is obtained by scoring each column in the sample alignments according to how often it appears, and choosing the set of columns that a) constitutes a valid MSA on the data and b) has the largest cumulative score possible. We chose the PD alignment because prior studies have shown that the PD alignment was more accurate than the MAP (maximum a posteriori) alignment [12, 22].
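To make the column-scoring step concrete, a toy sketch (not BAli-Phy's alignment-max code): each column is encoded by the homology statement it makes, so identical columns from different samples compare equal; selecting the maximum-weight subset of columns that forms a valid MSA is the harder combinatorial step that alignment-max performs.

```python
from collections import Counter

def column_scores(sampled_alignments):
    """Count how often each column appears across sampled alignments.
    A column is a tuple with one entry per sequence: the residue index
    that sequence contributes, or None for a gap."""
    counts = Counter()
    for columns in sampled_alignments:
        counts.update(columns)
    return counts
```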

For all experiments described in this paper, we use “BAli-Phy” to refer specifically to the protocol described above for computing a multiple sequence alignment from a given input, using BAli-Phy v2.3.6. No restrictions or starting data were provided to the software. The commands for its execution, as well as for computation of the PD alignment, are provided in Additional file 1.

MAFFT is a well-known method for multiple sequence alignment that has been consistently one of the top performing methods in terms of alignment accuracy on both nucleotide and amino acid benchmarks [12, 23]. MAFFT has many ways of being run, but its most accurate settings, such as using the local pairs (MAFFT L-INS-i) command, are computationally very intensive on large datasets. MAFFT run in default mode will select the variant to run based on the dataset size, but will not typically have the same high accuracy as when run using the local pairs command.

PASTA is an algorithm for large-scale multiple sequence alignment that has several algorithmic parameters that can be set by the user, but also has default settings, which we now describe. PASTA operates by initializing an alignment, then iteratively estimating a maximum likelihood (ML) tree using FastTree-2 [24] on the alignment, estimating an alignment with the help of this tree, and repeating. The calculation of the new alignment given the current tree is obtained using a specific divide-and-conquer strategy, wherein the tree is broken into subtrees through repeatedly deleting centroid edges until each subtree has a small enough number of sequences (the default maximum size is 200). Then, the preferred multiple sequence alignment method (default is MAFFT L-INS-i) is used to align each subset, yielding a set of subset MSAs. Then, every pair of subset alignments that are adjacent to each other in the tree are merged into a larger alignment using a profile-profile alignment technique (default is OPAL [25]). This produces a set of larger subset alignments that overlap and agree pairwise in all homologies for the sequences that they share, which enables an alignment on the entire set to be computed using transitivity. The number of times this process iterates can be set by the user, but the default is three. As shown in [18], PASTA improves on both SATé [12] and SATé-II [17] in terms of accuracy and scalability to large datasets.
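In outline, the PASTA loop can be sketched as below; every helper here (initial_alignment, estimate_ml_tree, split_by_centroid_edges, align_subset, merge_adjacent, transitivity_merge) is a hypothetical stand-in for the corresponding component named in the text, not PASTA's actual API.

```python
def pasta_iterations(sequences, n_iters=3, max_subset=200):
    # Hypothetical helpers standing in for PASTA's components:
    #   initial_alignment        - starting alignment
    #   estimate_ml_tree         - FastTree-2 ML tree on the current alignment
    #   split_by_centroid_edges  - delete centroid edges until subsets are small
    #   align_subset             - subset aligner (default MAFFT L-INS-i)
    #   merge_adjacent           - OPAL profile-profile merge of adjacent pairs
    #   transitivity_merge       - combine overlapping merged alignments
    alignment = initial_alignment(sequences)
    for _ in range(n_iters):
        tree = estimate_ml_tree(alignment)
        subsets = split_by_centroid_edges(tree, max_subset)
        sub_alns = [align_subset(s) for s in subsets]
        merged = merge_adjacent(sub_alns, tree)
        alignment = transitivity_merge(merged)
    return alignment
```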

PASTA variants: PASTA has default settings as described above that were selected for use with MAFFT L-INS-i as the subset aligner. However, PASTA can be used with any MSA method as the subset aligner. In this paper, we examine the effect of using BAli-Phy instead of MAFFT L-INS-i within PASTA. In order to implement this, some additions to the infrastructure within PASTA were necessary. See Additional file 1 for details.

Because BAli-Phy requires 24 hours and a 32-core server to run whereas MAFFT L-INS-i runs on 200 sequences in a matter of minutes, replacing MAFFT L-INS-i for the initial iterations when the subsets are effectively (more) random would have been a poor use of expensive computing resources. We therefore chose to implement it by running PASTA in default mode (which involves three iterations), and then performing the fourth iteration using BAli-Phy as the subset aligner. Because BAli-Phy is able to run on datasets with 100 sequences, we set the decomposition size to 100 instead of 200, which is the default setting. All other parameters were run in default mode. The two natural lines of inquiry with the tests were therefore (a) does the fourth iteration using BAli-Phy improve the alignment compared with the result after the first three iterations (i.e., PASTA in its default settings), and if so, (b) can we be sure it is due to BAli-Phy and not simply that we used an extra iteration? To explore these questions, we tested the following three variants of PASTA:

PASTA(default): PASTA with fully default settings, which means three iterations, maximum subset size 200, with MAFFT L-INS-i as the subset-aligner, and OPAL to align pairs of subset alignments. We denote this by P(default).

PASTA+BAli-Phy: PASTA with three iterations under default settings, followed by one iteration with maximum subset size 100 and BAli-Phy as the subset aligner. (Equivalently, the final iteration was simply run with the phylogeny estimated in (1) specified as an input.) We denote this by P+BAli-Phy.

PASTA+MAFFT-L: PASTA with three iterations under default settings, followed by one iteration with maximum subset size 100 and MAFFT L-INS-i as the subset aligner. (Also equivalently specified as a single-iteration.) We denote this by P+MAFFT-L.

UPP is a fast multiple sequence alignment method that scales easily to 1,000,000 sequences, and is especially robust to fragmentary sequences compared to PASTA [4]. UPP works by choosing a random subset of (at most) 1000 sequences in the dataset to be the “backbone” and aligns those sequences with PASTA. It then constructs a collection of HMMs (called an “ensemble of HMMs”) on the backbone alignment. For each of the remaining sequences, it finds the HMM from the ensemble that has the best bitscore, and uses that HMM to add the sequence to the backbone alignment. These additions are done independently, because the backbone alignment does not change during the process. UPP runs in time that is linear in the number of sequences in the input, and is also highly parallelizable. We present results using UPP with the three variants of PASTA described above to compute the backbone alignment and tree on 1000-sequence subsets of different 10,000-sequence datasets.
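A toy sketch of UPP's insertion phase follows; bitscore and insert_with_hmm are hypothetical stand-ins for the HMMER-based scoring and alignment steps, not UPP's actual API.

```python
def upp_extend(backbone_alignment, hmm_ensemble, remaining_seqs,
               bitscore, insert_with_hmm):
    """For each non-backbone sequence, pick the ensemble HMM with the
    best bitscore and use it to add the sequence to the (fixed) backbone
    alignment. Each sequence is handled independently, so the loop
    parallelizes trivially and total work is linear in sequence count."""
    rows = []
    for seq in remaining_seqs:
        best_hmm = max(hmm_ensemble, key=lambda h: bitscore(h, seq))
        rows.append(insert_with_hmm(best_hmm, backbone_alignment, seq))
    return rows
```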

Maximum likelihood trees were estimated on each estimated and true 1000-sequence alignment using RAxML [26] and FastTree-2 [24], two of the most accurate methods for large-scale maximum likelihood [27]. For the 10,000-sequence datasets, we only used FastTree-2, since RAxML is too slow on such datasets. We ran RAxML and FastTree-2 in their default modes under the GTR model with gamma-distributed rates across sites.

In order to test the algorithms described above, a collection of simulated datasets used in [18] was downloaded from the authors’ website. This collection included data generated by three separate sequence evolution simulators, Indelible [28], RNASim [29], and RoseDNA [30]. Each of these simulators has distinct properties, and hence represents a unique set of simulation conditions. Two of the three (Indelible and RNAsim) included 10,000 sequences in each replicate, while the third (RoseDNA) included only 1,000. For the former, ten replicates from each simulator were used and a single set of 1,000 sequences was randomly chosen from the original.

Table 1 contains some descriptive statistics for the reference alignments of each of the 1,000-sequence simulated datasets. The RNAsim data are considerably different from the other two, with longer sequences and a shorter evolutionary diameter, as well as many more indels of shorter length. The RoseDNA and Indelible data, on the other hand, are similar to each other, with the primary difference being the overall rate of evolution. Finally, the individual RoseDNA model conditions vary chiefly with respect to the length of the indels. In all, each of the three simulators provides insight into a unique part of the data space. Detailed descriptions of the simulators and the data used are provided below.

RoseDNA is a subset of a larger collection of DNA sequences simulated using the ROSE simulator [30] that was used in [12] to evaluate SATé in comparison to other MSA methods. The ROSE simulator is a straightforward implementation of the HKY stochastic model, which is itself a close precursor to the standard Generalized Time Reversible (GTR) model [20] in use today. The simulator adds an additional model that allows the user to simulate insertions (and similarly deletions) by simulating, in order, the number of insertion events that occur, the position of each insertion followed by its length. We used 10 replicates of the 1,000-sequence datasets from the model conditions labeled 1000M1, 1000S1 and 1000L1 from [12], where the M/S/L moniker refers to the average gap length (i.e. medium, short or long, respectively) of each indel event. The specific model conditions we selected have high rates of evolution, and were selected to provide a substantial challenge to the MSA methods.

Indelible is similar to ROSE, but includes some additions that accommodate additional model complexity, such as gamma-distributed rates across sites and a codon model. The Indelible data used for these experiments are the same data used in [18], and include only the model condition labeled M2 in that paper, which has the highest rate of evolution of the three that were used.

RNAsim simulates RNA sequence evolution down a tree, specifically taking RNA structure into account, and hence represents a significant departure from the previous two. It uses a population genetics model with selection to simulate sequence mutations, with selection favoring mutations with a relatively low free energy in its folded state. This is designed to emulate actual conditions that might plausibly be acting on mutations to RNA sequences, particularly those in a folded state such as ribosomal RNA. As a result, it has several major differences from the other simulators. First, there is no uniform substitution matrix used in the simulation. Second, site mutation probabilities are not independent of one another. Importantly, by contrast with the other two simulators, these differences are a departure from the likelihood model (GTRGAMMA) used in the maximum likelihood phylogeny estimation step of PASTA, and also a departure from the substitution model used by BAli-Phy. Therefore, results on the RNAsim data provide a test of the MSA method’s robustness to model misspecification, and indirectly also test the ability of GTRGAMMA maximum likelihood phylogeny estimation to be robust to substantial model misspecification.

Evaluation criteria

We explore alignment accuracy using three standard criteria: modeller score (i.e., precision), SP score (i.e., recall), and total column (TC) score, as computed by FastSP [31]. The modeller score is equivalent to 1-SPFP, where SPFP is the “sum-of-pairs false positive rate”; similarly, the SP score is equivalent to 1-SPFN, where SPFN is the “sum-of-pairs false negative rate”. These SPFP and SPFN error rates are based on homologies between nucleotides that appear in the true and estimated alignments [31]. The TC score is the fraction of the columns in the true alignment that are recovered in the estimated alignment. All accuracy criteria are given as a percentage, with 100 % indicating perfect accuracy.
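Expressed on sets of homologous residue pairs and full columns, the three criteria are simple ratios; a sketch of the definitions (FastSP computes these far more efficiently):

```python
def alignment_scores(true_pairs, est_pairs, true_cols, est_cols):
    """Modeller score (precision), SP score (recall) and TC score from
    sets of homologous residue pairs and sets of complete columns."""
    shared = len(true_pairs & est_pairs)
    modeller = 100.0 * shared / len(est_pairs)          # 1 - SPFP
    sp = 100.0 * shared / len(true_pairs)               # 1 - SPFN
    tc = 100.0 * len(true_cols & est_cols) / len(true_cols)
    return modeller, sp, tc
```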

We explore the phylogenetic accuracy of maximum likelihood (ML) trees computed on these alignments using the Robinson-Foulds (RF) error rate, where the RF error is the percentage of the non-trivial bipartitions in the true tree that are missing from the estimated tree. We report accuracy using “Delta-RF”, which is the change in the RF error rate between the ML tree computed using the estimated alignment and the ML tree computed on the true alignment. The RF error rates were calculated using DendroPy [32].
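A sketch of the Delta-RF bookkeeping with DendroPy (treecompare.false_positives_and_negatives is assumed to return the counts of bipartitions unique to each tree; both trees must be built on a shared taxon namespace, and the true tree is assumed binary and unrooted, so it has len(taxa) - 3 non-trivial bipartitions):

```python
from dendropy.calculate import treecompare

def rf_error_rate(true_tree, est_tree):
    """Percent of the true tree's non-trivial bipartitions that are
    missing from the estimated tree."""
    _, missing = treecompare.false_positives_and_negatives(true_tree, est_tree)
    return 100.0 * missing / (len(true_tree.taxon_namespace) - 3)

def delta_rf(true_tree, tree_on_estimated_aln, tree_on_true_aln):
    # change in RF error caused by using the estimated alignment
    return (rf_error_rate(true_tree, tree_on_estimated_aln)
            - rf_error_rate(true_tree, tree_on_true_aln))
```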


An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics

Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability of MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.

1. Introduction

Multiple sequence alignment (MSA) is an essential and widely used computational procedure for biological sequence analysis in molecular biology, computational biology, and bioinformatics. MSAs compare homologous sequences in order to perform phylogenetic reconstruction, protein secondary and tertiary structure analysis, and protein function prediction [1]. Biologically good and accurate alignments can have significant meaning, showing relationships and homology between different sequences, and can provide useful information, which can be used to further identify new members of protein families. The accuracy of MSA is of critical importance because many bioinformatics techniques and procedures depend on MSA results [1].

Due to the significance of MSA, many MSA algorithms have been developed. Unfortunately, constructing accurate multiple sequence alignments is a computationally intense and biologically complex task, and as such, no current MSA tool is likely to generate a biologically perfect result. This area of research is therefore very active, aiming to develop methods that can align thousands of lengthy sequences and produce high-quality alignments in a reasonable time [2, 3]. Alignment speed and computational complexity are negatively affected as the number of sequences to be aligned increases. Recent advances in high-throughput sequencing technologies mean that sequence output is growing at an exponential rate, the biology landscape being punctuated by a number of large-scale projects such as the Human Genome Project [4], 1000 Genomes Project [5], and Genome 10K Project [6]. Indeed, technologies such as Roche/454 [7], Illumina [8], and SOLiD [9] are capable of producing Giga basepairs (Gbp) per machine per day [10]. The entire worldwide second-generation sequencing capacity surpasses 13 Pbp per year (recorded in 2011) and is continuing to increase yearly by a factor of five [11]. Other large-scale data are emerging from high-throughput technologies, such as gene expression data sets, protein 3D structures, protein-protein interactions, and others, which are also generating huge sequence data sets. The analysis and storage of the growing genomic data represent the central challenge in computational biology today.

As the protein alignment problem has been studied for several decades, studies have shown considerable progress in improving the accuracy, quality, and speed of multiple alignment tools, with manually refined alignments continuing to provide superior performance to automated algorithms. However, more than three sequences of biologically relevant length can be difficult and time consuming to align manually; therefore, computational algorithms are used as a matter of course [2]. Sequences can be aligned over their entire length (global alignment) or at specific regions (local alignment). Multiple sequence alignment of protein sequences is much more difficult than the DNA equivalent (which contains only 4 nucleotides) because there are 20 different amino acids. Global optimization techniques, developed in applied mathematics and operations research, provide a generic toolbox for solving complex optimization problems. Global optimization is now used on a daily basis, and its application to the MSA problem has become routine [12]. Local alignments are preferable; however, they can be challenging to calculate due to the difficulty of identifying sequence regions of similarity. The two aspects of MSA tools that matter most to the user are biological accuracy and computational complexity. Biological accuracy concerns how close the multiple alignment is to the true alignment: whether the sequences are aligned correctly, with insertions, deletions, and gaps in the right positions. Computational complexity refers to the time, memory, and CPU requirements, and is of increasing relevance as the number of sequences to be aligned grows. The complexity of early MSA tools was typically O(L^N), where O denotes the complexity, L is the length of the sequences, and N is the number of sequences to be aligned. Until recently, this was not a problem because N was always smaller than L; therefore, most algorithms concentrated on how to deal with lengthy sequences rather than with the number of sequences. The situation has now changed: many alignments have N larger than L, so new and more recent MSA algorithms concentrate not only on the length of sequences but also on their increasing number [13].

A wide range of computational algorithms have been applied to the MSA problem, including slow, yet accurate, methods like dynamic programming and faster but less accurate heuristic or probabilistic methods. Dynamic programming (DP) is a mathematical and computational method which simplifies a complicated problem by subdividing it into smaller and simpler components in a repeated manner. The dynamic programming technique can be applied to global alignments by using methods such as the Needleman-Wunsch algorithm [14] and to local alignments by using the Smith-Waterman algorithm [15]. Up to the mid-1980s, traditional multiple sequence alignment algorithms were only suited to two sequences, so when it came to producing a multiple sequence alignment with more than two sequences, it was found that completing the alignment manually was faster than using traditional dynamic programming algorithms [16]. Dynamic programming algorithms are used for calculating pairwise alignments (two-sequence alignments) with a time complexity of O(L^2). In theory, this method could be extended to more than two sequences; in practice, however, it is too complex, because the time and space complexity become very large [17]. Therefore, producing a multiple sequence alignment requires the use of more sophisticated methods than those used in producing a pairwise alignment, as it is much more computationally complex. Finding a mathematically optimal multiple alignment of a set of sequences can generally be defined as a complex optimization problem or NP-complete problem, as it must identify an MSA with the highest score from the entire set of alignments; therefore, heuristic (“best guess”) methods must be used.
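For concreteness, a minimal Python sketch of the Needleman-Wunsch recurrence (linear gap penalty, score only; tracing back through F recovers the alignment itself):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score by dynamic programming,
    O(len(a) * len(b)) time and space."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                       # align prefix of a to gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap                       # align prefix of b to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # (mis)match
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    return F[n][m]
```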

2. Multiple Sequence Alignment Algorithms

The most popular heuristic, from which the majority of multiple sequence alignments are generated, is that developed by Feng and Doolittle [18], which they referred to as “progressive alignment” [16, 18]. Progressive alignment works by building the full alignment progressively: first, pairwise alignments are completed using methods such as the Needleman-Wunsch algorithm, Smith-Waterman algorithm, k-tuple algorithm [19], or k-mer algorithm [20], and then the sequences are clustered together to show the relationships between them using methods such as mBed and k-means [21]. Similarity scores are normally converted to distance scores, and guide trees are constructed from these scores by guide tree building methods such as Neighbour-Joining (NJ) [22] and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [23]. Once the guide tree is built, the multiple sequence alignment is assembled by adding sequences to the alignment one by one according to the guide tree, that is, the most similar sequences are added first, followed gradually by more distant sequences. Unfortunately, this heuristic has a greedy nature; that is, it looks at only two sequences at a time and ignores the remaining data, and therefore cannot guarantee an optimal solution. Also, if mistakes are made in the initial stages of the alignment, they cannot be fixed in later stages, and the mistake will continue throughout the alignment process, with the problem worsening as the number of sequences increases. Progressive alignment is the foundation of several popular alignment algorithms such as ClustalW [24], Clustal Omega [21], MAFFT [25], Kalign [26], Probalign [27], MUSCLE [13], DIALIGN [28], PRANK [29], FSA [30], T-Coffee [31, 32], ProbCons [33], and MSAProbs [34]. Different methods for producing multiple sequence alignment exist, and their use depends on user preferences and sequence length and type, as shown in Table 1.

An improved version of the progressive alignment method was developed, called “iterative progressive algorithms.” These algorithms work in a similar manner to progressive alignment; however, this approach repeatedly applies dynamic programming to realign the initial sequences in order to improve their overall alignment quality, while at the same time adding new sequences to the growing MSA. The iteration benefits the alignment by correcting any errors produced initially, thereby improving the overall accuracy of the alignment [35]. Iterative methods are able to give 5%–10% more accurate alignments; however, they are limited to alignments of a few hundred sequences only [21]. The most used iterative alignment algorithms include PRRP [36], MUSCLE [13], Dialign [28], SAGA [37], and T-COFFEE [32, 38].

Multiple sequence alignments can also be constructed by using existing protein structural information. It is believed that by incorporating structural information into the alignment, the final MSA accuracy can be increased; therefore, most structure-based MSAs are of higher quality than those based on sequence alignment only. The reason structure-based MSAs are of better quality is not a better algorithm but rather an effect of the evolutionary stability of structures; that is, structures evolve more slowly than sequences [39]. The most popular structure-based MSA tool is 3D-COFFEE [40], and others include EXPRESSO [41] and MICAlign [42].

Motif discovery algorithms are another type of MSA algorithm in use. These methods are used to find motifs in long sequences; this process is viewed as a “needle in a haystack” problem, because the algorithm looks for a short stretch of amino acids (a motif) within a long sequence. Widely used tools for searching for motifs include PHI-BLAST [43] and Gapped Local Alignments of Motifs (GLAM2) [44].

Short sequence alignment algorithms are also beginning to emerge, primarily due to advances in sequencing technologies. Most genomic sequence projects use short read alignment algorithms such as Maq [45], SOAP [46], and the very fast Bowtie [47] algorithms.

3. Top Multiple Sequence Alignment Algorithms

The number of multiple sequence alignment algorithms is increasing on an almost monthly basis, with 1-2 new algorithms published per month. The computational complexity and accuracy of alignments are constantly being improved; however, there is no biologically perfect solution as yet. ClustalW (one of the first members of the Clustal family, after ClustalV) is probably the most popular multiple sequence alignment algorithm, being incorporated into a number of so-called black box commercially available bioinformatics packages such as DNASTAR, while the recently developed Clustal Omega algorithm is the most accurate and most scalable MSA algorithm currently available. ClustalW and Clustal Omega are described later, and a brief description is also provided for the T-Coffee, Kalign, MAFFT, and MUSCLE multiple sequence alignment algorithms.

3.1. ClustalW

ClustalW [24] was introduced by Thompson et al. in 1994 and quickly became the method of choice for producing multiple sequence alignments as it presented a dramatic increase in alignment quality, sensitivity, and speed in comparison with other algorithms. ClustalW incorporates a novel position-specific scoring scheme and a weighting scheme for downweighting overrepresented sequence groups, with the “W” representing “weights.” Firstly, the algorithm performs a pairwise alignment of all the sequences (nucleotide or amino acid) using the k-tuple method of Wilbur and Lipman [19], which is a fast, albeit approximate, method, or the Needleman-Wunsch method [14], which is known as the full dynamic programming method. These methods calculate a matrix which shows the similarity of each pair of sequences. The similarity scores are converted to distance scores, and then the algorithm uses the distance scores to produce a guide tree, using the Neighbour-Joining (NJ) method [22] for guide tree construction. The last step of the algorithm is the construction of the multiple sequence alignment of all the sequences. The MSA is constructed by progressively aligning the most closely related sequences according to the guide tree previously produced by the NJ method (see Figure 1 for an overview).


Figure 1: The ClustalW algorithm, which works by taking an input of amino acid or nucleic acid sequences, completing a pairwise alignment using the k-tuple method and guide tree construction using the Neighbour-Joining method, followed by a progressive alignment to output a multiple sequence alignment.
3.1.1. Pairwise Alignment

The k-tuple method [19], a fast heuristic “best guess” method, is used for pairwise alignment of all possible sequence pairs. This method is specifically used when the number of sequences to be aligned is large. The similarity scores are calculated as the number of k-tuple matches (runs of identical residues, with k usually 1 or 2 for protein sequences or 2–4 for nucleotide sequences) in the alignment between a pair of sequences. The similarity score is calculated by dividing the number of matches by the sum of all paired residues of the two compared sequences. Fixed penalties for every gap are subtracted from the similarity score, and the similarity score is later converted to a distance score by dividing it by 100 and subtracting the result from 1.0, to provide the number of differences per site.
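The arithmetic of that conversion, as a short sketch (parameter names are ours, for illustration, not ClustalW's):

```python
def ktuple_distance(n_ktuple_matches, n_paired_residues, n_gaps, gap_penalty):
    """ClustalW-style pairwise distance: percent similarity from k-tuple
    matches minus fixed per-gap penalties, then converted to a distance
    (the number of differences per site)."""
    similarity = (100.0 * n_ktuple_matches / n_paired_residues
                  - gap_penalty * n_gaps)
    return 1.0 - similarity / 100.0
```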

Then, all of the k-tuples between the 2 sequences are located using a hash table. A dot matrix plot between the two sequences is produced with each k-tuple match represented as a dot. The diagonals with the most matches in the plot are found and marked within a selected “Window Size” of each top diagonal. This sets the most likely region for similarity between the two sequences to occur. The last stage of the k-tuple method is to find the full arrangement of all k-tuple matches by producing an optimal alignment similar to the Needleman-Wunsch method but only using k-tuple matches in the set window size, which gives the highest score. The score is calculated as the number of exactly matching residues in the alignment minus a “gap penalty” for every gap that was introduced.

3.1.2. Guide Tree Construction

ClustalW produces a guide tree according to the Neighbour-Joining method. The NJ method is often referred to as the star decomposition method [48]. The NJ method keeps track of nodes on a tree rather than taxa (a taxonomic category or group, such as phylum, order, family, genus, or species) or clusters of taxa. The similarity scores from the previous k-tuple step are used and stored in a matrix. A modified distance matrix is constructed in which the separation between each pair of nodes is adjusted by calculating an average value for divergence from all other nodes. The tree is then built by linking the least distant pair of nodes. When two nodes are linked, their common ancestral node is added to the tree and the terminal nodes with their respective branches are removed from the tree. This process converts the newly added common ancestor into a terminal node of a tree of reduced size. At each stage in the process, two terminal nodes are replaced by one new node. The process is complete when two nodes remain, separated by a single branch. The tree produced by the NJ method is unrooted, and its branch lengths are proportional to divergence along each branch. The root is placed at the position at which it makes the branch lengths on either side of the root equal. The guide tree is then used to calculate a weight for each sequence, which depends on the distance from the branch to the root. If a sequence shares a common branch with other sequences, then the two or more sequences will share the weight calculated from the shared branch, and the sequence lengths will be added together and divided by the number of sequences sharing the same branch.
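The “modified distance matrix” described here is the standard Neighbor-Joining selection criterion; a numpy sketch of the bookkeeping for one join step:

```python
import numpy as np

def nj_adjusted_matrix(D):
    """Q[i, j] = (n - 2) * D[i, j] - r_i - r_j, where r_i is the total
    divergence of node i from all other nodes; the pair minimizing Q
    is linked next."""
    n = D.shape[0]
    r = D.sum(axis=1)
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)   # never join a node with itself
    return Q

# The next pair of nodes to join:
# i, j = np.unravel_index(np.argmin(Q), Q.shape)
```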

3.1.3. Progressive Alignment

ClustalW’s progressive alignment uses a series of pairwise alignments to align sequences by following the branching order of the guide tree previously constructed by the NJ method. The procedure starts at the tips of the rooted tree proceeding towards the root. At each step, a full dynamic programming algorithm is used with a residue weight matrix (BLOSUM) and penalties for opening and extending gaps.

3.2. Clustal Omega

Clustal Omega is the latest MSA algorithm from the Clustal family. This algorithm is used to align protein sequences only (though nucleotide sequence support is likely to be introduced in time). The accuracy of Clustal Omega on small numbers of sequences is similar to that of other high-quality aligners; however, on large sequence sets, Clustal Omega outperforms other MSA algorithms in terms of completion time and overall alignment quality. Clustal Omega is capable of aligning 190,000 sequences on a single processor in a few hours [21]. The Clustal Omega algorithm produces a multiple sequence alignment by firstly producing pairwise alignments using the k-tuple method. Then, the sequences are clustered using the mBed method, followed by the k-means clustering method. The guide tree is next constructed using the UPGMA method. Finally, the multiple sequence alignment is produced using the HHalign package, which aligns two profile hidden Markov models (HMMs), as shown in Figure 2.
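A key to Clustal Omega's scalability is that mBed avoids computing the full O(N²) distance matrix before clustering. A sketch of the embedding idea (the (log₂ N)² reference count follows the published mBed heuristic, but treat the details here as illustrative rather than as Clustal Omega's actual code):

```python
import numpy as np

def mbed_embed(seqs, pairwise_distance, seed=0):
    """Represent each sequence by its vector of distances to a small set
    of reference sequences, so that clustering (k-means, then UPGMA on
    cluster representatives) needs far fewer distance computations."""
    rng = np.random.default_rng(seed)
    n = len(seqs)
    t = min(n, max(2, int(np.log2(n)) ** 2))    # reference-count heuristic
    refs = rng.choice(n, size=t, replace=False)
    return np.array([[pairwise_distance(s, seqs[r]) for r in refs]
                     for s in seqs])
```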


Discussion

Many MSA programs are freely available. However, choosing the most suitable program for each dataset is not trivial. The characteristics of the sequences to be aligned, such as their shared identity, number, and length, are aspects that must be assessed in every MSA-dependent project. Each MSA program's parameterization, such as the choice of substitution matrices and gap opening/extending penalties, when available, also strongly affects the final alignment [24]. Running MSA programs with default parameters is usually preferred when no information regarding the sequences to be aligned is available and/or for users without previous knowledge of this particular field of sequence analysis. With that in mind, we chose to benchmark a selection of programs mostly with their default options. Although the results presented herein are compatible with current low-cost hardware and the timelines of most research projects, they must be used only as guidelines, and we encourage users to carefully study each program's parameters in order to obtain the best possible output. The BAliBASE suite is a reliable benchmarking dataset, but it might still be considered too small for certain MSA projects [21]. Thus, understanding each program's own limitations is imperative in order to generate reliable results.

As stated in related papers [21, 22], no available MSA program outperformed all others in all test cases. For the first five reference sets, our results indicated that T-Coffee, Probcons, MAFFT and Probalign were definitely superior with regard to alignment accuracy in all BAliBASE datasets, consistent with similar publications [7, 8, 21, 22]. All four programs take a consistency-based approach in their algorithms, which has thus proved a successful improvement in sequence alignment. Despite meeting certain consistency criteria, DIALIGN-TX is based on local pairwise alignments and is known to be outperformed by global aligners [5]. Nevertheless, we observed that the consistency-based approach alone may not offer the highest quality of alignment. CLUSTAL OMEGA did well when aligning some datasets with long N/C terminal ends from full-length sequences (BB), despite using no consistency. The presence of these non-conserved residues at terminal ends, on the other hand, contributed to reduced scores in the alignments generated by T-Coffee and Probcons, which produced the highest SP/TC scores when aligning the truncated sequences (BBS). Despite having an iterative refinement step, which could improve results, Probcons is still a global alignment program, and is thus more prone to alignment errors induced by the presence of non-conserved residues at terminal ends [20]. Certainly MAFFT, Probalign and even CLUSTAL OMEGA may be preferred over T-Coffee and Probcons when aligning sequences with these long terminal extensions. The combination of an iterative refinement strategy with consistency from local alignments in MAFFT (the L-INS-i method) might have contributed to preventing and correcting errors in the alignment of the full-length sequences [22]. Similarly, the suboptimal alignments (determined by variations of the Temperature parameter) generated by the partition function of Probalign might likewise have improved the ability of this program to deal with sequences with non-conserved terminal extensions [8]. Apparently, the profile HMMs of long sequences also improved the alignments produced by CLUSTAL OMEGA.

As for the remaining reference sets of BAliBASE (6, 7 and 9), we observed that the four consistency-based programs mentioned above still generated better alignments, although MUSCLE presented improved results. In some subsets of Reference 9, MUSCLE was either close to or better than some of the top four SP/TC scoring programs. For this reference set, the alignment of sequences with linear motifs generated by MUSCLE might be facilitated by Kimura's distance, the second stage in the progressive alignment of this program. The Kimura distance assumes that only exact matches contribute to the match score. Although fast, the method has limitations, since it does not consider which amino acid changes are occurring between sequences. This limitation may turn into a benefit: since the program assumes the same penalty for any amino acid substitution in the early steps of progressive alignment, it avoids a distance increase between pairs of close sequences with errors or wildcard residues (any amino acid) at the linear motifs.
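For reference, the correction in question is Kimura's approximate protein distance, which depends only on the observed fraction p of mismatched residues (a sketch of the standard formula; MUSCLE's implementation may differ in details):

```python
import math

def kimura_protein_distance(p):
    """Kimura's approximation: corrects the observed fraction of
    differing residues p for multiple substitutions, without modeling
    which amino acids changed. Valid for p below roughly 0.85."""
    return -math.log(1.0 - p - 0.2 * p * p)
```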

In the largest BAliBASE datasets, the use of the multi-core capability of T-Coffee was indispensable in order to evaluate alignment accuracy because, when running in single-core mode, its computational time exceeded by far the pre-established threshold of 2.5 hours. In the biggest dataset (the last subset of Reference 9), T-Coffee took more than nine days to complete the alignment. The parallelization of T-Coffee should certainly be seen as a major improvement to an MSA program, as processing cores are growing in number even in home desktop computers, not to mention more and faster RAM modules. Interestingly, MAFFT was the only program, among the top four SP/TC scoring programs, able to align all reference sets in less than 2.5 hours with the pre-established settings described in the Methodology section. This is most likely due to the flexibility of the “auto” mode of MAFFT, which chooses the most appropriate method of alignment according to dataset size, changing from a high accuracy mode (L-INS-i) to a higher speed, lower accuracy mode (FFT-NS-2) [25]. Although not in the version used in this work, recent improvements in parallelization were also achieved for MAFFT [26], indicating a tendency to make full use of available hardware and reduce the execution time of MSA programs. Besides parallelization, there is still much room for improvement in the performance of multiple sequence alignment. For example, CLUSTAL OMEGA implemented a modified version of mBed [27], which produced fast and accurate guide trees and managed to reduce the computational time and memory requirements needed to finish the alignment of large datasets. Apart from performance, there is also much room for accuracy improvements, as some results presented in this study were still far from the BAliBASE reference alignments.


Introduction

Multiple sequence alignment (MSA) of a set of homologous sequences is an essential step of molecular phylogenetics, the science of inferring evolutionary relationships from molecular sequence data. Errors in phylogenetic analysis can be caused by erroneously inferring site homology or saturation of multiple substitutions [1], which often present as highly divergent sites in MSAs. To remove errors and phylogenetically uninformative sites, several methods “trim” or filter highly divergent sites using calculations of site/region dissimilarity from MSAs [1–4]. A beneficial by-product of MSA trimming, especially for studies that analyze hundreds of MSAs from thousands of taxa [5], is that trimming MSAs reduces the computational time and memory required for phylogenomic inference. Nowadays, MSA trimming is a routine part of molecular phylogenetic inference [6].

Despite the overwhelming popularity of MSA trimming strategies, a recent study revealed that trimming often decreases, rather than increases, the accuracy of phylogenetic inference [7]. This decrease suggests that current strategies may remove phylogenetically informative sites (e.g., parsimony-informative and variable sites) that have previously been shown to contribute to phylogenetic accuracy [8]. Furthermore, it was shown that phylogenetic inaccuracy is positively associated with the number of removed sites [7], revealing a speed–accuracy trade-off wherein trimmed MSAs decrease the computation time of phylogenetic inference but at the cost of reduced accuracy. More broadly, these findings highlight the need for alternative MSA trimming strategies.

To address this need, we developed ClipKIT, an MSA trimming algorithm based on a conceptually novel framework. Rather than aiming to identify and remove putatively phylogenetically uninformative sites in MSAs, ClipKIT instead focuses on identifying and retaining parsimony-informative sites, which (alongside other types of sites and features of MSAs, such as variable sites and alignment length) have previously been shown to be phylogenetically informative [8]. ClipKIT implements a total of 5 different trimming strategies. Certain ClipKIT trimming strategies also allow users to retain constant sites, which inform base frequencies in substitution models [9], and/or to trim alignments based on the fraction of taxa represented by gaps per site (or site gappyness). We tested the accuracy and support of phylogenetic inferences made with ClipKIT and other alignment trimming software on nearly 140,000 alignments from empirical datasets of mammalian and budding yeast sequences [8] and simulated datasets of metazoans, plants, filamentous fungi, and a larger sampling of budding yeast sequences [10–13]. We found that ClipKIT-trimmed alignments led to accurate and well-supported phylogenetic inferences that consistently outperformed other alignment trimming software. Additionally, we note that ClipKIT-trimmed alignments can save computation time during phylogenetic inference. Taken together, our results demonstrate that alignment trimming based on identifying and retaining parsimony-informative sites is a robust alignment trimming strategy.
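The criterion at the heart of this strategy is easy to state; a sketch of the standard definition of a parsimony-informative site (an illustration, not ClipKIT's actual implementation):

```python
from collections import Counter

def is_parsimony_informative(column):
    """A site is parsimony-informative if at least two character states
    each appear in at least two sequences; gaps and common ambiguity
    characters are ignored here."""
    counts = Counter(c for c in column if c not in "-?NnXx")
    return sum(1 for n in counts.values() if n >= 2) >= 2

def keep_informative_sites(alignment_columns):
    # retain parsimony-informative sites, the core of the approach
    return [col for col in alignment_columns if is_parsimony_informative(col)]
```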


INTRODUCTION

Sequence alignment is certainly one of the most well-developed and pervasive topics of computational molecular biology. Algorithms in this vein are widely used for tasks varying from the comparative analysis of rodent (1–5) and chicken (6) genomes to the construction of networks of protein interactions (7). With the current sequencing of many genomes (8), fast and sensitive sequence alignment algorithms will likely maintain or increase their role in biological research.

As more and more genomic data has become available, algorithms for locally aligning query sequences to genomic databases have become increasingly important (9). Because the exact Smith–Waterman (14) algorithm is impractical for large sequences, database search techniques are almost always based upon the paradigm of seeded alignments. The BLAST algorithm (10) was pivotal in popularizing such a technique, and it has since been incorporated into many tools, a few of which are BLASTZ (4), BLAT (13) and Exonerate (15). In such algorithms, a set of seeds is first generated between the database and the query. Each seed is then extended to determine whether it is part of a high scoring local alignment. Extensions typically consist of two phases: first the seed is extended into an un-gapped alignment, and if this alignment scores above a threshold, the seed is then extended with the allowance of gaps. An enhancement to this simple model is to extend only pairs of seeds close to each other (11). Seeds for the BLAST algorithm are traditionally fixed-length words present in both the database and the query, with the word length referred to as the seed's weight. This leads to an inevitable speed/sensitivity trade-off: heavier seeds prune a larger fraction of the search space but miss more alignments than do seeds with a smaller weight.

In recent years, the introduction of spaced seeds has led to significantly improved local alignment algorithms (12,16). Spaced seeds allow non-contiguous patterns of matching nucleotides to initiate a local alignment, and algorithms have been developed (17) to compute the probability that a seed will be found within an un-gapped alignment of a given length between two sequences. The optimal seed can then be chosen as the seed that maximizes this probability. It is useful to think of un-gapped alignments of homologous regions as being generated by a probabilistic model that specifies the distribution over matches and mismatches (17,20,22). The model outputs a bit string where each position corresponds to a position in the alignment; the bit is 1 if there is a match in the alignment and 0 if there is a mismatch. While higher-order models are possible (19), in this article we will focus on models that output a 1 independently in each position with a fixed probability, which is called the similarity level (12).
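Under this independent-match model, the sensitivity of a spaced seed is the probability that its pattern matches somewhere in the generated bit string. Exact dynamic-programming computations exist, per the references; the Monte Carlo sketch below merely illustrates the quantity being optimized:

```python
import random

def seed_sensitivity(pattern, aln_len, similarity, trials=100_000, seed=0):
    """Estimate the probability that a spaced seed hits an un-gapped
    alignment of length aln_len, where each position matches
    independently with probability `similarity`. `pattern` is e.g.
    '1101011': '1' marks the required match positions (the weight is
    the number of 1s)."""
    rng = random.Random(seed)
    care = [i for i, ch in enumerate(pattern) if ch == "1"]
    span = len(pattern)
    hits = 0
    for _ in range(trials):
        bits = [rng.random() < similarity for _ in range(aln_len)]
        if any(all(bits[s + i] for i in care)
               for s in range(aln_len - span + 1)):
            hits += 1
    return hits / trials
```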

In addition to being provably more sensitive than consecutive seeds in some cases (21), spaced seeds allow an important new speed/sensitivity trade-off. Rather than lowering the weight of a seed to boost sensitivity, one can index multiple seeds per position and obtain a linear, rather than exponential, rise in the size of the search space (18,20). Spaced seed design operates under a resource-constrained paradigm (17), where the weight and number of seeds is specified and the goal is to design an optimal set of seeds that fits these constraints.

In this article, we seek to build on these developments by taking advantage of increasing amounts of available genomic data as well as rapidly improving global multiple sequence alignment algorithms (23). We predict that in the near future, these trends will lead to the proliferation of genomic databases consisting of multiple alignments. Information implicit in an alignment has been used to aid in a variety of bioinformatics tasks (27), and, similarly, one can hope that a multiple alignment can be utilized to improve database search algorithms. Previous research on searching between multiple alignments has concentrated on position specific scoring schemes (11,31,32). PSI-BLAST (11) is the most popular such program; given a query sequence, it builds a multiple alignment, or profile, from a set of high scoring alignments of the query to the database. It then uses the constructed profile to iterate searches for improved sensitivity. Approaches in this vein have been successful, but in this paper, our focus is orthogonal to such techniques.

The problem we tackle is to align a query sequence to a fixed multiple alignment database. As an example, it may be desirable to augment a multiple alignment of mammalian genomes with a newly sequenced mammalian or vertebrate genome. Our approach uses the multiple alignment database to improve search sensitivity over that obtained using only a sequence database. To do this, we extend the resource-constrained paradigm to apply not only to seed design but also to seed allocation: we allow different positions in the database to index different sets of seeds and determine the best way to do so based on the information implicit in the multiple alignment.

We have implemented a local alignment tool, Typhon, which incorporates our indexing algorithm. Tests on real-world data show that Typhon is substantially more sensitive than standard sequence indexing algorithms as well as algorithms that index multiple alignments without using our dynamic indexing methodology. The performance improvement is most dramatic for indexes with a small number of spaced seeds, which is important for large-scale database searches. Source code for Typhon is available under the GNU public license at http://typhon.stanford.edu.


Background

Biological sequence alignment is a cornerstone of bioinformatics and is widely used in fields such as phylogenetic reconstruction, gene finding and genome assembly. The accuracy of sequence alignments and similarity measures is directly related to the accuracy of subsequent analyses. CDS alignment methods have many important applications for gene tree and protein tree reconstruction. In fact, they are useful for clustering homologous CDS into groups of orthologous splicing isoforms [1, 2] and for combining partial trees on orthology groups into a complete protein tree for a gene family [3, 4]. Aligning and measuring the similarity between homologous CDS requires accounting for frameshift (FS) translations, which cannot be detected at the amino acid (AA) level but lead to high similarity at the nucleotide level between functionally different sub-sequences.

FS translation consists of alternative AA translations of a coding region of DNA using different translation frames [5]. It is an important phenomenon resulting from different scenarios, such as insertion or deletion of a nucleotide sequence whose length is not a multiple of 3 in a CDS through alternative splicing [6, 7] or evolutionary genomic indels [8, 9], programmed ribosomal frameshifting [10], or sequencing errors [11]. Recent studies have reported the role of FS translations in the appearance of novel CDS and functions in gene evolution [6, 12]. FS translation has also been linked to several diseases, such as Crohn's disease [13]. The computational detection of FS translations requires the alignment of CDS while accounting for their codon structure. A classical approach for aligning two CDS, used in most alignment tools [14, 15], consists of a three-step method: the CDS are first translated into AA sequences using their actual coding frame, then the AA sequences are aligned, and finally the AA alignment is back-translated to a CDS alignment. This approach does not account for alternative AA translations between two CDS, and it leads to incorrect alignment of coding regions subject to FS translation. The opposite problem, aligning protein sequences while recovering their hypothetical nucleotide CDS sequences and accounting for FS translation, was also studied in several papers [16, 17].
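For concreteness, here is a minimal sketch of the third step of that classical pipeline (the function name is ours): mapping a gapped AA alignment back onto the original coding sequences. Because every gap is codon-sized, the result can never expose a frameshift, which is exactly the limitation discussed above.

```python
def back_translate(aa_aln_a: str, aa_aln_b: str, cds_a: str, cds_b: str):
    """Step 3 of the translate/align/back-translate approach: expand each
    aligned AA (or gap) to its source codon (or '---'). Assumes each CDS
    has exactly 3 nucleotides per non-gap AA in its aligned row."""
    out_a, out_b, i, j = [], [], 0, 0
    for ca, cb in zip(aa_aln_a, aa_aln_b):
        if ca == '-':
            out_a.append('---')
        else:
            out_a.append(cds_a[3 * i:3 * i + 3]); i += 1
        if cb == '-':
            out_b.append('---')
        else:
            out_b.append(cds_b[3 * j:3 * j + 3]); j += 1
    return ''.join(out_a), ''.join(out_b)

# e.g. back_translate('MT-E', 'MTSE', 'ATGACCGAA', 'ATGACCAGCGAA')
# -> ('ATGACC---GAA', 'ATGACCAGCGAA')
```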

Here, we consider the problem of aligning two CDS while accounting for FS translation, by simultaneously accounting for their nucleotide and AA sequences. The problem has recently regained attention due to the increasing evidence for alternative protein production through FS translation by eukaryotic gene families [18, 19].

The problem was first addressed by Hein et al. [20, 21], who proposed a DNA/protein model in which the score of an alignment between two CDS of lengths n and m is a combination of its score at the nucleotide level and its score at the AA level. They described an \(O(n^2m^2)\) algorithm in [20], later improved to an \(O(nm)\) algorithm in [21], for computing an optimal-score alignment under the constraint that the search space of the problem is restricted. Arvestad [22] later proposed a CDS alignment scoring model with an \(O(nm)\) alignment algorithm accounting for codon structures and FS translations, based on the concept of generalized substitutions introduced in [23]. In this model, the score of a CDS alignment depends on its decomposition into a concatenation of codon fragment alignments, where a codon fragment of a CDS is defined as a substring of length w, \(0 \le w \le 5\). This decomposition into codon fragment alignments makes it possible to define a score of the CDS alignment at the AA level. More recently, Ranwez et al. [18] proposed a simplification of the model of Arvestad limiting the maximum length of a codon fragment to 3. Under this model, an \(O(nm)\) CDS alignment algorithm was described and extended in the context of multiple sequence alignment [18]. In the models of Arvestad [22] and Ranwez et al. [18], several scores may be computed for the same alignment based on different decompositions into codon fragment alignments. The corresponding algorithms for aligning two CDS then consist of computing an optimal-score decomposition of an alignment between the two CDS. This optimal score exclusively accounts for FS translation initiations, i.e., a FS translation in an alignment is penalized by adding a constant FS cost, which only penalizes the initiation of the FS, regardless of the length of the FS translation. However, accounting for FS translation lengths is important for the precision of CDS alignment scores, as these lengths determine the extent of the disruption between the protein sequences.

In this paper, we propose the first alignment algorithm that accounts for both the initiation and the length of FS translations in order to compute the similarity scores of CDS alignments. The remainder of the paper is organized as follows. In the "Motivation" section, we illustrate the importance of accounting for FS translation length when aligning CDS. In the "Preliminaries" section, we give some preliminary definitions and introduce a new CDS alignment scoring model with a self-contained definition of the score of an alignment, penalizing both the initiation and the extension of FS translations. In the "Methods" section, a dynamic programming algorithm for computing an optimal-score alignment between two CDS is described. Finally, in the "Results" section, we present and discuss the results of a comparison of our method with other CDS alignment methods for a pairwise comparison of CDS from homologous genes of ten mammalian gene families.

Motivation: importance of accounting for FS translation length

The two main goals of aligning biological sequences are to evaluate the similarity between the sequences and to identify their similar regions, which are then used for molecular analyses such as evolutionary, functional and structural prediction. In practice, CDS alignment can be used to exhaustively identify the conserved features of a set of proteins. Thus, the definition of CDS similarity must account for sequence conservation and disruption at both the nucleotide and the protein levels.

Figure 1 illustrates the importance of accounting for AA translations and FS translation length in order to compute an adequate similarity score for a CDS alignment. It shows three CDS, Seq1, Seq2 and Seq3. Seq1 has length 45. Seq2 has length 60 and is obtained from Seq1 by deleting the nucleotide 'G' at position 30 and adding 16 nucleotides at the end. Seq3 has length 60 and is obtained from Seq1 by deleting the nucleotide 'G' at position 15 and adding 16 nucleotides at the end.

Fig. 1 Top: the three CDS Seq1, Seq2 and Seq3. Middle: an optimal alignment between Seq1 and Seq2 with a FS translation region of length 15. Bottom: an optimal alignment between Seq1 and Seq3 with a FS translation region of length 30

When looking at the AA translations of Seq1, Seq2 and Seq3, we observe that the similarity between Seq2 and Seq1 is higher than the similarity between Seq3 and Seq1 at the protein level, because Seq1 and Seq2 share a longer AA prefix, "M T E S K Q P W H" (amino acids in black characters in the alignments). However, pairwise CDS alignment algorithms that do not account for the length of FS translations would return the same score for the two optimal alignments of Seq1 with Seq2 and of Seq1 with Seq3 shown in Fig. 1, penalizing only the initiation of one FS translation in both cases (positions marked with a "!" symbol in the alignments) and not penalizing the sequence disruptions at the protein level.

From an evolutionary point of view, a good scoring model for evaluating the similarity between two CDS in the presence of FS translations should therefore penalize not only the initiation of a FS but also the length of the FS extension (amino acids in gray characters in the alignments). The alignment of Seq1 with Seq2 would then have a higher similarity score than the alignment of Seq1 with Seq3.

Preliminaries: score of CDS alignment

In this section, we formally describe a new definition of the score of a CDS alignment that penalizes both the initiation and the extension of FS translations.

Definition 1

[Coding DNA sequence (CDS)] A coding DNA sequence (CDS) is a DNA sequence on the alphabet of nucleotides \(\Sigma_N = \{\texttt{A},\texttt{C},\texttt{G},\texttt{T}\}\) whose length n is a multiple of 3. A coding sequence is composed of a concatenation of \(\frac{n}{3}\) codons, the words of length 3 in the sequence ending at positions 3i, \(1 \le i \le \frac{n}{3}\). The AA translation of a CDS is a protein sequence of length \(\frac{n}{3}\) on the alphabet \(\Sigma_A\) of AA, such that each codon of the CDS is translated into an AA symbol in the protein sequence.

Note that, in practice, an entire CDS begins with a start codon "ATG" and ends with a stop codon "TAA", "TAG" or "TGA".
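As a small illustration of Definition 1 (our sketch; it leans on Biopython's standard-table translation, which is assumed to be available, rather than anything from the paper):

```python
from Bio.Seq import Seq  # Biopython, assumed installed

def aa_translation(cds: str) -> str:
    """AA translation per Definition 1: the CDS length must be a multiple
    of 3, and codon i (1-based) is the word ending at position 3i."""
    if len(cds) % 3 != 0:
        raise ValueError("a CDS has length divisible by 3")
    return str(Seq(cds).translate())

# e.g. aa_translation('ATGGAAAGCTAA') -> 'MES*' (stop codon rendered as '*')
```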

Definition 2

(Alignment between DNA sequences) An alignment between two DNA sequences A and B is a pair \((A',B')\) where \(A'\) and \(B'\) are two sequences of the same length L derived by inserting gap symbols ('-') in A and B, such that \(\forall i,\ 1 \le i \le L\), \(A'[i]\) and \(B'[i]\) are not both gap symbols. A pair \((A'[i],B'[i])\), \(1 \le i \le L\), in the alignment is called a column of the alignment.

Given an alignment \((A',B')\) of length L between two CDS A and B, let S be the sequence \(A'\) or \(B'\). We denote by \(S[k..l]\), \(1 \le k \le l \le L\), the substring of S going from position k to position l. \(|S[k..l]|\) denotes the number of letters in \(S[k..l]\) that are different from the gap symbol ('-'); for example, \(|A'[4..8]| = 3\) means that positions 4 to 8 of \(A'\) contain exactly three non-gap symbols. A codon of A or B is grouped in the alignment \((A',B')\) if its three nucleotides appear in three consecutive columns of the alignment; for example, a codon whose nucleotides lie in columns 4, 5 and 6 is grouped, while one whose nucleotides lie in columns 4, 5 and 7 is not.
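A short sketch (ours, not the paper's code) of the "grouped" notion: given one aligned row, report each complete codon by the column of its last nucleotide, together with whether its three nucleotides sit in consecutive columns:

```python
def grouped_codons(aligned_row: str) -> list[tuple[int, bool]]:
    """For each complete codon of the ungapped sequence underlying this
    aligned row, return (1-based column of its last nucleotide, grouped?),
    where grouped means the codon occupies three consecutive columns."""
    cols = [i + 1 for i, c in enumerate(aligned_row) if c != '-']
    codons = []
    for k in range(0, len(cols) - len(cols) % 3, 3):
        c1, c2, c3 = cols[k:k + 3]
        codons.append((c3, (c2, c3) == (c1 + 1, c1 + 2)))
    return codons

# e.g. grouped_codons('ACC-AT--G') -> [(3, True), (9, False)]
```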

Fig. 2 An alignment \((A',B')\) of length 48 between two CDS, A (13 codons) and B (14 codons). The number arrays indicate the positions of the consecutive alignment columns. Codons of A and B are colored according to the set to which they belong: IM codons in blue, FSext codons in red, InDel codons in green and FSinit codons in black. MFS nucleotides contained in FSinit codons are underlined

In the following, we give our definition of the score of an alignment \((A',B')\) between two CDS A and B. It is based on a partition of the codons of A (resp. B) into four sets depending on the alignment of the codons (see Fig. 2 for an illustration):

The set of In-frame Matching codons (IM) contains the codons that are grouped in the alignment and aligned with a codon of the other CDS.

The set of Frameshift extension codons (FSext) contains the codons that are grouped in the alignment and aligned with a concatenation of three nucleotides that overlaps two codons of the other CDS.

The set of Deleted/Inserted codons (InDel) contains the codons that are grouped in the alignment and aligned with a concatenation of 3 gap symbols.

All other codons constitute Frameshift initiation codons (FSinit). The set of Matching nucleotides in FSinit codons (MFS) contains all the nucleotides belonging to FSinit codons and aligned with a nucleotide of the other CDS.
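Putting the four sets together, a compact sketch (ours; it classifies only A's codons and omits the symmetric pass over B and the extraction of MFS nucleotides) might look like this:

```python
def classify_codons(a_aln: str, b_aln: str) -> dict[str, list[int]]:
    """Partition the complete codons of A into IM / FSext / InDel / FSinit,
    reporting each codon by the 1-based alignment column of its last
    nucleotide (the identification convention used below)."""
    b_idx, n = {}, 0                      # column -> 0-based residue index in B
    for col, c in enumerate(b_aln):
        if c != '-':
            b_idx[col], n = n, n + 1
    a_cols = [i for i, c in enumerate(a_aln) if c != '-']
    sets = {'IM': [], 'FSext': [], 'InDel': [], 'FSinit': []}
    for k in range(0, len(a_cols) - len(a_cols) % 3, 3):
        c1, c2, c3 = a_cols[k:k + 3]
        if (c2, c3) != (c1 + 1, c1 + 2):          # not grouped
            label = 'FSinit'
        else:
            window = b_aln[c1:c3 + 1]
            if window == '---':
                label = 'InDel'
            elif '-' in window:
                label = 'FSinit'
            elif b_idx[c1] % 3 == 0:              # B's triplet is an in-frame codon
                label = 'IM'
            else:                                  # B's triplet straddles two codons
                label = 'FSext'
        sets[label].append(c3 + 1)
    return sets
```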

The following notations and conventions are used in Definition 3 to denote the different sets of codons and nucleotides in A and B. The set of IM codons in A (resp. B) is denoted by \(\texttt{IM}_A\) (resp. \(\texttt{IM}_B\)). The set of FSext codons in A (resp. B) is denoted by \(\texttt{FSext}_A\) (resp. \(\texttt{FSext}_B\)). The set of InDel codons in A (resp. B) is denoted by \(\texttt{InDel}_A\) (resp. \(\texttt{InDel}_B\)). The set of MFS nucleotides in A (resp. B) is denoted by \(\texttt{MFS}_A\) (resp. \(\texttt{MFS}_B\)). In these sets, the codons of A and B are simply identified by the position (column) of their last nucleotide in the alignment. In this case, we always have \(\texttt{IM}_A = \texttt{IM}_B\), as in the example of Fig. 2. The MFS nucleotides are also identified by their positions in the alignment.

In the alignment scoring model described in Definition 3, the substitutions of IM and FSext codons are scored using an AA scoring function \(s_{aa}\), such that aligned codons with silent nucleotide mutations get the same score as identities. A fixed FS extension cost, denoted by fs_extend_cost, is added for each FSext codon. The insertions/deletions of InDel codons are scored by adding a fixed gap cost, denoted by gap_cost, for each InDel codon. Alignments of MFS nucleotides are scored independently of each other, using a nucleotide scoring function \(s_{na}\). The insertions or deletions of nucleotides in FSinit codons are responsible for the initiation of FS translations; they are scored by adding a fixed FS opening cost, denoted by fs_open_cost, for each FSinit codon. Note that, by convention, the values of all penalty costs for gaps and FS (gap_cost, fs_open_cost, fs_extend_cost) are negative. Note also that the scoring scheme assumes that the AA and nucleotide scoring functions, \(s_{aa}\) and \(s_{na}\), are symmetric.

Definition 3

(Score of an alignment) Let \((A',B')\) be an alignment of length L between two CDS A and B. The score of the alignment \((A',B')\) is defined by:
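Combining the scoring components described above, the score takes the following shape (a hedged reconstruction under assumed notation, with \(a_k\) and \(b_k\) the AA translations of the triplets aligned at column k; the authors' exact formula may differ):

\[
\begin{aligned}
\mathrm{score}(A',B') ={} & \sum_{k \in \texttt{IM}_A} s_{aa}(a_k,b_k)
 + \sum_{k \in \texttt{FSext}_A \cup \texttt{FSext}_B} s_{aa}(a_k,b_k)
 + \sum_{k \in \texttt{MFS}_A \cup \texttt{MFS}_B} s_{na}(A'[k],B'[k]) \\
 & + \bigl(|\texttt{InDel}_A| + |\texttt{InDel}_B|\bigr)\cdot \mathit{gap\_cost}
 + \bigl(|\texttt{FSinit}_A| + |\texttt{FSinit}_B|\bigr)\cdot \mathit{fs\_open\_cost} \\
 & + \bigl(|\texttt{FSext}_A| + |\texttt{FSext}_B|\bigr)\cdot \mathit{fs\_extend\_cost}
\end{aligned}
\]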


Acknowledgements

This work was supported by Science Foundation Ireland (Grant number: 07/IN.1/B1783).

Author contributions: DGH initiated the project; the original ClustalW software was produced by JDT, DGH and TJG. The HHalign code was written by JS and is maintained by MR. The idea of EPA was suggested by KK. FS adapted HHalign to Clustal Omega; AW devised and adapted guide tree construction routines. DD parallelised the code. AW, DD and FS carried out all software development. The EBI server was set up by RL, HMcW and WL. Benchmarking was carried out by JDT, AW, DD and FS. All authors contributed to the manuscript.

