I'm trying to predict the protein secondary and 3-D structure for the sequence [Q1NN20] and need some help getting the ball rolling.
I'm confused about how and when to use Jpred, SWISS-MODEL, and the PDB. So far, I got the sequence from ScanProsite and put it into SWISS-MODEL, but what next? When does Jpred come in?
Use the PDB to identify structures that are similar to your sequence (you can use BLAST to search the PDB). A sequence identity of 30% or above is usually acceptable, and multiple alignments are of course useful at lower identities. If structures exist that are similar enough, you can use homology modelling to generate a 3D structure (this is what the SWISS-MODEL server does, and I think it automates the BLAST search against the PDB). If there are no similar structures, you can resort to ab initio modelling if you have a reasonably straightforward globular domain; otherwise you might need to draw in additional expertise.
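If it helps to make the 30% rule of thumb concrete, here is a minimal sketch of how percent identity can be computed from a pairwise alignment. The function names and the choice of denominator (aligned, non-gap columns) are my own assumptions; BLAST and other tools may define identity slightly differently.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over the aligned (non-gap) columns of a pairwise alignment."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # gap columns are not counted
        aligned += 1
        if q == t:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def template_is_usable(identity_percent: float, threshold: float = 30.0) -> bool:
    """Apply the rule-of-thumb cutoff from the answer above (threshold is adjustable)."""
    return identity_percent >= threshold


if __name__ == "__main__":
    identity = percent_identity("MKT-AYIAK", "MKTQAY-AR")
    print(round(identity, 1), template_is_usable(identity))
```

The cutoff is only a guide: alignment length and coverage matter as well, so treat a value near 30% with caution.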
There are so many specific factors that need to be taken into account for a project like this. Depending on what the model will be used for, different questions are relevant.
You will need a few different secondary structure predictors to make it convincing and then you will need to check that your secondary consensus matches your 3D structure. If they don't match you need to think about "why not?"
If there is a difference it usually indicates that some 3D interaction is involved in the folding, so your 3D model usually trumps the sequence structure predictions, but do check your 3D structure against the secondary structure predictions.
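A rough sketch of the consensus-and-compare step described above, assuming each predictor returns a string of H/E/C states of the same length as the sequence. The majority-vote rule and the fallback to coil when there is no strict majority are my own simplifications, not what any particular server does.

```python
from collections import Counter


def consensus_ss(predictions):
    """Per-residue majority vote over equal-length strings of H/E/C states."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        # Assumption: without a strict majority, fall back to coil ("C").
        consensus.append(state if 2 * votes > len(column) else "C")
    return "".join(consensus)


def disagreements(consensus, model_ss):
    """0-based positions where the consensus differs from the 3D model's assignment."""
    return [i for i, (a, b) in enumerate(zip(consensus, model_ss)) if a != b]


if __name__ == "__main__":
    preds = ["HHHHCCEEE", "HHHCCCEEE", "HHHHCEEEC"]
    cons = consensus_ss(preds)
    print(cons, disagreements(cons, "HHHCCCEEE"))
```

Positions returned by `disagreements` are the ones worth inspecting in the 3D model before deciding which source to trust.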
Secondary structure prediction is more useful when a 3D structure is not available and modelling one is not an option.
SWISS-MODEL is an online tool for modelling protein tertiary and quaternary structure using evolutionary information. Jpred and SWISS-MODEL are both pretty straightforward tools that require only the sequence. SWISS-MODEL searches for a template, and the protein is then modelled on that template. Jpred is used exclusively for secondary structure prediction.
Applied Mycology and Biotechnology
Manoj Bhasin , G.P.S. Raghava , in Applied Mycology and Biotechnology , 2006
5. Protein Structure Prediction
Knowledge of protein three-dimensional structure or tertiary structure (3D) is a basic prerequisite for understanding the function of a protein. Currently, the main techniques used to determine protein 3D structure are X-ray crystallography and nuclear magnetic resonance (NMR). In X-ray crystallography the protein is crystallized and then using X-ray diffraction the structure of protein is determined. Determination of 3D structure by X-ray crystallography is not always straightforward and sometimes takes as much as three to five years. NMR is another useful technique to determine the protein structure. The advantage of NMR over X-ray crystallography is that the protein can be studied in an aqueous environment that may resemble its actual physiological state more closely. The main limitation of NMR is that it is only suitable for small proteins that have less than 150 amino acids. The gap between known protein sequences and the known protein structure is increasing exponentially. Thus, there is a need to develop the computational techniques to predict protein structures. Computer-aided protein conformation/tertiary structure prediction could facilitate i) the prediction of tertiary structures for proteins with known sequences and unknown structures, ii) understanding of protein folding, iii) engineering of proteins so that new functions may be incorporated, and iv) drug designing.
The problem of protein structure prediction has been approached through three main routes: i) computer simulation based on empirical energy calculations, ii) knowledge-based approaches using information derived from structure-sequence relationships in experimentally determined protein 3D structures, and iii) hierarchical methods. Each approach has its merits and limitations.
5.1. Energy Minimization Based Methods
Protein structure predictions based on energy minimization methods are rooted in the observation that native protein structures correspond to a system at thermodynamic equilibrium with a minimum free energy. Energy-based methods do not make a priori assumptions about the coding properties of amino acids. Rather, they attempt to locate the global minimum of the free energy surface of the protein molecule, which is assumed to correspond to the native conformation of the molecule. Methods based on the principle of energy minimization can be classified broadly into two categories: i) static minimization methods and ii) dynamical minimization methods. The major software packages based on energy minimization are AMBER, CHARMM, ECEPP and GROMOS (Pearlman et al. 1995; van Gunsteren and Berendsen 1990; Brooks et al. 1990). Energy calculations offer the advantage of being based on physicochemical principles but are hampered by the large number of degrees of freedom to be considered and the limited performance of energy functions. There are essentially two major problems with methods based on energy calculations. First, the computations required for assigning protein structure based on energy minimization are beyond the reach of presently available computers. Second, the interaction potentials used for such calculations are not good enough to model the native structure of a protein at atomic detail (Somorjai 1990).
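The idea of static minimization can be illustrated on a toy system. The sketch below is not how AMBER, CHARMM, ECEPP or GROMOS are implemented; it assumes a single Lennard-Jones pair interaction in reduced units and a plain steepest-descent update, with all function names and parameter values chosen for illustration only.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative of the Lennard-Jones energy with respect to the distance r."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * s6 * s6 + 6.0 * s6) / r


def steepest_descent(r0, step=0.01, tol=1e-10, max_iter=100000):
    """Move downhill along the gradient until it (nearly) vanishes."""
    r = r0
    for _ in range(max_iter):
        g = lj_gradient(r)
        if abs(g) < tol:
            break
        r -= step * g
    return r


if __name__ == "__main__":
    r_min = steepest_descent(1.5)
    print(r_min, lj_energy(r_min))  # analytic minimum lies at r = 2**(1/6) * sigma
```

A protein has thousands of coupled degrees of freedom and many local minima, which is why the one-dimensional convergence seen here does not carry over to locating the global minimum of a real molecule.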
5.2. Knowledge Based Approaches
5.2.1. Homology modeling
Presently, homology modeling is the most powerful method for predicting the tertiary structure of proteins in cases where a query protein has sequence similarity to a protein with known atomic structure (Blundell et al. 1987; Sali et al. 1990; Sutcliffe et al. 1987). These methods are based on the observation that structures are more conserved than sequences. Therefore, an accurate molecular model of a protein may be constructed by assigning a conformation that is based on sequence alignment, followed by model building and energy minimization. Due to the availability of plentiful genome sequence data, the number of protein sequences is increasing at an exponential rate, and the gap between the number of sequences and their corresponding structures is widening. Therefore, construction of protein models is becoming an increasingly important technique (Orengo et al. 1992). The first crucial step in homology modeling involves generation of a structure-based alignment between the query protein and the sequence with known three-dimensional structure (Pascarella and Argos 1992). For cases of low homology (less than 20% identity), the quality of the optimal alignments produced by automatic methods is often poor. A conceptually different approach to homology modeling is based on distance geometry. In this approach, the tertiary template restrictions are translated into distance restraints that are used as input for distance geometry programs (Havel and Snow 1991; Sali and Blundell 1993). Homology-based modeling approaches fail in the absence of homologous structures.
5.2.2. Threading Approach
The concept of threading protein sequences through alternative folding motifs involves the construction of misfolded model structures, where an incorrect sequence is deliberately built onto the backbone of another protein. Threading a sequence through a fold requires a specific alignment between the amino acid sequence of the protein under consideration and the corresponding amino acid residue positions of the folding motif. The known structure establishes a set of possible amino acid positions in three-dimensional space. The query sequence is fitted to the known structure by placing its amino acids into their aligned positions. The primary aim of these methods is to select the most probable fold for a given sequence or to recognize suitable sequences that might fold into a given structure. The threading method is normally applied only to proteins whose amino acid sequences adopt one of the protein folds previously studied by experimental techniques. The success of threading depends on the number of available folds whose structures are known at a level of atomic detail. When the atomic structure of a fold is known, a query protein sequence can be fitted to that fold.
5.3. Hierarchical Approach
An alternate strategy for prediction of protein structures from their amino acid sequences uses the hierarchy of protein structure from primary to secondary and secondary to tertiary. An intermediate step in understanding the relationship between amino acid sequence and tertiary structure is to predict an intermediate state such as the secondary structure of a protein. This procedure involves constructing a model for the secondary structure from amino acid sequence data and using the secondary structure model to build a tertiary structure prediction. A number of algorithms have been developed for secondary structure modeling of proteins. Presently available methods can be classified into i) statistical methods, ii) physicochemical methods, iii) artificial intelligence (AI) based methods, iv) evolutionary information based methods, and v) combinatorial methods (Rost 1996; McGuffin et al. 2000; Cuff et al. 1998). Unfortunately, the prediction accuracy of secondary structures from sequence information is only about 80%. In using secondary structure models to predict tertiary structures, attempts have been made to predict tight turns and super-secondary structures in addition to helices, turns, sheets and strands (Kaur and Raghava 2003a; Kaur and Raghava 2003b; Kaur and Raghava 2004).
5.4. Benchmarking of Structure Prediction Methods
A major problem in the field of protein structure prediction is to assess the performance of existing methods. Methods have been developed using different sets of proteins and using different criteria for evaluation. In order to assist developers and users, an open worldwide experiment called the Critical Assessment of Techniques for Protein Structure Prediction (CASP) was initiated in 1994. CASP experiments aim to establish the current state of the art in protein structure prediction by identifying what progress has been made and highlighting where future efforts may be most productively focused. These experiments are held in alternate years, and the sixth CASP was initiated in December 2004 ( http://PredictionCenter.llnl.gov/casp6 ). In addition to CASP, a number of other experiments were initiated to assess the performance of structure prediction methods, such as the Critical Assessment of Fully Automatic Structure Prediction Servers (CAFASP) and the Evaluation of Automatic protein structure predictions (EVA). These experiments allow evaluation of online web servers for protein structure prediction. Table 8 lists major software and web servers for protein structure prediction.
Table 8. A list of major software packages for protein structure prediction.
|Software Program||Use or Function||URL (Reference)|
|PHD||A method for sequence analysis and structure prediction||http://www.embl-heidelberg.de/predictprotein/predictprotein.html Rost 1996 .|
|APSSP2||Advanced protein secondary structure prediction server.||http://www.imtech.res.in/raghava/apssp2/|
|PSIPRED||Allows prediction of protein secondary structure, topology of transmembrane domains and fold prediction.||http://bioinf.cs.ucl.ac.uk/psipred/ McGuffin et al. 2000 .|
|JPred||A consensus method for predicting protein secondary structure.||http://www.compbio.dundee.ac.uk/~www-jpred/ Cuff et al. 1998 .|
|BetaTPred2||Predicts beta turns in proteins from multiple alignments using neural networks.||http://www.imtech.res.in/raghava/betatpred2 Kaur and Raghava 2003a .|
|GammaPred||Predicts gamma turns in proteins from multiple alignments using neural networks.||http://www.imtech.res.in/raghava/gammmapred Kaur and Raghava 2003b .|
|AlphaPred||Predicts alpha turns in proteins from multiple alignments using neural networks.||http://www.imtech.res.in/raghava/alphapred Kaur and Raghava 2004 .|
|SWISS-MODEL||An automated comparative protein modeling server.||http://www.expasy.org/swissmod/SWISS-MODEL.html Peitsch et al. 1995 .|
|Geno3D||Automatic modeling of protein three-dimensional structures.||http://geno3d-pbil.ibcp.fr/ Combet et al. 2002 .|
|CPH MODELS||Fold recognition/homology modeling.||http://www.cbs.dtu.dk/services/CPHmodels/|
|Meta Fold Recognition Server||Allows submission to multiple servers.||http://bioinfo.pl/Meta/ Ginalski et al. 2003 .|
|HMMSTR||Predicts the secondary, local, super secondary, and tertiary structures of proteins from sequences.||http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php Bystroff and Shao 2002 .|
|AMBER||A set of molecular mechanics force fields for the simulation of biomolecules.||http://amber.scripps.edu/ Pearlman et al. 1995 .|
|CHARMM||A set of programs for molecular simulation.||( van Gunsteren and Berendsen 1990 ).|
Automated 3D RNA structure prediction using the RNAComposer method for riboswitches
Understanding the numerous functions of RNAs depends critically on the knowledge of their three-dimensional (3D) structure. In contrast to the protein field, a much smaller number of RNA 3D structures have been determined using X-ray crystallography, NMR spectroscopy, and cryomicroscopy. This has led to a great demand to obtain RNA 3D structures using prediction methods. 3D structure prediction, especially of large RNAs, remains a significant challenge, and there is still a great demand for high-resolution structure prediction methods. In this chapter, we describe RNAComposer, a method and server for the automated prediction of RNA 3D structures based on the knowledge of secondary structure. Its applications are supported by other automated servers, RNA FRABASE and RNApdbee, developed to search and analyze secondary and 3D structures. Another method, RNAlyzer, offers a new way to analyze and visualize the quality of RNA 3D models. The scope and limitations of RNAComposer in the automated prediction of riboswitch 3D structures will be presented and discussed. Analysis of the cyclic di-GMP-II riboswitch from Clostridium acetobutylicum (PDB ID 3Q3Z) as an example allows for 3D structure prediction of related riboswitches from Clostridium difficile 4, Bacillus halodurans 1, and Thermus aquaticus Y5.1, whose structures are as yet unknown.
Keywords: 3D structure RNA RNAComposer Riboswitches Structure prediction c-di-GMP-II riboswitch.
SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity
Motivation: Accurately predicting protein secondary structure and relative solvent accessibility is important for the study of protein evolution, structure and function and as a component of protein 3D structure prediction pipelines. Most predictors use a combination of machine learning and profiles, and thus must be retrained and assessed periodically as the number of available protein sequences and structures continues to grow.
Results: We present newly trained modular versions of the SSpro and ACCpro predictors of secondary structure and relative solvent accessibility together with their multi-class variants SSpro8 and ACCpro20. We introduce a sharp distinction between the use of sequence similarity alone, typically in the form of sequence profiles at the input level, and the additional use of sequence-based structural similarity, which uses similarity to sequences in the Protein Data Bank to infer annotations at the output level, and study their relative contributions to modern predictors. Using sequence similarity alone, SSpro's accuracy is between 79 and 80% (79% for ACCpro) and no other predictor seems to exceed 82%. However, when sequence-based structural similarity is added, the accuracy of SSpro rises to 92.9% (90% for ACCpro). Thus, by combining both approaches, these problems appear now to be essentially solved, as an accuracy of 100% cannot be expected for several well-known reasons. These results point also to several open technical challenges, including (i) achieving on the order of ≥ 80% accuracy, without using any similarity with known proteins and (ii) achieving on the order of ≥ 85% accuracy, using sequence similarity alone.
Materials and Methods
Modeling distance restraints
Our approach to multi-template homology modeling is based on the statistical approach to homology modeling introduced by MODELLER. Our software computes improved spatial restraints and calls the MODELLER software, which then reads in the restraints and finds a structure that optimally satisfies these restraints. We briefly recall MODELLER's approach to homology modeling here.
MODELLER's maximum likelihood approach to homology modeling.
MODELLER proceeds in two steps to compute a model structure for a query sequence that is aligned to a set of templates with known structures. In the first step, it generates a list of hundreds of thousands of restraints for the distance between pairs of atoms in the query, based on the distance of corresponding atoms in the templates. For example, if residue i of the query q is aligned to residue i′ of a template t and similarly j is aligned to j′, then the distance d between the Cα atoms of residues i and j in q will be restrained to be similar to the known distance dt between the Cα atoms of residues i′ and j′ in t (Fig 1). In statistics, a restraint is described as a probability density function p(d), and in MODELLER this distance restraint is modelled by a Gaussian function with mean dt. The standard deviation of the Gaussian describes the expected deviation of the distance d from dt. Distance restraints are generated for each pair of residues (i, j) for which aligned residues i′ and j′ exist and for various combinations of atom types for which equivalent atoms exist in the aligned template residues, e.g. Cα − Cα, N − O, Cα − Cγ etc.
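The restraint-generation step can be sketched as follows. This is an illustration under simplifying assumptions, not MODELLER's implementation: only Cα atoms are considered, the alignment is given as a hypothetical dictionary from query to template residue indices, and the standard deviation is passed in as a plain number rather than derived from alignment features.

```python
import math


def ca_distance_restraints(alignment, template_ca):
    """Build (i, j, d_t) restraints for all aligned residue pairs.

    alignment:   dict mapping query residue index i -> template residue index i'
    template_ca: dict mapping template residue index -> (x, y, z) of its C-alpha atom
    """
    pairs = sorted(alignment.items())
    restraints = []
    for a in range(len(pairs)):
        i, i_t = pairs[a]
        for b in range(a + 1, len(pairs)):
            j, j_t = pairs[b]
            d_t = math.dist(template_ca[i_t], template_ca[j_t])
            restraints.append((i, j, d_t))
    return restraints


def gaussian_restraint_energy(d, d_t, sigma):
    """Negative log of an unnormalised Gaussian restraint centred on d_t."""
    return 0.5 * ((d - d_t) / sigma) ** 2


if __name__ == "__main__":
    coords = {10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 13: (3.8, 3.8, 0.0)}
    for restraint in ca_distance_restraints({1: 10, 2: 11, 3: 13}, coords):
        print(restraint)
```

Unaligned query residues never appear in `alignment` and therefore receive no template-derived restraint in this sketch.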
In the second step, MODELLER uses stochastic optimisation to find the model structure for the query sequence that maximises the likelihood. The likelihood is the probability of the data, i.e. the alignment and template structures, given the model structure. When a single template is used for modeling, MODELLER approximates the likelihood as the product of the probability density functions over all restraints. Although this approximation corresponds to assuming the independence of all restraints, it has turned out to work well in practice.
Sali and Blundell  observed that the expected deviation d − dt depended on (1) the fraction of identically aligned residues between the two sequences, (2) the average solvent accessibility of the two aligned residue pairs (i, i′) and (j, j′), (3) the average distance of i, i′, j and j′ from a gap, and (4) the distance dt. They modelled the standard deviation of the Gaussian restraint as functions of the four discretized variables. To fit these functions, they analysed a large set of structurally aligned, homologous proteins for which they measured the distances d = dij and dt = di′j′ between equivalent atoms in two pairs of structurally aligned residues, (i, i′) and (j, j′). Four different functions are trained, one for each of the following combinations of atom types: Cα − Cα, N − O, side chain—main chain, side chain—side chain.
New distance restraints that account for alignment errors.
Because the analysis by Sali and Blundell relied on structurally alignable residue pairs in structure-based alignments, the alignments were basically free of errors, and therefore the distance in the query was always similar to the distance in the template. In practice, the sequence alignment will contain errors, and i and i′ (or j and j′) might not be homologous to each other. In this case, dt does not contain information about d, and the two distances may be vastly different. When the pairs of residues (i, i′) and (j, j′) are sampled from real sequence alignments, this may lead to a stark deviation of the distance distribution from a Gaussian.
Fig 2(A)–2(C) shows distributions of log(d) − log(dt) for sets of residue pairs (i, i′) and (j, j′) sampled from alignments with successively lower quality. In Fig 2A only very reliable alignments have been sampled, with a posterior probability (pp) for (i, i′) and (j, j′) to be correctly aligned larger than 0.9 and with a sequence similarity (sim) above 0.75 bits per aligned pair. (See the Supporting Information for the definition of pp and sim.) Consequently, the empirical density distribution over log(d) − log(dt) has a single peak and is well fitted by a single Gaussian. However, when the alignment quality deteriorates, as shown in Fig 2B and 2C, a second component in the distribution manifests itself. It stems from residues (i, i′) and (j, j′) for which either (i, i′) or (j, j′) or both are not homologous. These data points thus contribute a background distribution that does not depend on the distance dt in the template.
Fig 2. The background component originates from pairs of residues with an alignment error. The plots show the empirical distribution of log d − log dt = log dij − log di′j′ for thousands of sampled pairs of residues (i, i′), (j, j′) from real, error-containing pairwise sequence alignments generated with HHalign. The two-component Gaussian mixture distribution predicted by the mixture density network in Fig 3B is plotted in red. From (A) to (C), the reliability of the alignments at (i, i′) and (j, j′) (as measured by pp and sim values) decreases. Consequently, the weight of the background component increases at the expense of the signal component. (D) Same as (C) but showing the distribution of N − O distances instead of Cα − Cα distances.
These observations motivated us to model the restraint function p(log d∣ log dt, pp, sim) = p(log d∣θ) using a two-component Gaussian mixture distribution (see Fig 3A and Eq 1) whose means, standard deviations and mixture weight w depend on θ = (log dt, pp, sim) or θ′ = (pp, sim). The mixture weight w(θ) can be regarded as the probability that both (i, i′) and (j, j′) are correctly aligned. Locally unreliable alignments will lead to a stronger background component and hence to softer distance restraints. Note that, because distances cannot be negative, they are not well modelled by Gaussian distributions, whose left tail can penetrate into the negative domain. We therefore modeled the distribution of log d instead of d.
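Eq (1) is not reproduced in this text. From the parameters named in the surrounding description (w, μ, σ, μbg, σbg), a plausible reading is a signal Gaussian weighted by w plus a background Gaussian weighted by 1 − w, both over log d. The sketch below implements that reading; the function names are hypothetical, and the parameters are passed in directly rather than predicted from θ by a trained network.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_restraint_density(log_d, w, mu, sigma, mu_bg, sigma_bg):
    """Assumed form of Eq (1): w * N(signal) + (1 - w) * N(background) over log d."""
    signal = w * normal_pdf(log_d, mu, sigma)
    background = (1.0 - w) * normal_pdf(log_d, mu_bg, sigma_bg)
    return signal + background


if __name__ == "__main__":
    # Illustrative parameter values only; a narrow signal on a broad background.
    for log_d in (2.0, 2.3, 2.6, 3.5):
        print(log_d, mixture_restraint_density(log_d, 0.7, 2.3, 0.1, 2.5, 0.5))
```

Lowering w in this sketch flattens the peak around μ and leaves mostly the broad background, which corresponds to the softer restraints described for unreliable alignments.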
Fig 3. (A) Illustration of the two-component Gaussian mixture distribution in Eq (1). (B) Mixture density network to predict the parameters (w, μ, σ, μbg, σbg) of the Gaussian mixture distribution given the three variables θ = (log dt, pp, sim) (dt: distance in template, pp: posterior probability for both aligned residue pairs to be correctly aligned, sim: sequence similarity). Since the background component does not depend on dt, the nodes for μbg and σbg are only connected to the two lowest hidden nodes that are not connected to log dt.
Mixture density networks.
To predict the five parameters of the Gaussian mixture distribution in Eq (1), we trained four mixture density networks, one for each combination of atom types listed above. A mixture density network is a special kind of neural network that learns the optimum adaptive functions for predicting the parameters of a Gaussian mixture distribution. It is trained by maximizing the likelihood of a set of training data that consists of the input features together with the value log d whose distribution should be learned. We used the R package netlabR to implement a mixture density network with five hidden nodes as illustrated in Fig 3(B). As input features we used θ = (log dt, pp, sim). The local alignment quality pp(i, j) and the global BLOSUM62 sequence similarity sim are parsed from the output of HHalign in the hh-suite package, a widely used software for remote homology detection and sequence alignment (see Fig 8, green points). The set of three features was obtained by starting from a more redundant set of alignment features described in Table B in S1 Text and successively eliminating features whose omission did not significantly deteriorate the likelihood on the training set (in particular probability and raw score).
Combining restraints from multiple templates.
When several templates cover residues i and j of the query, the restraints on the distance d of atoms in residues i and j from those templates have to be combined. Multiplying the restraint functions as probability theory would suggest (see below) would not work in MODELLER's case. When one of the restraints is wrong, due to an alignment error for instance, the restraint function of the incorrect restraint would severely distort the model structure, because the probability density of its single-component Gaussian falls off very fast with increasing distance from its mean, which effectively forbids any gross violation of the restraint. Therefore, MODELLER resorts to a heuristic to estimate the probability density p(d∣d1, d2) resulting from the restraints of two templates t1, t2 with corresponding distances d1 and d2: it adds both probability densities p(d∣d1) and p(d∣d2) (Fig 4A) using weights α(s1) and α(s2) (Eq 2). Here s1 and s2 measure the average sequence similarity in the sequence neighbourhoods around the two pairs of aligned residues from q and t1 and from q and t2, respectively. The optimum functions α(s1), α(s2) were found by training on a large number of structurally aligned triplets of proteins q, t1, t2.
Fig 4. (A) In MODELLER, two restraint functions (green and blue) are additively mixed with mixing weights that have to be learned on a set of triples of aligned protein structures. (B) Our new restraints are multiplied instead of being added. The background component ensures that the restraint function becomes constant and the restraint thus becomes inactive (i.e. ignored) when the distance d is far from the distance in the template. (C) MODELLER's additive mixing leads to a total restraint function that is wider than any of the single-template restraints, not narrower as it should be. (D) The multiplication of restraint functions according to probability theory leads to the desired behaviour of the total restraint function becoming more pointed with each restraint. Note that our new restraints are expressed as odds instead of densities (see also Eq 6).
This heuristic approach leads to undesirable behaviour, as illustrated in Fig 4A and 4C. According to elementary statistical principles, a restraint function for a distance d based on restraints from multiple templates should contain more information and be more sharply resolved than any single-template restraint function. However, the additive mixture density restraint in Eq (2) is wider, not narrower, than any single restraint.
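The contrast between additive and multiplicative combination can be checked with the closed-form variances of a two-component Gaussian mixture and of a normalised product of two Gaussians. The sketch below uses single-component Gaussians and equal mixing weights purely for illustration; the function names and numbers are assumptions, not values from the paper.

```python
def mixture_variance(m1, s1, m2, s2, a1=0.5, a2=0.5):
    """Variance of a1*N(m1, s1^2) + a2*N(m2, s2^2), assuming a1 + a2 = 1."""
    mean = a1 * m1 + a2 * m2
    return a1 * (s1 ** 2 + (m1 - mean) ** 2) + a2 * (s2 ** 2 + (m2 - mean) ** 2)


def product_variance(s1, s2):
    """Variance of the normalised product of two Gaussians (precisions add)."""
    return 1.0 / (1.0 / s1 ** 2 + 1.0 / s2 ** 2)


if __name__ == "__main__":
    # Two templates suggesting distances of 10 and 11 with unit standard deviation.
    print("additive:", mixture_variance(10.0, 1.0, 11.0, 1.0))
    print("multiplicative:", product_variance(1.0, 1.0))
```

With these inputs the additive mixture has a larger variance than either single restraint, while the product has a smaller one, in line with the behaviour described for Fig 4C and 4D.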
The new two-component distance restraints allow us to apply the rules of probability to combine the information from the two templates. By Bayes' theorem we obtain Eq (3). If the information in the templates was approximately conditionally independent given d, i.e., p(d1, d2∣d) ≈ p(d1∣d) p(d2∣d), we would obtain Eq (4), where Bayes' theorem was applied to each factor in the second step.
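Eqs (3) and (4) are not reproduced in this text. A reconstruction that follows directly from the steps named above (Bayes' theorem, then conditional independence, then Bayes' theorem applied to each factor) is given below; the notation is assumed and may differ from the original.

```latex
\begin{align}
p(d \mid d_1, d_2) &= \frac{p(d_1, d_2 \mid d)\, p(d)}{p(d_1, d_2)} \tag{3} \\
p(d \mid d_1, d_2) &\propto p(d_1 \mid d)\, p(d_2 \mid d)\, p(d)
  \;\propto\; \frac{p(d \mid d_1)}{p(d)} \, \frac{p(d \mid d_2)}{p(d)} \, p(d) \tag{4}
\end{align}
```

In the last step each factor p(dk∣d) is replaced by p(d∣dk) p(dk) / p(d), and the terms p(dk) are absorbed into the proportionality constant because they do not depend on d.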
In practice, the query and templates are related to each other through evolution along phylogenetic trees, and conditional independence cannot be assumed. We therefore approximate the dependence among the templates by weighting their odds ratios with weights wk ∈ [0, 1]. This method is analogous to weighting sequences according to their similarity with other sequences in a multiple sequence alignment in order to compute a sequence profile or some other family-dependent features. We will derive a method to determine optimal template-specific weights wk in the following subsection. The previous formula can then be generalised to K templates, giving Eq (5). Here, p(d) is the probability independent of any template, i.e., the background distribution. According to Eq (1), the restraint functions then take the form of Eq (6) (for the sake of brevity we omit θ and θ′).
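Eqs (5) and (6) are likewise missing from this text. One reconstruction consistent with the description (weighted odds ratios, a background p(d) equal to the background component of the mixture, and tails that approach 1 − w) is the following; it is an assumption about the original notation, not a quotation.

```latex
\begin{align}
p(d \mid d_1, \ldots, d_K) &\propto p(d) \prod_{k=1}^{K}
  \left( \frac{p(d \mid d_k)}{p(d)} \right)^{w_k} \tag{5} \\
\frac{p(d \mid d_k)}{p(d)} &=
  w \, \frac{\mathcal{N}(\log d \mid \mu, \sigma)}
            {\mathcal{N}(\log d \mid \mu_{\mathrm{bg}}, \sigma_{\mathrm{bg}})}
  + (1 - w) \tag{6}
\end{align}
```

Note that w in Eq (6) is the mixture weight of Eq (1), whereas wk in Eq (5) is the template weight; the two are distinct quantities.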
Note that the ratio of the two Gaussians is again a Gaussian, because subtracting two quadratic functions of d again yields a quadratic function. Fig 4B and 4D illustrate how restraints from multiple templates are combined under our new statistical approach and that this leads to the expected desirable behaviour of the total restraint restraining more strongly than the one-component restraints.
Dividing by the background has two effects. First, it prevents the background from becoming dominant when the individual background components of all P(d∣dk) are multiplied. Second, the negative logarithm of MODELLER's distance restraint is quadratic in d, and hence unsatisfiable restraints can lead to extreme values during optimization. Dividing by the background avoids this quadratic increase because P(d∣dk)/P(d) has flat tails where it approaches a constant (1 − w). In cases of incorrect alignments with a wrong distance dt in the template, the restraint will not disrupt the query's model structure, as d will be pulled away from dt into the flat region of the restraint. Combining two-component distance restraints as shown in Fig 4D thus reinforces consistent restraints while avoiding distortions from incorrect restraints.
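The flat-tail behaviour can be demonstrated numerically. The sketch assumes the odds take the form w · N(signal)/N(background) + (1 − w), which matches the stated limit of 1 − w but is a reconstruction rather than the original Eq (6); it also assumes σ < σbg so that the Gaussian ratio decays in the tails. All names and parameter values are illustrative.

```python
import math


def restraint_odds(log_d, w, mu, sigma, mu_bg, sigma_bg):
    """Assumed odds form: w * N(signal) / N(background) + (1 - w)."""
    z = (log_d - mu) / sigma
    z_bg = (log_d - mu_bg) / sigma_bg
    # Ratio of the two normal densities, evaluated in one exponent for stability.
    ratio = (sigma_bg / sigma) * math.exp(-0.5 * z * z + 0.5 * z_bg * z_bg)
    return w * ratio + (1.0 - w)


def combined_log_odds(log_d, restraints, weights):
    """Weighted sum of log odds over templates (log of the product of odds**w_k)."""
    return sum(wk * math.log(restraint_odds(log_d, **r))
               for r, wk in zip(restraints, weights))


if __name__ == "__main__":
    good = dict(w=0.8, mu=2.3, sigma=0.1, mu_bg=2.5, sigma_bg=0.5)
    wrong = dict(w=0.8, mu=4.0, sigma=0.1, mu_bg=2.5, sigma_bg=0.5)
    print(combined_log_odds(2.3, [good, good], [1.0, 1.0]))   # consistent restraints reinforce
    print(combined_log_odds(2.3, [good, wrong], [1.0, 1.0]))  # wrong one adds only log(1 - w)
```

The penalty contributed by the inconsistent restraint is bounded by log(1 − w), so it cannot drag the optimum away from the consistent restraint the way a quadratic penalty would.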
Running MODELLER with the new distance restraints.
After having picked a set of templates, we run the MODELLER (version 9.10) automodel.homcsr(0) command, which generates a file with the list of restraints from the query-template alignment. We parse the list of restraints and replace each template-dependent distance restraint (which is either a Gaussian function for a single-template restraint or a Gaussian mixture for a multi-template restraint) with a set of our own distance restraints, one for each template. For this purpose, we added a restraint function that computes the logarithm of Eq (6) to MODELLER. All template-independent restraints such as main chain and side chain dihedral angle restraints, bond lengths etc. are left unchanged. We run MODELLER with the modified restraints list to generate a 3D model.
As a motivation for the template weighting scheme, consider the case shown in Fig 5A. Giving all three templates the same weight ignores the dependencies described by the tree. Template t3 should get a weight of 1, since conditioned on q it is independent of the other two templates. But templates t1 and t2 should get weights clearly smaller than 1, since they do not contribute independent information to d. On the other hand, they are not identical and hence should receive a weight clearly larger than 0.5. But how do we compute the exact optimum weights wk for templates 1, …, K given a phylogenetic tree with known edge lengths?
Fig 5. (A) Templates t1 and t2 are closely related and should be down-weighted with respect to t3. (B) Any tree with a structure at an internal node with unknown distance dh to which all templates are connected in a star-like topology (top) can be transformed into an equivalent tree (bottom) with star-like topology, where equivalence means that the restraint on the distance d0 of the top node is the same for both trees. τ1, …, τK indicate evolutionary distances. (C) Iterative restructuring of a phylogenetic tree. In each step, the basic transformation from Fig 5B is applied to the subtree colored in blue. Weights and edge lengths get updated until all templates are directly connected to the query.
We begin by rooting the phylogenetic tree at the query, and giving its leaf nodes initial weights of 1. By iteratively applying the elementary step in Fig 5B to subtrees, we can transform a tree with arbitrary topology into a tree with a star-like topology, as shown in Fig 5C. At each step, one inner node is removed and the procedure continues until all template leaves are directly connected to the query. At each step, we simply need to update the template weights to obtain the final weights wk for the star-like tree. In the star-like tree which we finally obtain, all template distances dk are conditionally independent, and hence we obtain for the odds ratio the result in Eq 5, using the final weights wk from this iterative process.
For the elementary step, we will show that the upper (sub)tree in Fig 5B yields exactly the same odds ratio for d0 as the transformed, star-topology tree below it (Eq 7), provided the new weights are chosen according to Eq 8. The updated weights are proportional to the old wk, with a proportionality factor approaching 1 for τ0 ≪ τk. The sum of weights over all K templates goes to 1 for τ0 ≫ max{τk}, signifying that in this case the information in the templates is completely redundant.
To show that the odds ratio in Eq (7) is conserved under the tree transformation in Fig 5B, we integrate over the unknown, hidden distance dh (Eq 9) and apply Eq (5) to the second term in the integrand (Eq 10).
We now make the very reasonable assumption that the evolution of the distance between pairs of atoms manifests diffusive behaviour. This behaviour results if the change in distance can be modelled by many small, independent changes, each change being the consequence of a sequence mutation that slightly changes the protein structure. Concretely, this means that the probability of observing a distance dl after an evolutionary time τkl, when the distance in the ancestor was dk, is given by a Gaussian (Eq 11) with some rate constant γ. Note that at time τkl = 0 the standard deviation vanishes and the right-hand side becomes equal to the delta function, as it should. Substituting the conditional probabilities in the integral with these expressions, we see that the integral is over a product of Gaussians and can be solved analytically by completing the square (see Suppl. Material). This results in a Gaussian distribution, which is shown in the Supporting Information to be equivalent to the tree with transformed weights given by Eq (8).
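The key property used in this step, that integrating a product of Gaussians over the hidden distance again gives a Gaussian, can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the transition density to be a Gaussian centred on the ancestral distance with variance γ·τ (the usual diffusion form, not copied from Eq 11), integrates over the whole real line rather than over non-negative distances, and uses made-up values for the distances and edge lengths. All function names are hypothetical.

```python
import math

def gauss(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def transition(d_to, d_from, tau, gamma=1.0):
    """Assumed diffusive transition density: variance grows as gamma * tau."""
    return gauss(d_to, d_from, gamma * tau)

def marginal_over_hidden(d_k, d_0, tau_0, tau_k, gamma=1.0,
                         lo=-50.0, hi=50.0, steps=20001):
    """Integrate out the hidden distance d_h with the trapezoid rule."""
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        d_h = lo + i * h
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * transition(d_h, d_0, tau_0, gamma) * transition(d_k, d_h, tau_k, gamma)
    return total * h

# Illustrative values only: top-node distance 6.0, template distance 7.5.
numeric = marginal_over_hidden(d_k=7.5, d_0=6.0, tau_0=0.8, tau_k=1.7)
# Under the assumed model the hidden node drops out and the edge lengths add.
closed_form = transition(7.5, 6.0, 0.8 + 1.7)
print(numeric, closed_form)
```

Under this assumption the two printed numbers agree to numerical precision, which is the single-template version of what makes the tree transformation tractable: a hidden node can be removed and its edge length absorbed into the edges below it.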
For simplicity, we use the UPGMA algorithm to construct the initial tree. The distances are computed as dist(tk, tl) = −log(TMscorepred(tk, tl)), where TMscorepred is the TM score predicted by a neural network similar to the one in the next subsection (Supplemental Fig. S1), but without the experimental resolution as input feature. The tree constructed in this way is subsequently rearranged so that the query q is at its root.
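As a rough illustration of this tree-building step, the following sketch converts pairwise TM scores to distances with −log and clusters them with a plain UPGMA loop. It is a minimal sketch, not the authors' code: the TM scores are invented, the tree is returned as nested tuples of the form (left, right, height), and the final re-rooting at the query is left out.

```python
import math

def tm_to_distance(tm_score):
    """Distance used for tree building: -log of the (predicted) TM score."""
    return -math.log(tm_score)

def upgma(names, dist):
    """Plain UPGMA. `dist` maps (a, b) name pairs to distances.

    Returns the tree as nested tuples (left, right, height).
    """
    clusters = {n: (n, 1) for n in names}  # key -> (subtree, leaf count)
    d = {frozenset(pair): v for pair, v in dist.items()}
    while len(clusters) > 1:
        # Find the closest pair of active clusters.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda p: d[frozenset(p)],
        )
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2.0
        new = "(" + a + "," + b + ")"
        # Size-weighted average distance to every remaining cluster.
        for c in clusters:
            d[frozenset((new, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b, height), n_a + n_b)
    return next(iter(clusters.values()))[0]

# Invented TM scores: t1 and t2 are close, t3 is distant from both.
tm = {("t1", "t2"): 0.9, ("t1", "t3"): 0.5, ("t2", "t3"): 0.5}
dist = {pair: tm_to_distance(score) for pair, score in tm.items()}
tree = upgma(["t1", "t2", "t3"], dist)
print(tree)
```

With these scores t1 and t2 are merged first, mirroring the situation in Fig 5A where two closely related templates sit on a common branch.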
Note that by its construction the final tree with star-like topology has the same edge lengths between the query and any template as the real tree. This is important, since the restraint function for template tk from the mixture density network depends on the similarity between q and tk. In order for the new star-like tree to be equivalent to the real one, it has to represent the same pairwise q − tk similarities as the real tree.
Single template selection.
HHsearch ranks templates by the probability Phom for the template to be homologous to the query protein. To pick the template best suited for homology modeling, we trained a simple neural network with three hidden nodes (Supplemental Fig. S1) on the training set (see Results). The network predicts the TM score of the model built with the query-template alignment, given various alignment features described in Table B in S1 Text. The idea is similar to earlier work that proposed a neural network (NN) for picking the first template. We tried several feature combinations and, similar to previous work, found that the following features yielded the best results: HHsearch raw score, secondary structure similarity score divided by query length, expected number of correctly aligned target residues divided by query length, and resolution of the template structure in Angstroms. For each query, we picked the protein with the highest predicted TM score among all proteins found by HHsearch as the first template.
Multiple template selection.
Picking the right set of templates for homology modeling is a difficult problem. The main beneficial effect of adding more templates is to increase the number of residues for which distance restraints can be generated. However, picking too many templates can decrease the model quality because, as we discussed in the context of how Modeller's restraints work, even a single bad template that gives rise to wrong distance restraints can severely distort the resulting 3D model.
To our knowledge, no theoretically well-founded strategy for multi-template protein homology modeling has been developed so far, which contrasts with its widespread use in virtually every successful prediction pipeline. In contrast to single template selection, picking further templates is fundamentally complicated by complex dependencies between all selected structures. Current methods are therefore based on heuristics [23–25]. Some methods [26, 27] build a set of models based on several different template lists and then post-select a final model according to some quality measure.
As a simple baseline approach to multiple template selection, we employ the network of the previous section to select the first template. Further templates are added if 1) their predicted TM score is at least 90% of the first template, 2) they are structurally similar to the first template (TM align score > 0.7) and 3) all selected templates are structurally similar to each other (pairwise TM align score > 0.8).
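The three criteria of this baseline can be written down as a short filter. The sketch below is one possible reading, not the authors' implementation: it assumes that criterion 3 applies among the additional templates (the first template being covered by criterion 2), that candidates are considered in order of decreasing predicted TM score, and that `predicted_tm` and `tm_align` are supplied by the caller. All names and the example numbers are hypothetical.

```python
def select_baseline_templates(candidates, predicted_tm, tm_align,
                              rel_cutoff=0.9, first_cutoff=0.7, pair_cutoff=0.8):
    """Baseline multi-template selection following the three criteria.

    candidates   -- iterable of template identifiers
    predicted_tm -- dict: template -> predicted TM score of the resulting model
    tm_align     -- function (a, b) -> structural similarity of two templates
    """
    ranked = sorted(candidates, key=lambda t: predicted_tm[t], reverse=True)
    first = ranked[0]
    selected = [first]
    for t in ranked[1:]:
        # 1) predicted TM score at least 90% of that of the first template
        if predicted_tm[t] < rel_cutoff * predicted_tm[first]:
            continue
        # 2) structurally similar to the first template
        if tm_align(t, first) <= first_cutoff:
            continue
        # 3) structurally similar to every additional template already chosen
        if all(tm_align(t, s) > pair_cutoff for s in selected[1:]):
            selected.append(t)
    return selected

# Invented example data.
predicted_tm = {"a": 0.80, "b": 0.78, "c": 0.75, "d": 0.60}
pair_scores = {
    frozenset(("a", "b")): 0.85,
    frozenset(("a", "c")): 0.75,
    frozenset(("b", "c")): 0.70,
    frozenset(("a", "d")): 0.90,
}
chosen = select_baseline_templates(
    predicted_tm, predicted_tm, lambda x, y: pair_scores[frozenset((x, y))]
)
print(chosen)
```

In this example "d" fails criterion 1 and "c" fails criterion 3 against "b", so only "a" and "b" are kept.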
Next, we propose here a heuristic method which aims to optimise the trade-off between increasing the query sequence coverage and decreasing the restraint quality of already covered residues due to adding more diverged templates with less reliable alignments.
We select the set of templates from among the top 100 found by HHsearch in the following way (Fig 6). The first template t1 is selected by the neural network that predicts the TM score. For each template in the template list (lower dashed box in the figure), a score S(t) (see Eq 14) is (re)calculated that rewards a high coverage while penalising the addition of templates whose alignment quality is worse than that of already selected templates. The template with the highest score (t4 in Fig 6) is added to the selected set if its score is still positive. The process is iterated until no template is left in the list that has a positive score.
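The control flow of this iterative selection can be sketched as a greedy loop. The scoring function in the sketch is a hypothetical stand-in and not Eq 14: it simply counts newly covered residues and subtracts a penalty for overlapping residues when the candidate's alignment quality is lower than that of the templates already selected. The template dictionaries, field names and numbers are likewise invented for illustration.

```python
def greedy_select(first, candidates, score):
    """Repeatedly add the best-scoring template while its score is positive."""
    selected = [first]
    remaining = [t for t in candidates if t != first]
    while remaining:
        # Scores are recalculated in every round against the current selection.
        scored = [(score(t, selected), t) for t in remaining]
        best_score, best = max(scored, key=lambda p: p[0])
        if best_score <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

def toy_score(template, selected, penalty=2.0):
    """Hypothetical stand-in for S(t): coverage gain minus a quality penalty."""
    covered = set().union(*(s["residues"] for s in selected))
    gain = len(template["residues"] - covered)
    overlap = len(template["residues"] & covered)
    quality_gap = max(0.0, min(s["quality"] for s in selected) - template["quality"])
    return gain - penalty * overlap * quality_gap

templates = [
    {"name": "t1", "residues": set(range(1, 51)), "quality": 0.9},
    {"name": "t2", "residues": set(range(40, 81)), "quality": 0.8},
    {"name": "t3", "residues": set(range(1, 51)), "quality": 0.5},
    {"name": "t4", "residues": set(range(45, 61)), "quality": 0.3},
]
result = greedy_select(templates[0], templates, toy_score)
print([t["name"] for t in result])
```

Here t2 is added because it extends coverage at only slightly lower quality, after which neither t3 nor t4 adds new residues and the loop stops.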
We introduced SOLart as the first structure-based solubility prediction method, which is able to quickly and accurately predict the solubility of a protein from its experimental or modeled 3D structure.
SOLart employs a series of features, including the sequence-based features that are commonly used for solubility prediction and some classical structure-based features such as secondary structure composition and solvent accessibility. In addition, it takes advantage of the ability of solubility-dependent statistical potentials to discriminate the residue interactions that favor aggregation or solubility. Besides the distance potentials that have previously been analyzed (Hou et al., 2018), 10 new solubility-dependent potentials were introduced here, which describe the local propensities of residues to adopt certain backbone torsion angle domains or to have certain solvent accessibility values in soluble or aggregation-prone proteins. Note that the feature importance analyses show that the torsion, solvent accessibility and distance potentials are the most important features in the random forest regression prediction. The folding free energy differences computed with these potentials are better correlated with solubility than other protein properties analyzed in the literature, such as protein length, isoelectric point and aliphatic index.
The performance of SOLart is high and robust: the linear correlation on both the training dataset (in cross-validation) and on three independent test sets almost reaches 0.7 on good-resolution X-ray structures and is slightly lower on modeled structures. Moreover, using relaxed sequence identity cutoffs between test and training set proteins and between any pair of proteins of the same set hardly changes the scores, as shown in Supplementary Table S3. Furthermore, SOLart performs similarly on the training and testing datasets, which again indicates its robustness and absence of bias toward the training set.
Finally, SOLart outperforms the state-of-the-art solubility predictors on an independent dataset containing S. cerevisiae proteins, with an increase of 0.1 up to 0.5 in the correlation coefficient between the predicted and the experimental values of the solubility. This provides a strong demonstration of SOLart's accuracy and usefulness.
Another advantage of SOLart is its speed: it is able to predict the solubility of a medium-size protein in less than one minute. This makes the tool well suited to investigating protein solubility properties on a large scale.
It is important to underline that SOLart can be used with modeled structures, as this largely expands the domain of applicability of our tool. Indeed, whereas 36% of the proteins from the Esol E. coli and Esol S. cerevisiae datasets have an experimental PDB structure, this percentage increases to about 68% if one also considers protein structures modeled by homology.
As an example of a promising application of SOLart, let us mention rational antibody design, where the solubility issue is frequently a major bottleneck, and one can take advantage of the high quality of homology modeling applied to antibody structures.
Even though SOLart performs well, a lot of work is still needed to unravel the various effects and to understand the biophysical mechanisms underlying solubility and aggregation. One direction is to design better energy functions that describe these phenomena more efficiently, by enlarging the protein datasets with experimental solubility values or by modifying their original formulation. For example, the definition of the reference state that is adequate for solubility properties is still an open problem. It has been argued that interactions between unfolded conformations could lead to insoluble aggregates and, indeed, inclusion bodies forming in heterologous expression in E. coli have been shown to involve folded, unfolded, misfolded and partially folded proteins (Martínez-Alonso et al., 2009; Baneyx and Mujacic, 2004; Singh and Panda, 2005; Singh et al., 2015; Vallejo and Rinas, 2004), which makes it challenging to disentangle the characteristics contributing to their formation.
Note also that the definition of the solubility (S) used in this article differs from the physical definition of solubility (S0), measured in g/l, defined as the concentration of a protein in a saturated solution that is in equilibrium with a solid phase. To gain insight into the relation between these two solubility definitions, they should be compared systematically. This is currently impossible, as no large datasets of S0 values are available due to the difficulty of measuring it experimentally.
A final perspective concerns industrial biotechnological applications, in which water is replaced by other polar solvents or even by non-polar solvents. Understanding how the protein solubility changes according to the type of solvent and being able to accurately predict this change is a major target for computational tools. On the same footing, it would also be important to understand and predict the influence of buffer salts and ionic strength on the solubility properties of proteins.
In summary, SOLart is a new and efficient method to predict protein solubility. Thanks to its user-friendly interface, both expert and non-expert users can use its predictions to analyze and improve the solubility properties of targeted proteins involved in biotechnological processes, where solubility is frequently a major bottleneck.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a positive-sense, single-stranded RNA coronavirus. It is a contagious virus that causes coronavirus disease 2019 (COVID-19).
We modelled the full SARS-CoV-2 proteome based on the NCBI reference sequence NC_045512 and annotations from UniProt.
We integrated the identification of transmembrane proteins for templates and transfer that information to the models.
Define, view and share your own annotations in UniProt space, and view them on models and structures in the SWISS-MODEL Repository.
Select homology models and experimental structures from SWISS-MODEL, the Repository and the tools, and compare them in one multiple-structure view.
When you publish or report results using SWISS-MODEL, please cite the relevant publications:
- Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F.T., de Beer, T.A.P., Rempfer, C., Bordoli, L., Lepore, R., Schwede, T. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46(W1), W296-W303 (2018).
- Bienert, S., Waterhouse, A., de Beer, T.A.P., Tauriello, G., Studer, G., Bordoli, L., Schwede, T. The SWISS-MODEL Repository - new features and functionality. Nucleic Acids Res. 45, D313-D319 (2017).
- Guex, N., Peitsch, M.C., Schwede, T. Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: A historical perspective. Electrophoresis 30, S162-S173 (2009).
- Studer, G., Rempfer, C., Waterhouse, A.M., Gumienny, R., Haas, J., Schwede, T. QMEANDisCo - distance constraints applied on model quality estimation. Bioinformatics 36, 1765-1771 (2020).
- Bertoni, M., Kiefer, F., Biasini, M., Bordoli, L., Schwede, T. Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Scientific Reports 7 (2017).
Predicting protein structure from cryo-EM data
Proteins play an important role in many crucial biological processes, and determining their structure is a critical step to understand their functionality: the structure of a protein dictates whether and how it can interact with other molecules. Researchers can then use this structural information, for instance, to assist in the development of new drugs and vaccines. Predicting protein structure is, however, a challenging problem and it has been an active research topic for many years.
In a recently published work, Dong Si and colleagues take advantage of cryo-electron microscopy (cryo-EM) data to predict the structure of proteins. Cryo-EM, a technology recognized with a Nobel Prize in 2017, has gained popularity for capturing 3D maps of macromolecules at near-atomic resolution. The authors propose a tool called DeepTracer, which takes as input a protein's cryo-EM map and amino acid sequence, and outputs its all-atom structure using a tailored deep learning framework. Unlike other cryo-EM model determination methods, DeepTracer has the advantage of performing multichain prediction, requiring no manual processing steps, and achieving more accurate results.
The proposed method relies on a convolutional neural network that consists of four U-Nets, each of them designed to predict a specific structural aspect: the locations of amino acids, the location of the backbone, the secondary structure elements, and the amino acid types. A series of fully automated post-processing steps are then applied to the outputs of these U-Nets to return the final predicted structure. In comparisons with state-of-the-art methods (for example, Phenix, Rosetta and MAINMAST), the authors demonstrated that DeepTracer has better accuracy: for instance, when compared to Phenix on a set of coronavirus-related data, it improved coverage (the proportion of residues that have a matching interpreted residue) by over 30%, and it decreased the root-mean-square deviation value by more than 0.40 Å. In addition, the tool was shown to be computationally efficient when running on a graphics processing unit (GPU): as an example, it traced a cryo-EM map containing approximately 60,000 residues within two hours. Overall, DeepTracer is an exciting new method for protein structure prediction that will certainly help move the field forward.