Anisimova, Cannarozzi, and Liberles: Finding the balance between the mathematical and biological optima in multiple sequence alignment

Finding the balance between the mathematical and biological optima in multiple sequence alignment


Recent advances in evolutionary modelling and alignment methodology enable alignment of sequences with special features and incorporate structural and functional information. However, our reviewing experience and a recent study by Morrison1 suggest that these newer methods are under-utilized (especially in the communities of molecular systematics and experimental biology), and the resulting alignments are often curated manually. Most often, no clear biological reasoning is invoked during manual alignment; instead only aesthetic qualities are considered, as measured by eye. Such subjectivity is not consistent with core scientific principles. Although we recognize that methodological problems still exist, computerized alignment methods are currently more realistic and can model a variety of evolutionary mechanisms. We also suggest future directions for the further improvement of automatic alignment methods based upon disconnects of existing methods with underlying biological mechanisms.

Alignment as a statement of homology

Multiple sequence alignment (MSA) is central to standard bioinformatics pipelines for comparative genomic and systematic analysis. An alignment is ultimately a statement of homology, so that each column in the alignment is thought to have descended from a common ancestral state in evolutionary history. This intimately links MSA to other downstream methods of evolutionary inference, including phylogenetic tree construction.

Evolutionary history and shared common origins of columns in an MSA are assessed by similarity of characters of sequence features. Indeed, dating back to Zuckerkandl and Pauling2 and the origins of molecular clocks, it was realized that sequences diverge with time and that sequence identity or similarity was an indicator of sequence homology. From this, there is an expectation that sequences of increasing evolutionary distance will be increasingly divergent. The rate and patterns of divergence will be dictated by macro-evolutionary and population genetic processes and protein function and structure.3 A recent study quantifying the relationship between rates of sequence and structural divergence suggested that structure diverges three to ten times more slowly than sequence, as aspects of protein structure are critical to protein function and are thus under strong negative selective pressure.4,5 These considerations have led to the development of alignment methods based upon patterns of sequence divergence. While the first alignment algorithms were tuned to align primarily protein data6,7 alignment algorithms have recently improved, diversified and became better adapted to data other than proteins, including DNA,811 coding DNA,1216 RNA,17,18 and to sequences with special features, such as repeats, rearrangements, and promoter regions.19,20 Further advances in methodology provided bioinformaticians with more data on structural and functional features of proteins – and statistical methods were proposed to incorporate these features in alignment optimization algorithms.2125 Moreover, recent advances allow quality evaluation for each column aligned, providing valuable information for upstream analyses.2629

Sadly, our reviewer experience and a recent survey1 suggest that state-of-art alignment methods are not commonly used, but older and less accurate algorithms and their implementations are still preferred. Many empirical scientists recognize that the computational methods for sequence alignment are problematic, and often use manual alignment to adjust the alignment by eye. Overall less than 1% of surveyed papers used the best performing methods like MAFFT,30 MUSCLE31 and ProbCons,26 while 50–75% resorted to the familiar CLUSTAL,7 and manual intervention.1 Here, we discourage manual editing of alignments, on the basis of its lack of objectivity and reproducibility. We urge the greater use of recent alignment techniques, including those that incorporate a priori knowledge where it is available. We recognize the importance of optimality criteria and suggest that manual alignments should be compared objectively. A better understanding of recent methodological advances and benchmarking will facilitate the use of better alignment methods. The prevalence of the first generation program CLUSTAL may be partially due to the fact that it is embedded in many web servers.

How do alignment methods compare?

One example of alignment is provided in Figure 1. Many proteins involved in signal transduction contain the Src-homologous SH3 domain of about 60 amino acids long. To demonstrate the range of the differences between alignment methods, a collection of seven SH3 domain sequences were aligned using six programs. Figure 1A shows the alignments (PDB identifiers: 2o9s-A, 1shg, 1gfc, 1pkt, 1srm, 1pnj, 2hsp). All sequences share a core beta-barrel structure consisting of five or six strands (yellow) arranged in two beta sheets. The structures 1pnj and 1pkt also contain an insertion with some helical structure (green).

Figure 1

(A) Alignments of SH3 domains with PDB identifiers: 2o9s-A, 1shg, 1gfc, 1pkt, 1srm, 1pnj, and 2hsp.The alignments were performed with a structural aligner, TM-Align and a variety of alignment programs (Clustal W 2.1, PRANK, MUSCLE, T-Coffee, MAFFT, FSA). Beta strands are yellow while alpha-helices are green. (B) Reliability assessment of M-COFFEE. M-COFFEE provides reliability assessment as indicated by color on a scale from blue (low confidence in homology) to red (high confidence in homology). (C) Reliability assessment of FSA. FSA provides reliability assessment as indicated by color on a scale from blue (low confidence in homology) to red (high confidence in homology).


The first alignment (Figure 1A) is a structural alignment from the program TM-Align32 as implemented in STRAP.33 There are several points worth remarking on the structural alignment. The sum of pairs score for the structural alignment is negative, reflecting the differences in alignment that one gets with a structural vs. a similarity scoring criterion. Additionally, the structural assignments for identical amino acids are different in 1pkt and 1pnj. As a consequence the identical string ‘KGSLVAL’ in the insertion is not aligned. This could be an example of the difference between structural and sequence homology, the phenomenon of structure sliding along the sequence, where non-homologous positions adopt structurally identical roles whereas the homologous positions play alternative roles in stabilizing structure.

Also shown are the alignments created with default settings of Clustal W, PRANK, MUSCLE, T-COFFEE and MAFFT as implemented in STRAP and the FSA alignment from the FSA webserver. The differences in the alignments in the gap regions are noteworthy. T-Coffee and FSA create alignments with no gaps in the secondary structure. Thus the highest scoring algorithms also perform the best when aligning secondary structure (even without imposing structural constraints to guide the alignment process).

Figures 1B and 1C show the reliability assessment of M-COFFEE and FSA, obtained from their respective webservers. Both of these programs provide a color scale to assess reliability from blue (low confidence in homology) to red (high confidence in homology). Judging from the differences in the alignments, the probabilistic reliability assessment of FSA seems to provide a better view of the variability of the alignment.

Benchmarking is used to perform automated assessment of the performance of different alignment algorithms. Various benchmarking datasets have been constructed to standardize evaluation of protein alignment algorithms: BaliBASE,34 HOMSTRAD,35 PREFAB,31 and OXBench36 are based on 3D structural superposition. BaliBASE was the first large test set of protein families and is comprised of manually curated alignments, sorted into reference sets based on sequence identity, length and sequence characteristics. To avoid uncertainty in the test set, only core blocks of reliable aligned sequences are part of the reference alignments. Although this increases the reliability of the assignments, it avoids precisely the parts of the alignment that are difficult to align. Because of this problem, some benchmarking datasets (BaliBASE and OXBench) now include full length sequences as well. There are also test data sets based on simulated data (IRMBASE).37,38 The benefit of simulations is that the correct alignment is known with no uncertainty. However, the relevance of the simulated data to real sequence evolution generates uncertainty of another kind. A recent development is the use of a phylogeny criterion rather than structure-based criteria to assess alignment algorithm performance.39 In any case, the downstream use of the MSA as well as the properties of the sequence data may dictate the best performing algorithm.

The reality and the choice of the multiple sequence alignment optimality criterion

The search for optimal alignment using dynamic programming scales as a function of alignment length N and the number of sequences S (typically of O(NS) complexity), and is intractable, especially for MSAs in the genome-sequencing era. As a result, approximate algorithms or heuristic methods have been developed to rapidly align larger numbers of sequences. Of note are two classes of algorithms, divide and conquer and progressive alignments. Divide and conquer40 and related methods like POA41 subdivide the sequence into shorter fragments that are aligned and then combined to generate a global alignment. Progressive methods are the most prevalent and align sequences as a combination of pairwise alignments weighted by an underlying guide tree. The tree is traversed from leaves to root (post-order traversal of the tree), aligning the more closely related sequences first and continuing by using the information at the child nodes to construct the partial alignment at the parent nodes. Alignment methods may differ in the manner in which the two sub-alignments at child nodes are combined. Common programs like Clustal,7 MUSCLE,31 and MAFFT30 use progressive alignment as the underlying heuristic. More successful programs include an additional phase of iteration to correct mistakes created early in the progressive algorithm.

A critical part of all strategies is assessing the quality of the alignment, typically based on an optimality criterion that uses a character substitution or similarity matrix and a gap penalty scheme. The accuracy of an alignment method inherently depends on its optimality criterion, as it determines whether the alignment changes made by the heuristics may be accepted or not. Good optimality criteria are therefore critical. It has been pointed out eloquently by Kemena and Notredame42 that the design of an optimality criterion should be motivated by the biological processes responsible for generating the molecular data at hand. Morrison1 argued that the mathematical functions we optimize cannot ensure the convergence to the biological optimum as many underlying processes and features of real data are not adequately modelled, such as in the case of evolutionary scenarios with repeats, inversions, and frequent gaps. This, together with poor understanding of recent methodologies, leads researchers to resort to manual alignment.

Despite the inherent difficulties of modelling biological reality (no model is likely to ever be true), much progress has been made to improve alignment strategies and to make optimality criteria more realistic. Compared to a decade ago, we are now much closer to reaching the biological optimum, based on rigorous statistical criteria incorporating functional, structural or evolutionary views on alignment.42 Relatively recently, pair hidden Markov models (HMMs) were applied to MSAs. Thanks to posterior decoding, where the most likely hidden state when an emission occurs is selected, HMM-based methods may incorporate more complex alignment scoring schemes.26,27

We note that defining the biological optimum may be very challenging. Scientists with different backgrounds (biochemists, evolutionary biologists, structural biologists, biophysicists, population geneticists, etc.) are likely to define distinct and non-overlapping biological optima. Ultimately, however, they are all describing biological processes that reflect descent from a common ancestor.

Indeed, such a discussion has already been broached with the suggestion that the best alignment for optimizing the evolutionary signal (phylogeny-driven) may not be the same as the best alignment based upon structural criteria. The use of structural criteria as benchmarks in multiple sequence alignment may then bias towards alignments optimized for structure and energetics, but which may not always reflect evolutionary history to the extent that sequences can slide through structures during evolution to find alternative thermodynamic optima.39 The complex relationship between protein thermodynamics and population and macro-evolutionary processes including selection will need to be considered in generating an optimal scoring function as structural alignments will contain evolutionary information that is lost at the sequence level.43

Parsimony, homoplasies, and alignment

Manual editing of a computed MSA typically involves an even further minimization of the number of gaps and mismatches by eye. This is usually justified by a classical parsimony argument, but such practices disregard the underlying biological process, which may involve high degrees of complexity. While the computed alignments often are not the most parsimonious, evolution may not have generated the most parsimonious patterns of sequence diversity in aligned blocks.3,44 Treating homoplasies as alignment mistakes excludes the possibility that a complex underlying process may have generated this pattern. To describe the inherent complexity of observed data, even parsimony methods attempt to include extra layers of complexity.45

Note that parsimony is not inconsistent with the presence of homoplasies. The distribution of observed patterns is defined by the underlying process, which generates data. An alignment with frequent gaps and homoplasies may well be the simplest description of data given the heterogeneous process resulting from an interplay of different evolutionary forces acting on molecular sequences (such as selection, recombination, compensatory changes, gene conversion, and composition biases in adaptation to the environment). It is a mistake to assume that aesthetically parsimonious patterns achieved by manual editing are representative of the true simplicity, without a better knowledge of the underlying biological process. Longer alignments are often more biologically meaningful than shorter and more parsimonious ones,10,46 but would have been almost impossible to achieve by manual editing.

Three additional reasons for not trusting manually edited alignments

While it is hard to imagine accounting for complex biological forces in a manual alignment procedure, the alignment programs attempt to model the underlying biological process and use formal criteria to score the resulting alignments. While it is easy to underestimate the complexity of the alignment problem (especially for similar sequences), in computer science it has been long known as an NP-hard problem.47 The likelihood of manually making “sensible” changes to an alignment rapidly decreases with the increase of number and the length of sequences to be aligned. But even for datasets of a manageable size, an important question is what we assume to be “sensible”. The subjectivity of the manual alignment editing is one major pitfall that is inconsistent with doing proper science. Alignment algorithms use objective functions to evaluate and compare candidate alignments. Alignments obtained through subjective minimization of homoplasies without relying on an optimality criterion, are based on prejudices and open the way for introducing researcher-specific biases.

From this follows another important pitfall of manual alignment editing. The lack of statistical criteria to compare candidate alignments results in the inability to show that the manually edited alignment is significantly better than the one produced automatically. Finally, manual alignment curation is non-algorithmic and therefore not reproducible, defying one of the most important scientific criteria.

At the very least, accepting certain alignment alterations has to be done after: i) a statistical comparison of the manual alignment to other candidate alignments (for example, possible with M-coffee;48 also38,46,10 ii) the procedure for making the alignment changes has to be rigorously described and based on objective biological knowledge (for example, functional, active sites, structure elements) rather than on a parsimonious gut feeling.

One simple way of automatically curating alignment quality is to remove ambiguously aligned regions from subsequent phylogenetic analysis, which can be done with the popular program Gblocks.49 However, the effect of applying Gblocks on downstream tree accuracy is controversial.39,50 Strategically, if Gblocks is removing regions that are improperly aligned, then this is a band-aid covering the need for better models that produce better alignments. Alternatively, if Gblocks is removing properly aligned, but rapidly evolving regions, it is then introducing a bias to downstream analysis, as the most conserved sites may not be those that produce the strongest phylogenetic signal.51

The four disconnects in alignment methodology

Recent years have seen significant advances towards more biologically realistic alignments (see, for example, the review of Kemena and Notredame 2009).42 Still, there are several apparent discrepancies between the optimality criterion and the underlying biology; and a number of emerging independent trajectories were proposed in the literature to treat each such disconnect. The first disconnect involves the evolutionary process that presumably has contributed to generating the sequences in hand. However, optimality criteria used for scoring alignments are not based on evolutionary models, unlike the likelihood function in phylogenetic analyses that requires an explicit evolutionary model. This results in a disconnect: the alignment scoring function does not reflect the evolutionary process that contributed to generating the data. As better character substitution models are developed, they should also be applied to alignment optimization (see discussion below). However, our understanding of processes driving indel-formation is still poor, bringing us to the second disconnect: the lack of adequate indel models in alignment optimization. Gap penalties currently used by alignment methods are fundamentally different from the underlying probabilities of observing gaps of different lengths.52 A uniform distribution of indels is usually assumed along the length of a sequence, which is also very unrealistic.5355 Modelling indel distributions and their evolution as part of an explicit evolutionary model is limited56,57 and is currently not widely applied to alignments.

The third disconnect relates to the general lack of integration between population genetic and interspecific models. Indeed it has been shown that population genetics parameters like effective population size shape the interspecific patterns, and so affect the probability of observing different types of substitutions.43,5860

Finally, the two competing views functional or structural vs evolutionary defining criteria for optimal alignment are currently disjoint: each alignment method chooses one criterion ignoring the other, and so creating a fourth disconnect. While the evolutionary view of alignment is most common, for proteins or RNA both indels and substitutions occur in the context of a folded three-dimensional structure, where their effects on ΔGfolding will affect their likelihood of observation, also dependent upon the unknown relationship between ΔGfolding and organismal fitness. Another problem rooted in the structural underpinnings of sequence evolution is the reliance of sequence-based alignment on substitution matrices with an underlying assumption of site-independence for a process that is inherently site-interdependent. Approaches that consider and combine sequence-based and structure-based criteria may be a promising step forward. What has not yet been accomplished is integration of ΔG calculations into sequence-based models for the purposes of alignment. With different underlying assumptions, this has been discussed in the context of fitness.6164

Functional sites and sites that are critical to folding, such as active site and binding cleft residues that are absolutely conserved and cysteines responsible for structural disulphide bridges, can be pre-annotated and treated as anchors, providing a functional basis for alignment.

Better substitution models for an optimality criterion

Over the last decade much work has been done to improve models of molecular evolution. While BLOSUM65 and PAM66 matrices are still habitually used, it is time for the alignment methods to capitalize on the wealth of models available and include them in popular alignment packages. For protein-coding data the use of empirical codon matrices improves the alignment accuracy and has been implemented in alignment procedures.10,12,14 For protein or RNA data, the use of more realistic similarity matrices should equally result in better alignments (eg, see RNA-specific matrix,67 general protein matrix LG,68 or organism-specific mtArt,69 mtREV,70 mtMam,71 rtREV72). Due to among-site heterogeneity, structure or context-specific matrices have a strong potential to improve the alignment accuracy (for example, matrices for transmembrane alpha-helices,73 for combinations of secondary structures and solvent accessibility,7477 or for local sequence-structure contexts78). These matrices give advantage in homology searches (especially when searching for distant homologues), but are poorly utilized for MSA. Biegert and Söding (2009) derived sequence context-specific amino acid similarities that rely on a library of sequence contexts, instead of relying on a single substitution matrix.79 Based on this idea, a context-specific extension of BLAST (CS-BLAST) achieves a two-fold sensitivity improvement. The same idea potentially could help to improve alignment accuracy for distant sequences. In a related development, HMMs were successfully applied to describe and study the evolutionary heterogeneity of biological processes in a genomic sequence. For example, in application to G-protein-coupled receptors, models with hidden site classes were used to study the dimerization mechanisms.80 Embedded within the phylogeny-aware alignment algorithm, a two-level HMM accounts for a number of heterogeneous classes describing distinct evolutionary processes, such as different codon positions or slow and fast evolving sites.10

Better indel models

The distribution of indels and their lengths is clearly dependent on several factors: sequence divergence, location within the sequence, proximity to other indels or functionally important regions, organism-specific factors. Insertion and deletion occur most commonly in loop regions of proteins, where they are less likely to cause steric problems (geometric clashes) in protein folding. Even in the absence of a solved structure, secondary structural information can be considered in alignment methods as different amino acids have different propensities to occur in loop regions.

While typical character substitution models assume a reversible and stationary Markov process at little price, for indels such assumptions are clearly unrealistic.81,82 A simple model of affine gap penalties83 is most frequently used, because few satisfactory and computationally tractable alternatives have been proposed. The choice of penalty parameters is rather arbitrary in practice (resorting to default values in most cases), although the choice of substitution matrix and gap penalties may be optimized.84 While the indel length distribution is commonly described by a geometric (exponential) distribution, empirical studies suggested computationally more demanding solutions, such as the Zipfian distribution85 or a mixture of four exponentials.86

Despite the difficult task, recent work on indel treatment for sequence alignment should not be underestimated. Pair-HMMs6 and similar models such as transducers87,88 were proposed to make modelling of indels more realistic. While Gotoh's gap penalties and the TKF92 evolutionary model57 may be equivalently modelled by a pair-HMM, the probabilistic framework allows additional sophisticated modifications to modelling indels. For example, the “long-indel” model is an extension of TKF91 which allows indels of arbitrary length. The “long-indel” was shown to outperform both TKF models in sequence alignment and may be extended to a non-reversible process.82 Alternative models (for example, exponential decay and extending the standard Markov process to include indel rates) were proposed to describe non-reversible time-dependent indel evolution and applied to gene finding.89,81

However, further methodological advances are required to make these recent models computationally feasible for MSA inference. In a more practical development, using more realistic bi-phasic gap penalties as in ProbCons26 (gap-extension penalty is higher for shorter gaps) was shown to increase alignment accuracy. The most recent attempt to improve indel treatment in sequence alignment, with the aim of avoiding penalizing single insertion events multiple times, proposed to distinguish between insertions and deletions rather than treating them together. This ‘phylogeny-aware’ algorithm is implemented in the program PRANK, which relies on a tree to identify indel regions as insertions and deletions and treats them as such in the subsequent alignment building process. PRANK has been shown to outperform implementations of other algorithms in analysis of biological data.39 ‘Phylogeny-aware’ alignment has the potential to overcome the difficulties in aligning repeat regions identified as problematic by Morrison.1

Simultaneous estimation of alignment and phylogeny

Most alignment methods rely on a guide tree and thus may be affected by using a “wrong” tree to guide the alignment process. On the other hand, phylogenetic inference is typically conducted for a given alignment, and alignment errors may have a downstream effect on the accuracy of the inferred tree.90 Methods have been developed that enable the use of evolutionary criteria in an iterative or simultaneous assessment of alignment together with the phylogenetic tree.29,9194 For example, the POY software95 makes this assessment based upon a parsimony score. Statistically more rigorous developments include methods like BAli-Phy,92,96 StatAlign97 and BigFoot,98 which are formulated in a Bayesian framework, include models of indel evolution (TKF and extensions), and use simulated annealing99 or MCMC to jointly sample posterior distributions of alignments and trees (eg, see Table 1). As a result, such methods are computationally demanding and currently cannot be used for large datasets.

Table 1

Multiple sequence alignment software packages currently available to users (incomplete list).

Software package Characteristic description Indel model Input data Speed1 Uncertainty estimates
Standard alignment programs
Clustal The pioneering and most widely used program to construct MSAs.7
Historically very important, but is outperformed by many current alignment programs. A recent version allows for iteration.111
Affine with position-specific penalties AA, DNA Fast No
FSA A fast probabilistic approach that seeks to minimize expected distance to true alignment (the expected accuracy objective function) and constructs MSA using sequence annealing based on pairwise estimates of homology.87,27 FSA can align thousands of sequences, and also provides the capability to compare different candidate alignments provided by the user (serves as a meta-method). Pair-HMM (equivalent to double affine penalties) AA, DNA Fast-very fast Yes
T-Coffee Coffee is an objective function for computing a consistency score based on the agreement between a library of pairwise alignments of the same sequences.112 The library of alignments can also come from other alignment programs, structural information or be computed from the input sequences. T-coffee is an implementation using this objective function combined with a progressive alignment.113 No explicit penalty scheme (embedded algorithms typically use affine scheme). (like in ProbCons) AA, DNA Moderate No
POA Based on partial order alignments, a representation of an MSA as a directed acyclic graph.8,114 Partial orders can be aligned using dynamic programming using both progressive and iterative algorithms. POA includes the capability to model homologous recombination. Non-affine penalty scheme, which includes gap truncation penalty as well as the standard gap open and extension penalties. T-coffee rewards correctly placed gaps. Recent version includes double affine penalty scheme AA, DNA Fast-very fast No
PRANK A phylogeny aware alignment program that uses a phylogenetic tree to recognize insertions and deletions as separate evolutionary events.10,46 This algorithm was then extended to model regional heterogeneity and evolution.9 This algorithm performed well in the performance analysis based on phylogeny.39 Pair-HMM (equivalent to affine), avoids repeatedly penalizing the same insertion by using a tree to distinguish it from deletion DNA, AA, codons Slow-moderate Yes
MAFFT Offers a variety of alignment strategies including progressive alignment (with very efficient dynamic programming algorithm), iteration, and consistency scoring and allows for choosing from a wide spectrum of accuracy and speed.30 Mafft performs well on many benchmarks.115,39 Affine with position-specific penalties AA, DNA Very fast No
MUSCLE MUSCLE employs pairwise profile alignment with two steps of subsequent refinement.94 Affine with position-specific penalties AA, DNA Fast - Very fast No
ProbAlign Constructs MSAs by maximizing expected accuracy and using partition function methodology and a probabilistic consistency transformation scheme.116 Affine AA, DNA, RNA Moderate Yes
ProbCons Uses maximum expected accuracy algorithm, which combines probabilistic modeling and consistency-based scoring with a pair HMM-based progressive alignment algorithm.26 This algorithm performs well on structural benchmarks and simulations.115 A separate version exists for RNA (ProbConRNA). Pair-HMM (equivalent to double affine penalty scheme) AA Moderate Yes
MUMMALS Uses a probabilistic consistency objective function and pairwise alignment with structural pair-HMM that considers local secondary structure similarities.117 Pair-HMM (equivalent to affine penalty scheme) AA Moderate Yes
Template based alignment programs (using structure, profiles or other features)
R-Coffee Uses RNA structural template computed RNAlpfold and constructs an MSA having the best agreement of sequences and structures.118 No explicit penalty scheme. RNA Moderate No
3D-Coffee Combines sequences and structures to generate high-quality multiple sequence alignments119 EXPRESSO is a version of 3D-Coffee that automatically selects templates via a BLAST search against PDB.120 T-Coffee may also combine MSAs with profile information (PSI-Coffee, 3DPSI-Coffee). No explicit penalty scheme. AA, DNA Moderate No
Praline Homology extension package that uses a PSI-BLAST profile to guide the computation of an MSA.121,122 PRALINE also provides a choice of secondary structure prediction programs that can be used for integrating structural information into the alignment process. Special option in PRALINE is tailored for membrane-bound proteins, by using prediction of transmembrane regions and membrane-specific scoring matrices.123 Affine AA Moderate No
PROMALS3D PROMALS3D uses 3D structural information to guide sequence alignments constructed by PROMALS,22 which is a consistency-based aligner that uses libraries generated with pair-HMM posterior decoding strategy. Sequences in PROMALS are associated with a PSI-BLAST profile. Yes??
Meta methods
Guidance Guidance score tests the consistency of every MSA column obtained from guide trees with respect to a set of MSAs, which is shown to be a good predictor of unreliably aligned regions.29 No explicit penalty scheme DNA, AA, codons Moderate (depends on embedded methods) Yes, HoT or Guidance score
M-Coffee A meta-method, which uses other alignment methods to construct an alignment library and then uses T-coffee to combine these alignments.48 No explicit penalty scheme DNA, AA Moderate (depends on embedded methods) Yes, consensus-based
Joint estimation of alignment and phylogeny
BAli-Phy Joint Bayesian estimation employing Markov Chain Monte Carlo to sample trees, alignments and evolutionary model parameters.92,96 Pair-HMM (equivalent to affine gap penalties) DNA, AA Very slow Yes
StatAlign Joint Bayesian estimation employing Markov Chain Monte Carlo with fast transition kernels to sample trees, alignments and evolutionary model parameters.97 Modified TKF92 DNA, AA Very slow Yes
BigFoot An extension to the StatAlign package, BigFoot performs phylogenetic footprinting by modeling quickly and slowly evolving regions and their breakpoints.98 Extension of TKF92 via HMM transducer DNA Very slow Yes
ALIFRITZ Employs a strategy based on simulated annealing to infer the tree, the alignment TKF92 implemented via triplet HMM and the history of insertions and deletions.99 DNA, AA Very slow-slow No
Genomic aligners
MLAGAN A multiple genome aligner that uses progressive alignment guided by a user-specified tree and based on a sum of pairs metric.106 It is based on LAGAN, a pairwise genome aligner which first maps local alignments to the genome and then uses the map in a global alignment phase. MLAGAN assumes a given species tree. Affine A set of DNA contigs Fast No
Enredo/Pecan Enredo finds colinear segments of the input genomes and treats both genome rearrangements and duplications. It is used in conjunction with Pecan, which then aligns full genomes using a consistency objective function.110 Pecan uses maximum expected accuracy criterion. Pair-HMM (equivalent to double affine penalty scheme) A set of chromosomes, each represented by linear or circular strings of double-stranded DNA Moderate Yes
Ortheus A probabilistic approach that infers the evolutionary history of a multiple sequence alignment in terms of substitutions, insertions and deletions and so constructs an ancestral MSA.110 Ortheus takes a regular MSA as an input but does not rely on a single fixed alignment. Models indel history using a transducer A phylogeny and an MSA Moderate Yes

1 Summarized from evaluations in (87,39,124,110,125,108)

Consequently, it appears tempting to use approximate schemes such as the iterative approach of Saté,91 which at each step of the iteration attempts to improve the alignment and the tree using the best alignment algorithms and ML tree estimation by RAxML.100 Such an approach however is controversial since the tree-building step has no indel model and thus will introduce biases to further iterative steps, especially for divergent sequences.

Overall, the success of alignment-phylogeny co-estimation relies not only on the ability to properly explore the joint tree-alignment space but also on the underlying models of character and indel evolution. However, these are similar models that are being used for the assessment of phylogeny, indicating the need for better models, especially in the placement of indels and integration of thermodynamic considerations. Improved models will then not only improve MSA inference, but also contribute to more accurate phylogenetic tree construction.

Manual alignment vs. the need for better models and methods

Here we suggested that manual alignment as commonly practiced suffers from being ad hoc and is often based upon faulty assumptions about the nature of evolutionary processes. This also has been emphasized in an appraisal of manual alignments by Giribet101 who compared manually edited alignments from different manual-alignment experts. Indeed, the complexity of the alignment problem is easily underestimated when very similar sequences are aligned. But for low levels of divergence, the best alignment programs perform very well. For deeper divergences or larger samples, the prospect of ‘successful’ manual curation rapidly decreases, and in the genomics era, curated alignments rapidly lose their appeal. For “difficult” datasets (divergent, with long sequences or many taxa) current automatic approaches are unlikely to be outperformed by manual editing. Greater use of prior knowledge about the data (like 3D structure, knowledge of active sites, a known pattern expected for a protein domain, etc.) facilitates alignments of greater accuracy that cannot be achieved by hand, but has to be incorporated in the optimality criteria.

Clearly, better models and methods are needed (and appear to be on their way). In the meantime, a systematic assessment of assumptions and manual evaluation of the results of different approaches, including progressive and structural alignment with a clear criterion to integrate the two lines of information appears to be the best way forward. For large-scale approaches, this is clearly not possible and inference should be made with an awareness of any methodological weaknesses or faulty assumptions.

Concluding thoughts

Even the best MSA algorithms produce a certain alignment error, which rapidly increases with divergence. Concerns about alignment accuracy may be better treated if scientists understand that any alignment carries an element of uncertainty. For example, Aurahs et al.102 is one of very few studies to consider several candidate MSAs to infer trees. The uncertainty in MSA inference has to be taken into consideration when making strong conclusions based on one alignment, as equally optimal alignments may lead to different inferences. This calls for a rigorous framework for comparison of candidate alignments, as it is often done with phylogenetic trees.103 Simultaneous Bayesian inference of the alignment and tree provides one way of dealing with such uncertainty where a distribution of alignments and trees is the focus rather than single inferences, obtained in a frequentist framework.

The choice of alignment algorithms should be guided by the type of data, including its size, special features and availability of structural and functional information (Table 1). Applying several suitable alignment algorithms and then using a meta-method to evaluate candidate alignments currently is the best option for navigating through the vast space of possible alignments.

Better models that link the molecular and evolutionary mechanisms underlying the substitution and indel processes to MSAs are in their infancy, but the field is developing. Currently, inference can be made using existing computational approaches and thoughtful considerations of underlying assumptions.

It is exciting that alignment and underlying models are re-emerging as hot topics in bioinformatics. The renewed interest in alignment methodology is caused by growing demands for the analyses of large-scale and genomic data. The recent method FSA can align thousands of long sequences, while using a pair HMM to approximate the indel process on a tree and pairwise alignments based on the sequence annealing algorithm (Table 1).27 Advances in sequencing technologies have allowed the rapid sequencing of full genomes, which in turn is driving advances in methodology for aligning and assembling short reads104,105 and for multiple whole genome alignment (eg, see Table 1; MLAGAN,106 Enredo and Pecan,107109 Ortheus110). Despite a number of recent algorithmic advances the genomics alignment field is still in its infancy, presenting succulent challenges, yet to be solved.


MA and GMC are supported by the Swiss Federal Institute of Technology (ETH Zurich); MA also receives funding from the SNSF (award 31003A_127325); DAL receives funding from an NIH-INBRE grant to University of Wyoming and NSF award DBI-0743374.



DA MorrisonWhy would phylogeneticists ignore computerized sequence alignment?Syst Biol200958150158


E ZuckerkandlL PaulingMolecules as documents of evolutionary historyJ Theor Biol1965835766


DA LiberlesK DittmarCharacterizing gene family evolutionBiol Proced Online2008106673


K IllergårdDH ArdellA ElofssonStructure is three to ten times more conserved than sequence - a study of structural response in protein coresProteins200977499508


C ChothiaAM LeskThe evolution of protein structuresCold Spring Harb Symp Quant Biol198752399405


R DurbinS EddyA KroghG MitchisonBiological sequence analysis: probabilistic models of proteins and nucleic acidsCambridge University Press1998


JD ThompsonDG HigginsTJ GibsonImproved sensitivity of profile searches through the use of sequence weights and gap excisionComput Appl Biosci1994101929


C LeeC GrassoMF SharlowMultiple sequence alignment using partial order graphsBioinformatics20021845264


A LöytynojaN GoldmanA model of evolution and structure for multiple sequence alignmentPhilos Trans R Soc Lond B Biol Sci200836339139


A LöytynojaN GoldmanPhylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysisScience200832016325


K KimM KimY WooA DNA sequence alignment algorithm using quality information and a fuzzy inference methodProg Nat Sci200818595602


A SchneiderGM CannarozziGH GonnetEmpirical codon substitution matrixBMC Bioinformatics20056134134


A Doron-FaigenboimT PupkoA combined empirical and mechanistic codon modelMol Biol Evol20072438897


M AnisimovaC KosiolInvestigating protein-coding sequence evolution with probabilistic codon substitution modelsMol Biol Evol20092625571


C KosiolI HolmesN GoldmanAn empirical codon model for protein sequence evolutionMol Biol Evol200724146479


K KatohG AsimenosH TohMultiple alignment of DNA sequences with MAFFTMethods Mol Biol (Clifton, N.J.)20095373964


IL HofackerSHF BernhartPF StadlerAlignment of RNA base pairing probability matricesBioinformatics20042022227


C NotredameEA O'BrienDG HigginsRAGA: RNA sequence alignment by genetic algorithmNucleic Acids Res199725457080


E BlancoR GuigóX MesseguerMultiple non-collinear TF-map alignments of promoter regionsBMC Bioinformatics20078138138


TM PhuongCB DoRC EdgarS BatzoglouMultiple alignment of protein sequences with repeats and rearrangementsNucleic Acids Res200634593242


O PoirotK SuhreC Abergel3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignmentNucleic Acids Res200432W3740


J PeiNV GrishinMUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural informationNucleic Acids Res200634436474


AR SubramanianS HiranR SteinkampDIALIGN-TX and multiple protein alignment using secondary structure information at GOBICSNucleic Acids Res20103814


J KececiogluE KimT WheelerAligning protein sequences with predicted secondary structureJ Comput Biol20101756180


A BremgesS SchirmerR GiegerichFine-tuning structural RNA alignments in the twilight zoneBMC Bioinformatics201011222222


CB DoMSP MahabhashyamM BrudnoS BatzoglouProbCons: Probabilistic consistency-based multiple sequence alignmentGenom Res20051533040


RK BradleyA RobertsM SmootFast statistical alignmentPLoS Comput Biol20095e1000392


V AholaT AittokallioM VihinenE UusipaikkaA statistical score for assessing the quality of multiple sequence alignmentsBMC Bioinformatics20067484484


O PennE PrivmanG LandanAn alignment confidence score capturing robustness to guide-tree uncertaintyMol Biol Evol20102717591767


K KatohK MisawaKi KumaT MiyataMAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transformNucleic Acids Res200230305966


RC EdgarMUSCLE: multiple sequence alignment with high accuracy and high throughputNucleic Acids Res20043217927


SB PanditJ SkolnickFr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-scoreBMC Bioinformatics20089531531


C GilleC FrömmelSTRAP: editor for STRuctural Alignments of ProteinsBioinformatics2001173778


A BahrJD ThompsonJC ThierryO PochBAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutationsNucleic Acids Res2001293236


K MizuguchiCM DeaneTL BlundellJP OveringtonHOMSTRAD: a database of protein structure alignments for homologous familiesProtein science19987246971


GPS RaghavaSMJ SearlePC AudleyOXBench: a benchmark for evaluation of protein multiple sequence alignment accuracyBMC Bioinformatics200344747


A SubramanianJ Weyer-MenkhoffM KaufmannB MorgensternDIALIGN-T: an improved algorithm for segment-based multiple sequence alignmentBMC Bioinformatics200566666


T LassmannEL SonnhammerAutomatic extraction of reliable regions from multiple sequence alignmentsBMC Bioinformatics20078S9S9


C DessimozM GilPhylogenetic assessment of alignments reveals neglected tree signal in gapsGenome Biol201011R37R37


J StoyeMultiple sequence alignment with the divide-and-conquer methodGene1998211GC4556


C LeeC GrassoMF SharlowMultiple sequence alignment using partial order graphsBioinformatics20021845264


C KemenaC NotredameUpcoming challenges for multiple sequence alignment methods in the high-throughput eraBioinformatics200925245565


S HuzurbazarG KolesovSE MasseyLineage-specific differences in the amino acid substitution processJ Mol Biol2010396141021


A RokasSB CarrollFrequent and widespread parallel evolution of protein sequencesMol Biol Evol200825194353


S WhelanP LiòN GoldmanMolecular phylogenetics: state-of-the-art methods for looking into the pastTrends Genet20011726272


A LöytynojaN GoldmanAn algorithm for progressive multiple alignment of sequences with insertionsProc Natl Acad Sci U S A20051021055762


L WangT JiangOn the complexity of multiple sequence alignmentJ Comput Biol1994133748


I WallaceO O'SullivanD HigginsM-Coffee: combining multiple sequence alignment methods with T-CoffeeNucleic Acids Res2006416929


J CastresanaSelection of conserved blocks from multiple alignments for their use in phylogenetic analysisMol Biol Evol20001754052


G TalaveraJ CastresanaImprovement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignmentsSyst Biol200756564564


SE MasseyA ChurbanovS RastogiDA LiberlesCharacterizing positive and negative selection and their phylogenetic effectsGene2008418226


MSS ChangSA BennerEmpirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignmentsJ Mol Biol200434161731


CP PontingG LunterSignatures of adaptive evolution within human non-coding sequenceHuman Molecular Genetics200615R1705


G LunterC PontingJ HeinGenome-wide identification of human functional DNA using a neutral indel modelPLoS Comput Biol20062e5e5


S CreerChoosing and using introns in molecular phylogeneticsEvol Bioinform2007399108


J ThorneH KishinoJ FelsensteinAn evolutionary model for maximum likelihood alignment of DNA sequencesJ Mol Evol19913311424


JL ThorneH KishinoJ FelsensteinInching toward reality: an improved likelihood model of sequence evolutionJ Mol Evol199234316


J ThorneS ChoiJ YuPopulation genetics without intraspecific dataMol Biol Evol200724166777


PG HiggsCompensatory neutral mutations and the evolution of RNAGenetica1998102–10391101


Z YangR NielsenMutation-selection models of codon substitution and their use to estimate selective strengths on codon usageMol Biol Evol20082556879


DM TavernaRA GoldsteinWhy are proteins marginally stable?Proteins2002461059


MA DePristoDM WeinreichDL HartlMissense meanderings in sequence space: a biophysical view of protein evolutionNat Rev Genet2005667887


S RastogiN ReuterDA LiberlesEvaluation of models for the evolution of protein sequences and functions under structural constraintBiophys Chem200612413444


DM RobinsonDT JonesH KishinoProtein evolution with dependence among codons due to tertiary structureMol Biol Evol2003201692704


S HenikoffJG HenikoffPerformance evaluation of amino acid substitution matricesProteins1993174961


MO DayhoffRM SchwartzBC OrcuttA model of evolutionary change in proteinsAtlas of Protein Sequence and Structure19785345352


NJ SavillDC HoylePG HiggsRNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methodsGenetics2001157399411


SQ LeO GascuelAn improved general amino acid replacement matrixMol Biol Evol200825130720


F AbascalD PosadaR ZardoyaMtArt: a new model of amino acid replacement for ArthropodaMol Biol Evol20072415


J AdachiM HasegawaModel of amino acid substitution in proteins encoded by mitochondrial DNAJ Mol Evol19964245968


Z YangR NielsenM HasegawaModels of amino acid substitution and applications to mitochondrial protein evolutionMol Biol Evol199815160011


M DimmicJ RestD MindellrtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogenyJ Mol Evol2002556573


T MullerS RahmannM RehmsmeierNon-symmetric score matrices and the detection of homologous transmembrane proteinsBioinformatics200117S182S189


D RiceD EisenbergA 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence1J Mol Evol1997267102638


S GongTL BlundellDiscarding functional residues from the substitution table improves predictions of active sites within three-dimensional structuresPLoS Comput Biol20084e1000179


NCW GoonesekereB LeeContext-specific amino acid substitution matrices and their use in the detection of protein homologsProteins2008719109


JM KoshiRA GoldsteinContext-dependent optimal substitution matricesProtein Eng199586415


YM HuangC BystroffImproved pairwise alignments of proteins in the Twilight Zone using local structure predictionsBioinformatics20062241322


A BiegertJ SödingSequence context-specific profiles for homology searchingProc Natl Acad Sci U S A200910637705


OS SoyerMW DimmicRR NeubigRA GoldsteinDimerization in aminergic G-protein-coupled receptors: application of a hidden-site class model of evolutionBiochemistry2003421452231


E RivasSR EddyProbabilistic phylogenetic inference with insertions and deletionsPLoS Comput Biol20084e1000172


I MiklósGA LunterI HolmesA “Long Indel” model for evolutionary sequence alignmentMol Biol Evol20042152940


O GotohAn improved algorithm for matching biological sequencesJ Mol Evol19821627058


RC EdgarS BatzoglouMultiple sequence alignmentCurr Opin Struct Biol20061636873


SA BennerMA CohenGH GonnetEmpirical and structural models for insertions and deletions in the divergent evolution of proteinsJ Mol Evol1993229106582


B QianRA GoldsteinDistribution of Indel lengthsProteins2001451024


RK BradleyI HolmesTransducers: an emerging probabilistic framework for modeling indels on treesBioinformatics200723325862


I HolmesUsing guide trees to construct multiple-sequence evolutionary HMMsBioinformatics200319i14757


E RivasEvolutionary models for insertions and deletions in a probabilistic modeling frameworkBMC Bioinformatics200566363


KM WongMA SuchardJP HuelsenbeckAlignment uncertainty and genomic analysisScience20083194736


K LiuS RaghavanS NelesenRapid and accurate large-scale coestimation of sequence alignments and phylogenetic treesScience200932415614


BD RedelingsMA SuchardJoint Bayesian estimation of alignment and phylogenySyst Biol20055440118


F YueJ ShiJ TangSimultaneous phylogeny reconstruction and multiple sequence alignmentBMC Bioinformatics200910S11S11


RC EdgarK SjölanderSATCHMO: sequence alignment and tree construction using hidden Markov modelsBioinformatics200319140411


A VaronLS VinhWC WheelerPOY version 4: phylogenetic analysis using dynamic homologiesCladistics2010267285


MA SuchardBD RedelingsBAli-Phy: simultaneous Bayesian inference of alignment and phylogenyBioinformatics20062220478


A NovákI MiklósR LyngsøJ HeinStatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary treesBioinformatics20082424034


R SatijaA NovákI MiklósBigFoot: Bayesian alignment and phylogenetic footprinting with MCMCBMC Evol Biol20099217217


R FleissnerD MetzlerA von HaeselerSimultaneous statistical multiple alignment and phylogeny reconstructionSystematic Biology20055454861


A StamatakisRAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed modelsBioinformatics200622268890


G GiribetWC WheelerJ MuonaDNA multiple sequence alignmentsEXS200210714


R AurahsM GökerGW GrimmUsing the Multiple Analysis Approach to Reconstruct Phylogenetic Relationships among Planktonic Foraminifera from Highly Divergent and Length-polymorphic SSU rDNA SequencesBioinformatics and Biology Insights2009315577


S Aris-BrosouHow Bayes tests of molecular phylogenies compare with frequentist approachesBioinformatics20031961824


C TrapnellSL SalzbergHow to map billions of short reads onto genomesNature Biotechnology2009274557


MJ ChaissonD BrinzaP PevznerDe novo fragment assembly with short mate-paired reads: Does the read length matter?Genome Res20091933646


M BrudnoS MaldeA PoliakovGlocal alignment: finding rearrangements during alignmentBioinformatics200319i5462


B PatenJ HerreroK BealEnredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogsGenome research200818181428


B PatenJ HerreroK BealE BirneySequence progressive alignment, a framework for practical large-scale probabilistic consistency alignmentBioinformatics200925295301


M BrudnoCB DoGM CooperLAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNAGenome Res20031372131


B PatenJ HerreroS FitzgeraldGenome-wide nucleotide-level mammalian ancestor reconstructionGenome Res200818182943


M LarkinG BlackshieldsN BrownP ChennaClustal W and Clustal X version 2.0Bioinformatics20072329478


C NotredameL HolmD HigginsCOFFEE: an objective function for multiple sequence alignmentsBioinformatics19981440722


C NotredameD HigginsJ HeringaT-coffee: a novel method for fast and accurate multiple sequence alignment1J Mol Biol2000820517


C GrassoC LeeCombining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problemsBioinformatics200420154656


PAS NuinZ WangERM TillierThe accuracy of several multiple sequence alignment programs for proteinsBMC Bioinformatics20067471471


U RoshanDR LivesayProbalign: multiple sequence alignment using partition function posterior probabilitiesBioinformatics200622271521


J PeiNV GrishinPROMALS: towards accurate multiple sequence alignments of distantly related proteinsBioinformatics2007238028


A WilmDG HigginsC NotredameR-Coffee: a method for multiple alignment of non-coding RNA. NuclAcids Res200836e52e52


O O'SullivanK SuhreC Abergel3DCoffee: combining protein sequences and structures within multiple sequence alignmentsJ Mol Biol200434038595


F ArmougomS MorettiO PoirotExpresso: automatic incorporation of structural information in multiple sequence alignments using 3D-CoffeeNucl Acids Res200634W6048


J HeringaLocal weighting schemes for protein multiple sequence alignmentComput Chem200226459477


J HeringaTwo strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignmentComput Chem19992334164


W PirovanoKA FeenstraJ HeringaPRALINETM: a strategy for improved multiple alignment of transmembrane proteinsBioinformatics2008244927


SME SahraeianBJ YoonPicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequencesNucl Acids Res201038491728


EH MarguliesGM CooperG AsimenosAnalyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genomeGenome Res20071776074

Abstract views:


Article Metrics

Metrics Loading ...

Metrics powered by PLOS ALM

Copyright (c) 2010 Maria Anisimova, Gina Cannarozzi, David A. Liberles

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© PAGEPress 2008-2016     -     PAGEPress is a registered trademark property of PAGEPress srl, Italy.     -     VAT: IT02125780185