Lecture 5: Multiple Sequence Alignment: ClustalW

Gene 508, Evan Eichler, Ph.D.

 

I.                   The Purpose

1)      Determine a consensus sequence from a family of related sequences

2)      Identify regions of conservation and regions of rapid divergence

3)      The most basic first step in phylogenetic analysis: almost all tree-building methods require a multiple alignment as input.

 

II.                The Problem

Rigorous global alignment of multiple sequences is possible, BUT the computational time required is proportional to the product of the sequence lengths.

--In big-O notation the cost becomes exponential in the number of sequences: O(L^k) for k sequences of length L

--If a pairwise alignment requires n x m time, then a multiple alignment of three sequences using the Needleman-Wunsch algorithm would take n x m x o time, etc.
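
To get a feel for this growth, the following minimal sketch counts the cells a full dynamic-programming table would need. The function name `dp_cells` and the example length of 300 are illustrative assumptions, not part of the lecture.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in a full dynamic-programming table.

    Each sequence of length L contributes L + 1 positions (the extra one
    is the leading gap row/column), so the table size is the product of
    (L + 1) over all sequences.
    """
    return prod(length + 1 for length in lengths)


# Table size as sequences of length 300 are added one at a time.
for k in range(2, 6):
    print(k, "sequences:", dp_cells([300] * k), "cells")
```

Each added sequence multiplies the table size by roughly its length, which is why the rigorous approach stops being practical after a handful of sequences.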

 

 

III.             The Solution

Instead of a true multiple alignment…perform a series of progressive pairwise alignments in a step-by-step fashion.

 

The Steps:

 

1) Perform all possible pairwise alignments among the sequences. This becomes impractical beyond some number of sequences: n sequences give n(n-1)/2 pairs. For n = 50 that is 1,225 pairs, or about 20 min if one pairwise alignment takes 1 second.

           

2) Establish a hierarchy of relationships (based on a UPGMA phylogeny or a simple ordering). Distance methods are usually chosen because they are the most rapid.

 

3)      Identify the most similar pair, generate a consensus, and use this consensus (averaged scores) to compare against the next closest members. Repeat.
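
The first two steps can be sketched in a few lines. This is a minimal illustration, not the code of any particular program: `pair_count` and `upgma_order` are hypothetical names, and the distances are made up. Cluster distance is taken as the plain average over all cross-cluster pairs, which is the UPGMA rule.

```python
from itertools import combinations


def pair_count(n):
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def upgma_order(names, dist):
    """Return the order in which UPGMA merges clusters.

    names: list of sequence names
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    Each merge is reported as (cluster_a, cluster_b, distance), where a
    cluster is a tuple of sequence names.
    """
    clusters = [(name,) for name in names]

    def avg_dist(c1, c2):
        # UPGMA distance: mean over all pairs that span the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters and join them.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        merges.append((c1, c2, avg_dist(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges


# Toy distances, invented for illustration only.
names = ["human", "chimp", "mouse", "chicken"]
dist = {
    frozenset(("human", "chimp")): 0.02,
    frozenset(("human", "mouse")): 0.20,
    frozenset(("chimp", "mouse")): 0.21,
    frozenset(("human", "chicken")): 0.40,
    frozenset(("chimp", "chicken")): 0.41,
    frozenset(("mouse", "chicken")): 0.42,
}

print(pair_count(50), "pairs for 50 sequences")
merges = upgma_order(names, dist)
for c1, c2, d in merges:
    print(c1, "+", c2, "at distance", round(d, 3))
```

The merge order is the order in which a progressive aligner would combine sequences and groups: the closest pair first, then the next closest member, and so on.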

 

 

Some Problems with the Solution:

1)      Still requires a trained eye to resolve some obvious discrepancies.

2)      An inherent guide tree is created, which will dictate the subsequent phylogenetic analysis to some degree.

3)      Order of input sequences sometimes matters.

 

 

 

 

      

IV.              A Solution: CLUSTALW

 

·        Thompson, Higgins and Gibson, 1994: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994 Nov 11;22(22):4673-80.

·        “Unabashedly ad hoc”: a heuristic that works sometimes.

·        Why it works:

1)      Weights sequences to compensate for biased representation

2)      Proteins: close sequences (Blosum 80, hard matrix); distant sequences (Blosum 50, soft matrix)

3)      Minimizes a gap blitzkrieg (by increasing the affine gap parameters based on proximity to existing gaps)

4)      Treatment of low-scoring alignments.
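
Point 1 can be illustrated with a small sketch in the spirit of the tree-based weighting of the CLUSTAL W paper: each branch length is shared equally among the sequences below that branch, so sequences with close relatives end up with lower weights. The tree encoding, the function names and the branch lengths are assumptions made for this example.

```python
def leaves_below(node):
    """Names of all leaves under a node.

    A node is (label, branch_length); label is a string for a leaf or a
    tuple of child nodes for an internal node.
    """
    label, _ = node
    if isinstance(label, str):
        return [label]
    names = []
    for child in label:
        names.extend(leaves_below(child))
    return names


def sequence_weights(node, weights=None):
    """Share each branch length equally among the leaves below it."""
    if weights is None:
        weights = {}
    label, length = node
    below = leaves_below(node)
    for name in below:
        weights[name] = weights.get(name, 0.0) + length / len(below)
    if not isinstance(label, str):
        for child in label:
            sequence_weights(child, weights)
    return weights


# Toy tree: A and B are close relatives, C is on its own long branch.
toy_tree = (
    (
        ((("A", 0.1), ("B", 0.1)), 0.3),
        ("C", 0.4),
    ),
    0.0,  # the root has no branch above it
)

weights = sequence_weights(toy_tree)
for name in sorted(weights):
    print(name, round(weights[name], 3))
```

A and B share the 0.3 branch and so each receive less weight than C, which compensates for that part of the family being represented twice.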

 

IMPLEMENTATION

 

clustalw

 

·        Input file: a single file containing multiple sequences in one of six formats (FASTA suggested)

·        Menu driven options
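
For reference, a multi-sequence FASTA file is just a series of ">" header lines, each followed by its sequence lines. The sketch below reads such text into a dictionary; `parse_fasta` and the two example records are hypothetical, written only to show the layout.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record.
            records[name] += line
    return records


# Two made-up records in a single input, as the program expects.
example = """\
>seq1 hypothetical example
MKTAYIAKQR
QISFVKSHFS
>seq2
MKTAYIAKQRQISFVK
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```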

 

(Guy Bottu, http://ben.vub.ac.be/embnet.news/vol2_1/align.html)

**************************************************************

 ******** CLUSTAL W(1.4) Multiple Sequence Alignments  ********

 **************************************************************

 

     1. Sequence Input From Disc

     2. Multiple Alignments

     3. Profile Alignments

     4. Phylogenetic trees

 

     S. Execute a system command

     H. HELP

     X. EXIT (leave program)

 

Your choice: 2

 

 To do a multiple alignment, enter "2" and the following menu is displayed:

 

****** MULTIPLE ALIGNMENT MENU ******

 

    1.  Do complete multiple alignment now (Slow/Accurate)

    2.  Produce guide tree file only

    3.  Do alignment using old guide tree file

 

    4.  Toggle Slow/Fast pairwise alignments = SLOW

 

    5.  Pairwise alignment parameters

    6.  Multiple alignment parameters

 

    7.  Reset gaps between alignments? = ON

    8.  Toggle screen display          = ON

    9.  Output format options

 

    S.  Execute a system command

    H.  HELP

    or press [RETURN] to go back to main menu

 

Your choice: 5

 

 This menu contains some new features with regard to the clustalv program (4, 7, 8).

 Option 5 shows the default parameters.

 

 ********* PAIRWISE ALIGNMENT PARAMETERS *********

 

     Slow/Accurate alignments:

 

     1. Gap Open Penalty       :10.00

     2. Gap Extension Penalty  :0.10

     3. Protein weight matrix  :BLOSUM30

 

     Fast/Approximate alignments:

 

     4. Gap penalty            :3

     5. K-tuple (word) size    :1

     6. No. of top diagonals   :5

     7. Window size            :5

 

     8. Toggle Slow/Fast pairwise alignments = SLOW

 

     H. HELP

 

Enter number (or [RETURN] to exit):
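
To see what the two slow/accurate gap values imply, here is a small arithmetic sketch using the defaults shown above (10.00 and 0.10). It assumes the common affine form, cost = open + extension x length; programs differ on whether the first gap position is also charged the extension, so treat the exact numbers as illustrative.

```python
def affine_gap_cost(length, gap_open=10.00, gap_extend=0.10):
    """Affine penalty for one gap of the given length (assumed convention)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_cost(10)          # a single 10-residue gap
ten_short_gaps = 10 * affine_gap_cost(1)    # ten separate 1-residue gaps

print("one gap of length 10:", round(one_long_gap, 2))
print("ten gaps of length 1:", round(ten_short_gaps, 2))
```

With a high opening penalty and a low extension penalty, a few long gaps are far cheaper than many scattered short ones.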

 

****** MULTIPLE ALIGNMENT MENU ******

 

    1.  Do complete multiple alignment now (Slow/Accurate)

    2.  Produce guide tree file only

    3.  Do alignment using old guide tree file

 

    4.  Toggle Slow/Fast pairwise alignments = SLOW

 

    5.  Pairwise alignment parameters

    6.  Multiple alignment parameters

 

    7.  Reset gaps between alignments? = ON

    8.  Toggle screen display          = ON

    9.  Output format options

 

    S.  Execute a system command

    H.  HELP

    or press [RETURN] to go back to main menu

 

Your choice: 6

 

Hit return to see the default multiple alignment parameters.

 

****** MULTIPLE ALIGNMENT PARAMETERS ******

 

     1. Gap Opening Penalty              :10.00

     2. Gap Extension Penalty            :0.05

     3. Delay divergent sequences        :40 %

 

     4. Toggle Transitions (DNA)         :Weighted

 

     5. Protein weight matrix            :BLOSUM series

     6. Use negative matrix              :OFF

     7. Protein Gap Parameters

 

     H. HELP

 

Enter number (or [RETURN] to exit):
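
The "Delay divergent sequences" setting can be pictured as a simple filter. The sketch below is one reading of the option, not the program's actual code: a sequence is set aside for later if its percent identity to every other sequence falls below the threshold (40 % above). The function name and the identity values are invented for illustration.

```python
def split_by_divergence(identity, threshold=40.0):
    """Split sequences into (aligned_first, delayed).

    identity: dict of name -> dict of other name -> percent identity.
    A sequence is delayed when its best identity to any other sequence
    is below the threshold (an assumed interpretation of the option).
    """
    aligned_first, delayed = [], []
    for name, row in identity.items():
        best = max(
            (pid for other, pid in row.items() if other != name),
            default=100.0,  # a lone sequence is never delayed
        )
        if best < threshold:
            delayed.append(name)
        else:
            aligned_first.append(name)
    return aligned_first, delayed


# Made-up percent identities: C is distant from both A and B.
toy_identity = {
    "A": {"B": 85.0, "C": 35.0},
    "B": {"A": 85.0, "C": 30.0},
    "C": {"A": 35.0, "B": 30.0},
}

first, later = split_by_divergence(toy_identity)
print("align first:", first)
print("delay:", later)
```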

 

****** MULTIPLE ALIGNMENT MENU ******

 

    1.  Do complete multiple alignment now (Slow/Accurate)

    2.  Produce guide tree file only

    3.  Do alignment using old guide tree file

 

    4.  Toggle Slow/Fast pairwise alignments = SLOW

 

    5.  Pairwise alignment parameters

    6.  Multiple alignment parameters

 

    7.  Reset gaps between alignments? = ON

    8.  Toggle screen display          = ON

    9.  Output format options

 

    S.  Execute a system command

    H.  HELP

    or press [RETURN] to go back to main menu