Lecture 12:  Gene Identification and Characterization. 

Gene 508, Evan Eichler, Ph.D.                                                                                  03/03

 

OBJECTIVE:

The purpose of this lecture is to introduce you to programs that can be used for

1) the identification of the most likely physical structure for a gene (cDNA) based on genomic DNA comparison (sim4)

2) ab initio prediction of exons and genes from genomic DNA (genscan).

 

I.  SIM4

            --similarity based tool for aligning expressed sequence to genomic DNA

            --BLAST defines “exon cores”: maximal gap-free exact matches

            --extend into “unmatched regions”

            --heuristics are used to favor configurations that conform to splice site recognition

            --process is repeated under less stringent parameters
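
The splice-site heuristic in the list above can be illustrated with a minimal Python sketch. This is not sim4's code: the function name, the toy sequence, and the restriction to the canonical GT...AG signal are assumptions made for illustration. Given exon endpoints on the genomic sequence, it reports the first and last two bases of each implied intron.

```python
def intron_signals(genomic, exons):
    """Return (donor, acceptor, is_canonical) for each intron between adjacent exons.

    genomic : genomic DNA string
    exons   : list of (start, end) genomic coordinates, 1-based inclusive,
              sorted left to right
    """
    genomic = genomic.upper()
    signals = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # The intron spans left_end+1 .. right_start-1 (1-based), which is
        # the Python slice [left_end : right_start - 1].
        intron = genomic[left_end:right_start - 1]
        donor, acceptor = intron[:2], intron[-2:]
        signals.append((donor, acceptor, donor == "GT" and acceptor == "AG"))
    return signals


# Toy example (made up): exon 1-6, intron 7-16, exon 17-22.
toy_genomic = "ATGGCC" + "GTAAGTTTAG" + "GATTAA"
toy_exons = [(1, 6), (17, 22)]
print(intron_signals(toy_genomic, toy_exons))
```

An alignment whose introns all come back canonical is the kind of configuration the heuristics favor; a non-canonical pair suggests the exon boundary may need to shift by a few bases.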

 

IMPLEMENTATION

sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}

 

A   Specifies the format of the output: exon endpoints only (A=0),

          exon endpoints and boundaries of the coding region (CDS) in the

          genomic sequence, when specified for the input mRNA (A=5),

          alignment text (A=1), alignment in lav-block format (A=2), or

          both exon endpoints and alignment text (A=3 or A=4)
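
A minimal Python sketch of scripting sim4 follows. The command builder follows the usage line above. The parser ASSUMES that each exon line of the A=0 output looks like "1-200  (1001-1200)   100% ->" (coordinates in the first sequence, then coordinates in the second sequence in parentheses, then percent identity). Check this against real output from your sim4 version before relying on it. The function names and the sample text are hypothetical.

```python
import re


def sim4_command(seqfile1, seqfile2, **options):
    """Build the argument list for: sim4 seqfile1 seqfile2 {OPTION=value}."""
    return ["sim4", seqfile1, seqfile2] + [f"{k}={v}" for k, v in options.items()]


# ASSUMED exon line layout: "1-200  (1001-1200)   100% ->"
EXON_LINE = re.compile(r"^(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%")


def parse_exons(text):
    """Extract exon endpoints from sim4 exon-endpoint output (assumed layout)."""
    exons = []
    for line in text.splitlines():
        match = EXON_LINE.match(line.strip())
        if match:
            s1_start, s1_end, s2_start, s2_end, identity = map(int, match.groups())
            exons.append({
                "seq1_start": s1_start, "seq1_end": s1_end,
                "seq2_start": s2_start, "seq2_end": s2_end,
                "identity": identity,
            })
    return exons


# Made-up sample text in the assumed layout (not real sim4 output).
sample_output = """
1-200  (1001-1200)   100% ->
201-350  (2001-2150)   98%
"""
print(sim4_command("seqfile1.fa", "seqfile2.fa", A=0))
print(parse_exons(sample_output))
```

To actually run the program, the list from sim4_command can be passed to subprocess.run(..., capture_output=True, text=True) and the captured stdout handed to parse_exons.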

 

A recent, powerful alternative to SIM4 is Spidey:

Wheelan et al., 2001, Genome Res. 11:1952-1957.

 

 

 

II.  GENSCAN

            --Burge and Karlin, 1997 JMB 268:78-94.

            --computational prediction of exons and gene structures

            --initially, programs were developed to predict genes based on

            1) promoters

            2) splice sites

            3) coding regions (ORFs).

            --later programs (FGENEH, GENMARK, ID, GRAIL) predict genes and gene

            structures by integrating multiple sources of information

            1) ORF, splice sites, promoters

            2) compositional properties of non coding and coding DNA

            3) homology searches to predict sets of spliceable exons=gene structures.

            --2 problems

            1) they assume that each input sequence contains only one gene.

            2) exon prediction accuracy is less than perfect, so the predicted protein can be nonsense

 

The power of Genscan

1) emphasizes known translational and transcriptional signals

2) uses a fifth-order Markov model of known coding regions (as opposed to particular homology searches), so it does not depend on a specific gene being present in a database

3) takes into account the influence of GC% composition on intron length

4) considers both strands

5) allows for more than one gene structure, and for partial gene structures, in a sequence.
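
To make point 2 concrete, here is a minimal Python sketch of a fifth-order Markov model: the probability of each base is conditioned on the five bases before it, and a sequence is scored by its log-odds under a "coding" model versus a "non-coding" model. This is a simplification and not Genscan's implementation: Genscan's coding model is additionally frame-specific (three-periodic), and the training strings, pseudocount, and function names below are made up for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        # Pseudocounts keep unseen contexts/bases from getting probability 0.
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=5):
    """Sum of log(P_coding / P_noncoding) over every base with a full context."""
    seq = seq.upper()
    return sum(
        math.log(coding(seq[i - order:i], seq[i]) / noncoding(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


# Toy training sets (made up, far too small for real use).
coding_model = train_markov(["ATGGCCGCTGCAGCCGCTGCA" * 5])
noncoding_model = train_markov(["ATATTTTAAATATTTTAAAT" * 5])

print(log_odds("GCCGCTGCAGCC", coding_model, noncoding_model))  # coding-like query
print(log_odds("ATTTTAAATA", coding_model, noncoding_model))    # non-coding-like query
```

A positive score means the sequence looks more like the coding training set, a negative score more like the non-coding set. Because the model is learned from bulk statistics of known coding regions, no individual database gene has to resemble the query.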

 

 

           

 

Sn= sensitivity (True positives/actual positives)

Sp= specificity (True positives/predicted positives)

AC=approximate correlation

CC=Correlation coefficient

WE= wrong exons

ME=missed exons
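
The measures above can be computed as in the following Python sketch. Sn and Sp follow the definitions given here. The formulas for CC and AC, and the treatment of ME and WE as fractions of true or predicted exons with no overlapping partner, are the definitions commonly used in gene-prediction evaluations and are an assumption about what the abbreviations mean in this lecture. All counts and coordinates in the example are made up.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level measures from counts of coding/non-coding bases."""
    sn = tp / (tp + fn)   # true positives / actual positives
    sp = tp / (tp + fp)   # true positives / predicted positives
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    return {"Sn": sn, "Sp": sp, "CC": cc, "AC": ac}


def exon_accuracy(true_exons, predicted_exons):
    """Exon-level measures; an exon counts as correct only if both ends match."""
    true_set, pred_set = set(true_exons), set(predicted_exons)
    exact = true_set & pred_set

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    missed = [t for t in true_set if not any(overlaps(t, p) for p in pred_set)]
    wrong = [p for p in pred_set if not any(overlaps(p, t) for t in true_set)]
    return {
        "Sn": len(exact) / len(true_set),
        "Sp": len(exact) / len(pred_set),
        "ME": len(missed) / len(true_set),
        "WE": len(wrong) / len(pred_set),
    }


# Made-up counts and coordinates for illustration.
print(nucleotide_accuracy(tp=80, fp=20, tn=880, fn=20))
print(exon_accuracy(
    true_exons=[(100, 200), (300, 400), (500, 600)],
    predicted_exons=[(100, 200), (300, 410), (700, 800)],
))
```

Note that the predicted exon (300, 410) overlaps a true exon without matching it exactly, so it counts against Sn and Sp but is neither a missed nor a wrong exon.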

 

 

Modifications/improvements in genscan:

GenomeScan: assigns a higher score to putative exons with a high BLASTX score

TwinScan: incorporates cross-species conservation into the gene-prediction model

 

 

 

 The most important restrictions are that only protein coding genes are considered (and not tRNA or rRNA genes, for example), and that transcription units are assumed to be non-overlapping.

 

 

IMPLEMENTATION

INPUT:  FASTA file name and parameter file (model)

OUTPUT:  exons, gene model, and optional PostScript file

 

usage: genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale]

 

       1) parfname : full pathname of parameter file

                  (for appropriate organism)

HumanIso.smat       for human/vertebrate sequences (also Drosophila)

Arabidopsis.smat    for Arabidopsis thaliana sequences

Maize.smat          for Zea mays sequences

--path in our case is  /mnt/raid/genetics/genscan/*

 

       2) seqfname : full pathname of sequence file

                  (FastA or minimal GenBank format)
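
A minimal Python sketch for assembling a genscan command from the usage line above. The flags come from that usage line. The helper names, the sequence file name my_sequence.fa, and the assumption that HumanIso.smat sits directly under /mnt/raid/genetics/genscan/ are hypothetical and should be adjusted to your own setup.

```python
import subprocess


def genscan_command(parfname, seqfname, verbose=False, cds=False,
                    subopt=None, ps=None):
    """Build: genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale]."""
    cmd = ["genscan", parfname, seqfname]
    if verbose:
        cmd.append("-v")
    if cds:
        cmd.append("-cds")
    if subopt is not None:
        cmd += ["-subopt", str(subopt)]
    if ps is not None:
        psfname, scale = ps   # PostScript file name and scale
        cmd += ["-ps", psfname, str(scale)]
    return cmd


def run_genscan(cmd):
    """Run genscan (must be installed and on PATH) and return its text output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Hypothetical file locations; adjust to where your files actually live.
cmd = genscan_command(
    "/mnt/raid/genetics/genscan/HumanIso.smat",
    "my_sequence.fa",
    cds=True,
    subopt=0.10,
)
print(" ".join(cmd))
# output = run_genscan(cmd)   # uncomment on a machine where genscan is installed
```

Building the argument list separately from running it makes it easy to print and check the exact command before launching a batch of predictions.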

 

OPTIONS

 

       -v       : verbose output (extra explanatory info)

 

       -cds     : print predicted coding sequences (nucleic acid)

 

       -subopt  : display suboptimal exons with P > cutoff (optional)

       cutoff   : suboptimal exon probability cutoff (minimum: 0.01, maximum: 0.99)

 

       -ps      : create Postscript output (optional)

       psfname  : filename for PostScript output

       scale    : scale for PostScript output (bp per line: can not be greater than 1/4th sequence