Lecture 12: Gene Identification and Characterization.

Gene 508, Evan Eichler, Ph.D. 03/03

OBJECTIVE:

The purpose of this lecture is to introduce you to programs that can be used for

1) the identification of the most likely physical structure for a gene (cDNA) based on genomic DNA comparison (sim4)

2) ab initio prediction of exons and genes from genomic DNA (genscan).

I. SIM4

--similarity based tool for aligning expressed sequence to genomic DNA

--BLAST defines “exon cores”: maximal gap-free exact matches

--exon cores are extended into adjacent “unmatched regions”

--heuristics are used to favor configurations that conform to splice-site recognition

--process is repeated under less stringent parameters
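
The first steps above can be illustrated with a small sketch. This is not sim4's actual code: it is a minimal seed-and-extend search, assuming a toy word size and made-up sequences, that finds maximal gap-free exact matches ("exon cores") and then checks whether the genomic gap between two cores begins with GT and ends with AG, the canonical splice-site dinucleotides. The function names exon_cores and intron_has_gt_ag are hypothetical.

```python
def exon_cores(cdna, genomic, word=12):
    """Return maximal gap-free exact matches as (cdna_start, genomic_start, length)."""
    # Index every word of length `word` in the genomic sequence.
    index = {}
    for j in range(len(genomic) - word + 1):
        index.setdefault(genomic[j:j + word], []).append(j)

    cores = set()
    for i in range(len(cdna) - word + 1):
        for j in index.get(cdna[i:i + word], []):
            # Extend the seed to the left along the diagonal, without gaps.
            a, b = i, j
            while a > 0 and b > 0 and cdna[a - 1] == genomic[b - 1]:
                a -= 1
                b -= 1
            # Extend the seed to the right in the same way.
            end_a, end_b = i + word, j + word
            while (end_a < len(cdna) and end_b < len(genomic)
                   and cdna[end_a] == genomic[end_b]):
                end_a += 1
                end_b += 1
            cores.add((a, b, end_a - a))
    return sorted(cores)


def intron_has_gt_ag(genomic, left_core, right_core):
    """Check the genomic gap between two cores for GT...AG boundaries."""
    start = left_core[1] + left_core[2]   # first genomic base after the left core
    end = right_core[1]                   # first genomic base of the right core
    intron = genomic[start:end]
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy data (made up): two exons separated by a short GT...AG intron.
exon1 = "ATGGCCAAGTTCGA"
exon2 = "CCTGAAGGCTTAA"
intron = "GTAAGTCTCTCTCTAG"
genomic = "TTTT" + exon1 + intron + exon2 + "GGGG"
cdna = exon1 + exon2

cores = exon_cores(cdna, genomic, word=8)
for left, right in zip(cores, cores[1:]):
    print(left, right, intron_has_gt_ag(genomic, left, right))
```

A real aligner also has to handle mismatches, sequencing errors, and the reverse strand, which this sketch deliberately leaves out.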

IMPLEMENTATION

sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}

A : specifies the format of the output: exon endpoints only (A=0), exon endpoints and boundaries of the coding region (CDS) in the genomic sequence, when specified for the input mRNA (A=5), alignment text (A=1), alignment in lav-block format (A=2), or both exon endpoints and alignment text (A=3 or A=4).
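
As a convenience, the sim4 command line described above can be assembled from a script. The sketch below only follows the usage line given in these notes (two sequence files followed by LETTER=value options); the file names mrna.fa and genomic.fa are hypothetical, and the helper names sim4_command and run_sim4 are made up for illustration.

```python
import subprocess

# Option letters taken from the usage line: sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}
SIM4_OPTIONS = set("WXKCRDAPNB")


def sim4_command(seqfile1, seqfile2, **options):
    """Build the argument list for a sim4 run, e.g. sim4_command(a, b, A=3)."""
    args = ["sim4", seqfile1, seqfile2]
    for name, value in options.items():
        if name not in SIM4_OPTIONS:
            raise ValueError(f"unknown sim4 option: {name}")
        # The notes list A=0 through A=5 as the available output formats.
        if name == "A" and value not in range(6):
            raise ValueError("A must be an integer from 0 to 5")
        args.append(f"{name}={value}")
    return args


def run_sim4(seqfile1, seqfile2, **options):
    """Run sim4 and return its standard output (requires sim4 on the PATH)."""
    cmd = sim4_command(seqfile1, seqfile2, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical file names; A=3 asks for exon endpoints plus alignment text.
print(" ".join(sim4_command("mrna.fa", "genomic.fa", A=3)))
```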

A more recent, powerful alternative to SIM4 is Spidey:

Wheelan et al., 2001, Genome Res. 11:1952-1957.

II. GENSCAN

--Burge and Karlin, 1997 JMB 268:78-94.

--computational prediction of exons and gene structures

--initially, programs were developed to predict genes based on

1) promoters

2) splice sites

3) coding regions (ORFs).
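
To make signal 3) concrete, here is a minimal ORF scan. It is only a sketch under simplifying assumptions: it looks at the three forward-strand reading frames, treats ATG as the only start codon, and ignores introns entirely, which is exactly why ORF scanning alone is a weak predictor in eukaryotic genomic DNA. The function name find_orfs and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # close the ORF
    return sorted(orfs)


# Toy sequence (made up) with one short ORF: ATG AAA TTT TAG
print(find_orfs("CCATGAAATTTTAGGG"))
```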

--later programs (FGENEH, GENMARK, ID, GRAIL) predict genes and gene structures by integrating multiple sources of information

1) ORF, splice sites, promoters

2) compositional properties of noncoding and coding DNA

3) homology searches to predict sets of spliceable exons=gene structures.

--2 problems

1) they assume that each input sequence contains only one gene.

2) exon prediction accuracy is less than perfect, so the predicted protein can be nonsense.

The power of Genscan

1) emphasizes known translational and transcriptional signals

2) uses a fifth-order Markov model of known coding regions (as opposed to particular homology searches), so it does not depend on a specific gene being present in the database

3) takes into account the influence of GC content (GC%) on intron length

4) considers both strands

5) allows for more than one gene structure and partial gene structure.
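
Point 2) can be illustrated with a toy fifth-order Markov model: the probability of each base is conditioned on the five bases before it, and a sequence is scored by the log-likelihood ratio between a model trained on coding sequence and one trained on noncoding sequence. This is a sketch only, not Genscan's model: Genscan's coding model is more elaborate (for example, it is sensitive to codon position), and the training strings, pseudocount, and function names below are assumptions for illustration.

```python
import math
from collections import defaultdict

ORDER = 5  # fifth-order: condition on the previous five bases


def train_markov(seqs, order=ORDER, pseudocount=1.0):
    """Count (context -> next base) transitions and return a probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})            # unseen contexts fall back to uniform
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=ORDER):
    """Sum of log P(base | previous `order` bases) over the sequence."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


def coding_score(seq, coding_prob, noncoding_prob):
    """Positive values mean the sequence looks more like the coding training set."""
    return log_likelihood(seq, coding_prob) - log_likelihood(seq, noncoding_prob)


# Made-up training data, far too small for real use.
coding = train_markov(["ATG" + "GCC" * 10])
noncoding = train_markov(["AT" * 15])
print(coding_score("GCC" * 4, coding, noncoding))
print(coding_score("AT" * 6, coding, noncoding))
```

Because the model is trained on the statistics of many coding regions rather than on any one database entry, it can score a gene that has no homolog in the database.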

Sn = sensitivity (true positives/actual positives)

Sp = specificity (true positives/predicted positives)

AC = approximate correlation

CC = correlation coefficient

WE = wrong exons

ME = missed exons
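
The measures above can be computed as follows. Sn and Sp follow the definitions given in these notes. The notes do not spell out CC, AC, ME, or WE, so the formulas below are assumptions: they are the definitions commonly used in gene-prediction benchmarks (e.g. Burset and Guigó, 1996), with ME taken as the fraction of actual exons that overlap no predicted exon and WE as the fraction of predicted exons that overlap no actual exon. All counts and coordinates are made up.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level accuracy from true/false positive/negative counts."""
    sn = tp / (tp + fn)                      # true positives / actual positives
    sp = tp / (tp + fp)                      # true positives / predicted positives
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Average conditional probability, rescaled to the range -1..1.
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2
    return {"Sn": sn, "Sp": sp, "CC": cc, "AC": ac}


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]


def exon_errors(actual, predicted):
    """Fractions of missed exons (ME) and wrong exons (WE)."""
    missed = sum(not any(overlaps(a, p) for p in predicted) for a in actual)
    wrong = sum(not any(overlaps(p, a) for a in actual) for p in predicted)
    return {"ME": missed / len(actual), "WE": wrong / len(predicted)}


# Made-up counts: 100 coding bases, 80 of them predicted, 20 false calls.
print(nucleotide_accuracy(tp=80, fp=20, tn=880, fn=20))
print(exon_errors(actual=[(0, 100), (200, 300), (400, 500)],
                  predicted=[(0, 100), (210, 300), (600, 700)]))
```

Note that Sp here is what other fields call precision; the definition matches the one given above.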

Modifications/improvements in genscan:

GenomeScan: assigns a higher score to putative exons with a high BLASTX score

TwinScan: incorporates cross-species conservation as part of the gene-prediction model

The most important restrictions are that only protein coding genes are considered (and not tRNA or rRNA genes, for example), and that transcription units are assumed to be non-overlapping.

IMPLEMENTATION

INPUT: Fasta file name and model

OUTPUT: exon, gene model and PostScript file

usage: genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale]

1) parfname : full pathname of parameter file

(for appropriate organism)

HumanIso.smat for human/vertebrate sequences (also Drosophila)

Arabidopsis.smat for Arabidopsis thaliana sequences

Maize.smat for Zea mays sequences

--path in our case is /mnt/raid/genetics/genscan/*

2) seqfname : full pathname of sequence file

(FastA or minimal GenBank format)
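
The usage line above can also be assembled from a script. The sketch below follows only the syntax given in these notes, including the 0.01 to 0.99 range stated for the -subopt cutoff under OPTIONS. The helper name genscan_command is made up, the sequence file name is hypothetical, and the parameter file location is an assumption that combines the local path and the file name listed above.

```python
def genscan_command(parfname, seqfname, verbose=False, cds=False,
                    subopt=None, ps=None):
    """Build: genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale]."""
    args = ["genscan", parfname, seqfname]
    if verbose:
        args.append("-v")
    if cds:
        args.append("-cds")
    if subopt is not None:
        # Cutoff limits as stated in the notes.
        if not 0.01 <= subopt <= 0.99:
            raise ValueError("subopt cutoff must be between 0.01 and 0.99")
        args += ["-subopt", str(subopt)]
    if ps is not None:
        psfname, scale = ps              # PostScript file name and bp per line
        args += ["-ps", psfname, str(scale)]
    return args


# Assumed locations: adjust both paths to your own setup.
cmd = genscan_command("/mnt/raid/genetics/genscan/HumanIso.smat",
                      "my_sequence.fa", cds=True, subopt=0.1)
print(" ".join(cmd))
# To run it where genscan is installed:
#   import subprocess
#   output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```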

OPTIONS

-v : verbose output (extra explanatory info)

-cds : print predicted coding sequences (nucleic acid)

-subopt : display suboptimal exons with P > cutoff (optional)

cutoff : suboptimal exon probability cutoff (minimum: 0.01, maximum: 0.99)

-ps : create Postscript output (optional)

psfname : filename for PostScript output

scale : scale for PostScript output (bp per line; cannot be greater than 1/4 of the sequence