Gene 508, Evan Eichler, Ph.D. 03/03
OBJECTIVE:
The purpose of this lecture is to introduce you to programs that can be used for
1) the identification of the most likely physical structure for a gene (cDNA) based on genomic DNA comparison (sim4)
2) ab initio prediction of exons and genes from genomic DNA (genscan).
--sim4: similarity-based tool for aligning expressed sequence (cDNA) to genomic DNA
--BLAST defines “exon cores”: maximal gap-free exact matches
--exon cores are extended into the “unmatched regions”
--heuristics are used to favor configurations that conform to splice site recognition
--the process is repeated under less stringent parameters
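The "exon core" step above can be illustrated with a toy seed-and-extend sketch. This is not sim4's actual code: the function name `exon_cores`, the word size `k`, and the example sequences are assumptions made for illustration, and the later steps (extension into unmatched regions, splice-site heuristics, relaxed re-search) are not shown.

```python
def exon_cores(cdna, genomic, k=8):
    """Return maximal gap-free exact matches as (cdna_start, genomic_start, length).

    Toy seed-and-extend: index every k-mer of the genomic sequence, look up
    each cDNA k-mer, and extend each hit in both directions while bases match.
    """
    # Index the start positions of every k-mer in the genomic sequence.
    index = {}
    for g in range(len(genomic) - k + 1):
        index.setdefault(genomic[g:g + k], []).append(g)

    cores = set()
    for c in range(len(cdna) - k + 1):
        for g in index.get(cdna[c:c + k], []):
            # Extend the exact match to the left.
            cs, gs = c, g
            while cs > 0 and gs > 0 and cdna[cs - 1] == genomic[gs - 1]:
                cs -= 1
                gs -= 1
            # Extend the exact match to the right.
            ce, ge = c + k, g + k
            while ce < len(cdna) and ge < len(genomic) and cdna[ce] == genomic[ge]:
                ce += 1
                ge += 1
            # Seeds inside the same match collapse to one entry in the set.
            cores.add((cs, gs, ce - cs))
    return sorted(cores)


# Hypothetical two-exon gene: the intron starts with GT and ends with AG.
exon1 = "ATGGCCATTGTAATC"
intron = "GTAAGTCCCCCCCCCCTTTTAG"
exon2 = "CGCCGCTGAAAGGGTGA"
cdna = exon1 + exon2
genomic = "TTTT" + exon1 + intron + exon2 + "AAAA"

print(exon_cores(cdna, genomic))
# prints: [(0, 4, 15), (15, 41, 17)]
```

In real data the bases flanking a junction often match by chance, so the ends of a core are ambiguous by a few bases; that is where the splice-site heuristics come in.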
IMPLEMENTATION
sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}
A Specifies the format of the output: exon endpoints only (A=0),
exon endpoints and boundaries of the coding region (CDS) in the
genomic sequence, when specified for the input mRNA (A=5),
alignment text (A=1), alignment in lav-block format (A=2), or
both exon endpoints and alignment text (A=3 or A=4).
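As a small illustration of the synopsis above, the sketch below assembles a sim4 command line with the A option. The helper name `sim4_command` and the file names are hypothetical; only the `sim4 seqfile1 seqfile2 A=value` form and the A values 0 to 5 come from the notes. Which of the two files should hold the cDNA and which the genomic sequence is not stated here and should be checked against the sim4 documentation. The command is only built and printed, not executed.

```python
import shlex

# A values described in the notes above (0 through 5).
VALID_OUTPUT_FORMATS = {0, 1, 2, 3, 4, 5}


def sim4_command(seqfile1, seqfile2, output_format=3):
    """Build the argument list for: sim4 seqfile1 seqfile2 A=<output_format>."""
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(f"A must be one of {sorted(VALID_OUTPUT_FORMATS)}")
    return ["sim4", seqfile1, seqfile2, f"A={output_format}"]


# Hypothetical file names; A=3 requests exon endpoints plus alignment text.
print(shlex.join(sim4_command("cdna.fa", "genomic.fa", output_format=3)))
# prints: sim4 cdna.fa genomic.fa A=3
```

The returned list can be passed to `subprocess.run` on a machine where sim4 is installed.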
A recent, powerful alternative to sim4 is Spidey:
Wheelan et al., 2001, Genome Res. 11:1952-1957.
--Genscan: Burge and Karlin, 1997, JMB 268:78-94.
--computational prediction of exons and gene structures
--initially, programs were developed to predict genes based on
1) promoters
2) splice sites
3) coding regions (ORFs).
--later programs (FGENEH, GENMARK, ID, GRAIL) predict genes and gene
structures by integrating multiple sources of information
1) ORF, splice sites, promoters
2) compositional properties of non coding and coding DNA
3) homology searches to predict sets of spliceable exons=gene structures.
--2 problems with these programs
1) they assume that each input sequence contains only one gene model.
2) exon accuracy is less than perfect, so the predicted protein can be nonsense.
The power of Genscan
1) emphasizes known translational and transcriptional signals
2) uses a fifth-order Markov model of known coding regions (as opposed to particular homology searches), so it does not depend on a specific gene being in the database
3) takes into account the influence of GC content (GC%) on intron length
4) considers both strands
5) allows for more than one gene structure and partial gene structure.
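To make point 2 concrete, here is a minimal sketch of a fifth-order Markov model: the probability of each base is conditioned on the five bases before it, with the probabilities estimated from known coding sequences. It is a simplification and not Genscan's actual model or parameters; the function names, the pseudocount, and the toy training data are assumptions, and the sketch ignores codon position.

```python
import math
from collections import defaultdict

ORDER = 5          # each base is conditioned on the 5 preceding bases
ALPHABET = "ACGT"


def train_markov(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        # The pseudocount keeps unseen transitions from getting probability 0.
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=ORDER):
    """Sum of log P(base | context) over all positions that have a full context."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


# Toy "coding" training set (hypothetical): one repeated hexamer.
coding_prob = train_markov(["ATGGCC" * 10])

# A sequence resembling the training data scores higher than one that does not.
print(log_likelihood("ATGGCCATGGCC", coding_prob) >
      log_likelihood("ATGGCAATGGCT", coding_prob))
# prints: True
```

In a gene finder, a score like this for the coding model would be compared with the score of the same sequence under a non-coding model.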
Sn = sensitivity (true positives / actual positives)
Sp = specificity (true positives / predicted positives)
AC = approximate correlation
CC = correlation coefficient
WE = wrong exons
ME = missed exons
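The accuracy measures above can be computed from simple counts. The sketch below follows the Sn and Sp definitions given here; the CC and AC formulas, and the convention that a "missed" exon is an actual exon overlapped by no prediction and a "wrong" exon is a predicted exon overlapping no actual exon, are the forms commonly used in gene-prediction benchmarks and are assumptions as far as these notes are concerned. The function names and example numbers are made up, and nonzero denominators are assumed.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level accuracy from counts of coding (positive) bases."""
    sn = tp / (tp + fn)   # true positives / actual positives
    sp = tp / (tp + fp)   # true positives / predicted positives
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumed AC form: average of four conditional probabilities,
    # rescaled from [0, 1] to [-1, 1].
    acp = (tp / (tp + fn) + tp / (tp + fp)
           + tn / (tn + fp) + tn / (tn + fn)) / 4
    ac = (acp - 0.5) * 2
    return {"Sn": sn, "Sp": sp, "CC": cc, "AC": ac}


def overlaps(a, b):
    """True if two (start, end) intervals with inclusive ends share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_errors(predicted, actual):
    """Return (ME, WE) as fractions of actual and predicted exons."""
    missed = sum(1 for a in actual
                 if not any(overlaps(a, p) for p in predicted))
    wrong = sum(1 for p in predicted
                if not any(overlaps(p, a) for a in actual))
    return missed / len(actual), wrong / len(predicted)


# Made-up counts: 100 coding bases, 80 of them predicted, 20 false positives.
scores = nucleotide_accuracy(tp=80, fp=20, tn=880, fn=20)
print(round(scores["Sn"], 2), round(scores["Sp"], 2))
# prints: 0.8 0.8

# Made-up exon coordinates: one actual exon is missed, one prediction is wrong.
actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 800)]
me, we = exon_errors(predicted, actual)
```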
Modifications/improvements to Genscan:
GenomeScan: assigns a higher score to putative exons with a high BLASTX score
TwinScan: incorporates cross-species conservation as part of the gene-prediction model
The most important restrictions of Genscan are that only protein-coding
genes are considered (and not tRNA or rRNA genes, for example), and that
transcription units are assumed to be non-overlapping.
INPUT: FASTA file name and parameter (model) file
OUTPUT: exons, gene models, and (optionally) a PostScript file
usage: genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale]
parfname : full pathname of parameter file
(for appropriate organism)
HumanIso.smat for human/vertebrate sequences (also Drosophila)
Arabidopsis.smat for Arabidopsis thaliana sequences
Maize.smat for Zea mays sequences
--path in our case is /mnt/raid/genetics/genscan/*
seqfname : full pathname of sequence file
(FastA or minimal GenBank format)
-v : verbose output (extra explanatory info)
-cds : print predicted coding sequences (nucleic acid)
-subopt : display suboptimal exons with P > cutoff (optional)
cutoff : suboptimal exon probability cutoff (minimum: 0.01, maximum: 0.99)
-ps : create Postscript output (optional)
psfname : filename for PostScript output
scale : scale for PostScript output (bp per line: can not be greater than 1/4th sequence
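
The usage line above can be turned into a concrete command as in the sketch below. The helper name `genscan_command` and the sequence file name `myseq.fa` are hypothetical, and the parameter file path assumes that HumanIso.smat sits directly under the class directory /mnt/raid/genetics/genscan/, which should be verified. The flags and the cutoff range are taken from the usage text. The command is only built and printed, not executed.

```python
import shlex


def genscan_command(parfname, seqfname, verbose=False, cds=False,
                    subopt=None, ps=None):
    """Build the argument list for:
    genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale]
    """
    args = ["genscan", parfname, seqfname]
    if verbose:
        args.append("-v")
    if cds:
        args.append("-cds")
    if subopt is not None:
        # Range as given in the usage text above.
        if not 0.01 <= subopt <= 0.99:
            raise ValueError("cutoff must be between 0.01 and 0.99")
        args += ["-subopt", str(subopt)]
    if ps is not None:
        psfname, scale = ps   # PostScript file name and bp per line
        args += ["-ps", psfname, str(scale)]
    return args


# Assumed location of the human/vertebrate parameter file (see lead-in).
cmd = genscan_command("/mnt/raid/genetics/genscan/HumanIso.smat",
                      "myseq.fa", cds=True, subopt=0.1)
print(shlex.join(cmd))
# prints: genscan /mnt/raid/genetics/genscan/HumanIso.smat myseq.fa -cds -subopt 0.1
```

The list form can be handed to `subprocess.run`, with standard output redirected to a file to keep the exon and gene model predictions.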