Gene 508, Evan
Eichler, Ph.D. 01/2003
Using Raw Trace Data.
-Most sequence is generated by fluorescent sequencing—in which both bases and base quality may be assigned.
--typically most data (small-scale laboratory, large-scale genome projects, private industry) is generated by an unpublished algorithm from ABI (Applied Biosystems) which takes into account peak height, peak spacing and assigns N in the absence of a good choice.
--CHROMATOGRAM TRACE PROPERTIES
1) 1st 50 bases are noisy due to unreacted dye-primer or dye-terminator molecules or primer dimmer interference
2) End of trace, less evenly spaced due to diffusion effects and decrease in the relative mass difference between the sequence products—lumpy peaks.
3) Compressions of sequence particularly in GC rich sequence—effect more pronounced for dye-terminator vs. dye-primer.
--PHRED/PHRAP/CONSED are a collection of programs developed by Phil Green that specifically define an algorithm that assigns an error probability to each base related to its trace quality, an error probability to the consensus sequence and allows viewing and editing of aligned consensus.
--IMPORTANCE.
1) Access to primary data without human interpretation.
2) Allows to generate meaningful consensus (cDNA assembly or genome assembly)
3) Cost efficient--Allows all primary data to be accessed to make meaningful interpretation of results.
4) Provides a systematic method for analysis and storage of data.
I. PHRED
PHRED—Phil Green’s base
calling software.
--INPUT: ABI trace files or SCF (Standard
Chromatogram Format).
--OUTPUT: Phd files (phred scored base calls)
--Four steps:
1)
Predicts idealized (expected) peaks (amplitudes) based effectively on the best
region of the trace.
2)
Identifies observed peaks
3)
Compares observed and expected peaks (divides the peaks into matched and
unmatched).
4)
Unmatched peaks are analyzed for any peak that could be called, but was not
called in step 3.
--Takes into account peak
height, peak spacing and potential compressions.
PHRED QUALITY VALUES
--Log-transformed error
probabilities using an algorithm that
1) Uses parameters of a trace over a window rather than
single peaks
2)
discriminates within the high quality part of the trace instead of over the
entire window.
3)
that has been validated from a training set (real data).
PHRED SCORE
q= -10 x log10 (p)
where q is the quality score (PHRED value)
p is the estimated error probability of that call.
eg. PHRED score of 20 = 1/100 chance of being
incorrect
30 = 1/1000 change of being incorrect
II. PHRAP
PHRAP—Phil Green’s
assembly software.
INPUT: Phd files.
OUTPUT: ace (assembly) files
--assembly of sequencing
reads uses
1)
cross match algorithm—modified Smith-Waterman-Gotoh alogrithm for assembly
2)
consensus error probabilities based on the sum of individual base
probabilities.
PHRAP SCORE
--expected number of all
errors in a consensus sequence is the sum of individual base probability
errors.
--If two aligned bases
q=15 and q=25 and are in agreement then sum = PHRAP score.
--Average phrap score for
an assembly i.e. 40 (predicted error in
consensus of 1in 10,000 bases).
Consed—a graphical tool for viewing and editing sequence assembly data
INPUT: Requires three types of files
1) Chromatogram files (chromat_dir) original trace data
2) Phd files (phd_dir) (Phred quality data)
3) Ace files (contig_dir) generated by Phrap
OUTPUT: GUI that allows analysis and manipulation of output.
Quality tags —white, upper case = high quality
Mismatch tags --indicated in red
Navigation to high quality and low quality discrepancy
Compare and merge contigs of sequence data
Standard fasta output as text.
This tutorial was kindly provided by Dr. Gabor Marth, a more detailed description may be found at.
http://stein.cshl.org/genome_informatics/BasicSequenceManipulation/pages/sequencingInformatics.html
1) Make four directories labelled (within another directory called CLONE):
chromat_dir edit_dir phd_dir
2) Place chromatogram files into chromat_dir
3) Place vector.fasta for vector clipping into edit_dir
4) To construct base quality values—within your chromat_dir directory run the phred program using the specified parameters. Phd files will be created and placed into the phd_dir.
phred
-id . –p –pd ../phd_dir
5)Within edit dir, now run the program phd2fasta (using settings below). This will create two files. A fasta file of all reads called Clone.fasta and a second fasta file that contains the base quality values associated with each nucleotide.
phd2fasta -id ../phd_dir -os CLONE.fasta
-oq CLONE.fasta.qual
4) Within edit dir, now run the program cross_match (using the specified settings below). This will effectively screen the CLONE.fasta file for vector sequence (by crossmatch against vector.fasta). This is the vector clipping stage. The output will be CLONE.fasta.screen
cross_match -minmatch
12 -penalty -2 -minscore 20 -screen CLONE.fasta vector.fasta
5) Sequence Assembly: Within edit_dir, now launch the phrap sequence assembler. First, generate a copy of the CLONE.fasta.qual file and call it CLONE.fasta.screen.qual
cp CLONE.fasta.qual CLONE.fasta.screen.qual
In order to run PHRAP it requires a set of input fasta files (in this case CLONE.fasta.screen) and a set of quality data (just created called CLONE.fasta.screen.qual).
Run the assembly
phrap -new_ace CLONE.fasta.screen
A number of files are generated including fasta files CLONE.fasta.screen.contigs (fasta format of the contigs generated by cross-match) and an ace file that allows for viewing of the assembly.
6) Run Consed and select the ace file to view and edit your assembly.
1) consedrun (Carl Kashuk)
--C script that combines phred/phrap/consedrun into a single command line option
--All chromat files must be in a directory
--vector clipping is done with a different library supervector.fasta
Options
usage: consedrun dirname
-d clean out phd_dir, poly_dir, .ace files first
-pd don't run phred
-ph don't run phrap (useless!)
-q ignore phred quality scores
-pp don't run polyphred
-c don't run consed
-r consed only (overrides other params)
-a use old ace file format
-forcelevel (0-10) phrap param to attempt to force contigs together
-rank (1-6) polyphred minimum rank to report (default 3)
-ratio (0-1) polyphred ratio (default 0.65)
-background (0-1) polyphred back (default 0.25)
-v mask sequences with supervector
(also looks for vectors.fasta in edit_dir)
-tf name.fasta[,left,right] use name as target file
-tg name.genbank[,left,right] use name as target file
-tu uid[,left,right] use uid as target file
(also looks for dirname.target.fasta to use as target)
-findPolys target.fasta list.txt
-findPolys target.fasta numeric
-findPolys target.fasta alpha
-findPolysG accession or uid list.txt
-findPolysG accession or uid numeric
-findPolysG accession or uid alpha
-fpQual 20 quality threshold
-fpCons consensus mismatches only
findPolys creates a file dirname.nav which lists polymorphic sites vs
target.fasta. polyPhred results are incorporated when available.
reads can be sorted in the .nav file with the numeric, alpha, or list.txt
options. editing polymorphism tags through consed will be saved
in this .nav file, and will eventually be used to record
polymorphisms into databases
2) TREV
Opening a trace file
Trace files can be opened either on the command line or from within Trev.
Opening a trace file from the command line
usage: trev [-{ABI,ALF,EXP,SCF,PLN,Any}] [-edits value] [-restrict] [tracefilename]
-ABI, -ALF, -EXP, -SCF, -PLN, -Any
Optional. Defaults to Any. These define the possible input trace formats available. Currently these are 'ABI', 'ALF', experiment (see section Experiment File), 'SCF' (see section scf),
plain ASCII text or 'any' whereby the program attempts to establish the file format from information contained within the trace file.
-edits value
Optional. Defaults to 1. If value is 1, the trace sequence can be edited. If value is 0, no edit line is displayed in Trev and the sequence may not be edited.
-restrict
Optional. Restricts the use of the trace editor to a single file by disabling the ability to open another file from within Trev. The main use of this option is for calling Trev from within
scripts.
-xmag Optional. Defaults to 150. Specifies the magnification along the X axis of the trace. Larger values represent higher magnifications.
-ymag Optional. Defaults to 10. Specifies the magnification along the Y axis of the trace. The value should be between 10 and 100 with 10 showing all the trace and 100 being the largest magnification.