Lecture 6: Sequence Assembly: PHRED/PHRAP/CONSED

Gene 508, Evan Eichler, Ph.D. 01/2003

Using Raw Trace Data.

-Most sequence is generated by fluorescent sequencing—in which both bases and base quality may be assigned.

--typically most data (small-scale laboratory, large-scale genome projects, private industry) is generated by an unpublished algorithm from ABI (Applied Biosystems) which takes into account peak height, peak spacing and assigns N in the absence of a good choice.

--CHROMATOGRAM TRACE PROPERTIES

1) 1^st 50 bases are noisy due to unreacted dye-primer or dye-terminator molecules or primer dimmer interference

2) End of trace, less evenly spaced due to diffusion effects and decrease in the relative mass difference between the sequence products—lumpy peaks.

3) Compressions of sequence particularly in GC rich sequence—effect more pronounced for dye-terminator vs. dye-primer.

--PHRED/PHRAP/CONSED are a collection of programs developed by Phil Green that specifically define an algorithm that assigns an error probability to each base related to its trace quality, an error probability to the consensus sequence and allows viewing and editing of aligned consensus.

--IMPORTANCE.

1) Access to primary data without human interpretation.

2) Allows to generate meaningful consensus (cDNA assembly or genome assembly)

3) Cost efficient--Allows all primary data to be accessed to make meaningful interpretation of results.

4) Provides a systematic method for analysis and storage of data.

I. PHRED

PHRED—Phil Green’s base calling software.

--INPUT: ABI trace files or SCF (Standard Chromatogram Format).

--OUTPUT: Phd files (phred scored base calls)

--Four steps:

1) Predicts idealized (expected) peaks (amplitudes) based effectively on the best region of the trace.

2) Identifies observed peaks

3) Compares observed and expected peaks (divides the peaks into matched and unmatched).

4) Unmatched peaks are analyzed for any peak that could be called, but was not called in step 3.

--Takes into account peak height, peak spacing and potential compressions.

PHRED QUALITY VALUES

--Log-transformed error probabilities using an algorithm that

1) Uses parameters of a trace over a window rather than single peaks

2) discriminates within the high quality part of the trace instead of over the entire window.

3) that has been validated from a training set (real data).

PHRED SCORE

q= -10 x log₁₀ (p)

where q is the quality score (PHRED value)

p is the estimated error probability of that call.

eg. PHRED score of 20 = 1/100 chance of being incorrect

30 = 1/1000 change of being incorrect

II. PHRAP

PHRAP—Phil Green’s assembly software.

INPUT: Phd files.

OUTPUT: ace (assembly) files

--assembly of sequencing reads uses

1) cross match algorithm—modified Smith-Waterman-Gotoh alogrithm for assembly

2) consensus error probabilities based on the sum of individual base probabilities.

PHRAP SCORE

--expected number of all errors in a consensus sequence is the sum of individual base probability errors.

--If two aligned bases q=15 and q=25 and are in agreement then sum = PHRAP score.

--Average phrap score for an assembly i.e. 40 (predicted error in consensus of 1in 10,000 bases).

III. CONSED

Consed—a graphical tool for viewing and editing sequence assembly data

INPUT: Requires three types of files

1) Chromatogram files (chromat_dir) original trace data

2) Phd files (phd_dir) (Phred quality data)

3) Ace files (contig_dir) generated by Phrap

OUTPUT: GUI that allows analysis and manipulation of output.

Quality tags —white, upper case = high quality

Mismatch tags --indicated in red

Navigation to high quality and low quality discrepancy

Compare and merge contigs of sequence data

Standard fasta output as text.

IMPLEMENTATION

This tutorial was kindly provided by Dr. Gabor Marth, a more detailed description may be found at.

http://stein.cshl.org/genome_informatics/BasicSequenceManipulation/pages/sequencingInformatics.html

1) Make four directories labelled (within another directory called CLONE):

chromat_dir edit_dir phd_dir

2) Place chromatogram files into chromat_dir

3) Place vector.fasta for vector clipping into edit_dir

4) To construct base quality values—within your chromat_dir directory run the phred program using the specified parameters. Phd files will be created and placed into the phd_dir.

phred -id . –p –pd ../phd_dir

5)Within edit dir, now run the program phd2fasta (using settings below). This will create two files. A fasta file of all reads called Clone.fasta and a second fasta file that contains the base quality values associated with each nucleotide.

phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual

4) Within edit dir, now run the program cross_match (using the specified settings below). This will effectively screen the CLONE.fasta file for vector sequence (by crossmatch against vector.fasta). This is the vector clipping stage. The output will be CLONE.fasta.screen

cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta vector.fasta

5) Sequence Assembly: Within edit_dir, now launch the phrap sequence assembler. First, generate a copy of the CLONE.fasta.qual file and call it CLONE.fasta.screen.qual

cp CLONE.fasta.qual CLONE.fasta.screen.qual

In order to run PHRAP it requires a set of input fasta files (in this case CLONE.fasta.screen) and a set of quality data (just created called CLONE.fasta.screen.qual).

Run the assembly

phrap -new_ace CLONE.fasta.screen

A number of files are generated including fasta files CLONE.fasta.screen.contigs (fasta format of the contigs generated by cross-match) and an ace file that allows for viewing of the assembly.

6) Run Consed and select the ace file to view and edit your assembly.

COMPANION SCRIPTS

1) consedrun (Carl Kashuk)

--C script that combines phred/phrap/consedrun into a single command line option

--All chromat files must be in a directory

--vector clipping is done with a different library supervector.fasta

Options

usage: consedrun dirname

-d clean out phd_dir, poly_dir, .ace files first

-pd don't run phred

-ph don't run phrap (useless!)

-q ignore phred quality scores

-pp don't run polyphred

-c don't run consed

-r consed only (overrides other params)

-a use old ace file format

-forcelevel (0-10) phrap param to attempt to force contigs together

-rank (1-6) polyphred minimum rank to report (default 3)

-ratio (0-1) polyphred ratio (default 0.65)

-background (0-1) polyphred back (default 0.25)

-v mask sequences with supervector

(also looks for vectors.fasta in edit_dir)

-tf name.fasta[,left,right] use name as target file

-tg name.genbank[,left,right] use name as target file

-tu uid[,left,right] use uid as target file

(also looks for dirname.target.fasta to use as target)

-findPolys target.fasta list.txt

-findPolys target.fasta numeric

-findPolys target.fasta alpha

-findPolysG accession or uid list.txt

-findPolysG accession or uid numeric

-findPolysG accession or uid alpha

-fpQual 20 quality threshold

-fpCons consensus mismatches only

findPolys creates a file dirname.nav which lists polymorphic sites vs

target.fasta. polyPhred results are incorporated when available.

reads can be sorted in the .nav file with the numeric, alpha, or list.txt

options. editing polymorphism tags through consed will be saved

in this .nav file, and will eventually be used to record

polymorphisms into databases

2) TREV

Opening a trace file

Trace files can be opened either on the command line or from within Trev.

Opening a trace file from the command line

usage: trev [-{ABI,ALF,EXP,SCF,PLN,Any}] [-edits value] [-restrict] [tracefilename]

-ABI, -ALF, -EXP, -SCF, -PLN, -Any

Optional. Defaults to Any. These define the possible input trace formats available. Currently these are 'ABI', 'ALF', experiment (see section Experiment File), 'SCF' (see section scf),

plain ASCII text or 'any' whereby the program attempts to establish the file format from information contained within the trace file.

-edits value

Optional. Defaults to 1. If value is 1, the trace sequence can be edited. If value is 0, no edit line is displayed in Trev and the sequence may not be edited.

-restrict

Optional. Restricts the use of the trace editor to a single file by disabling the ability to open another file from within Trev. The main use of this option is for calling Trev from within

scripts.

-xmag Optional. Defaults to 150. Specifies the magnification along the X axis of the trace. Larger values represent higher magnifications.

-ymag Optional. Defaults to 10. Specifies the magnification along the Y axis of the trace. The value should be between 10 and 100 with 10 showing all the trace and 100 being the largest magnification.