Lecture 13: Integrated Genomic Analysis: UCSC Browser

Gene 508, Evan Eichler, Ph.D. and Royden Clark 03/03

OBJECTIVE:

The purpose of this lecture is

1) To provide an overview of the UCSC genome browser and its value as a coordinated system for cross-referencing and integrating disparate sources of genomic data and analyses

2) To allow you to incorporate results from your analysis as a custom track.

Reading:

Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ. The UCSC Genome Browser Database. Nucleic Acids Res. 2003 Jan 1;31(1):51-4.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006.

I. DATABASES

How will the information from various genome projects be archived, curated, integrated and analyzed?  How can the average researcher extract information from the available databases to design better experiments?

Primary databases:  Repositories of raw data with little or no annotation; error probabilities are well defined.  E.g., trace archives from NCBI.

Secondary databases:  Repositories that contain analyzed and annotated data.  Errors are highly dependent on the methodology.  E.g., UniGene clusters, expression array data, RefSeq.

Integrated databases:  Repositories that cross-reference and integrate multiple secondary databases.  E.g., the UCSC, Ensembl and NCBI browsers.

*Genomics is quickly becoming a labyrinth of different databases.

*While the number of secondary databases that can be constructed is infinite, the genome upon which they are constructed is finite.

*The goal is to navigate skillfully between these different databases to address biological questions experimentally and computationally.

II.  UCSC BROWSER

Examples of Tracks

a)      Known genes: protein-coding genes and their corresponding mRNAs, placed by BLAT (>98% identity, best placement). For alternatively spliced genes, the longest hit and best match are used.

b)      RefSeq: same as above, except based on the set from LocusLink.

c)      mRNA: BLAT alignments of mRNAs in GenBank; all alignments within 1% of the best alignment are kept, with >95% sequence identity.

d)      ESTs: all ESTs from GenBank; same as mRNA except intron max of >500 kb and >93% identity.

e)      STS: genetic maps (Genethon, Marshfield, deCODE), radiation hybrid maps (Stanford, WhiteheadRH) or YAC mapping (Whitehead map).

A. UCSC Genome Browser – What is it?

The Genome Browser stacks annotation tracks from a variety of sources beneath genome coordinate positions, allowing rapid visual correlation of different types of information. The user can look at a whole chromosome to get a feel for gene density, open a specific cytogenetic band to see a positionally mapped disease gene candidate, or zoom in to a particular gene to view its spliced ESTs and possible alternative splicing. The Genome Browser itself does not draw conclusions; rather, it collates all relevant information in one location, leaving the exploration and interpretation to the user.

B. What does this mean to me?

This means that with one tool you can view the results of a collaborative effort involving thousands of people from the international biomedical research community. If your research involves one of the species that has been incorporated into the browser (presently Mouse, Rat and Human), then you can easily visualize the research others have done in the genomic regions that interest you. Not only can you visualize the work of others, but with only a trivial amount of effort you can also add your own results and make them available to others.

C. How can I provide my information to others?

There are two methods that can be used to share data.

1. Temporary Tracks

2. Permanent Tracks

D. What is a Temporary Track and how do I create one?

Temporary Tracks allow you to place data on the browser to visualize and compare your data with other research that has been done on the underlying sequence. The data persists only while the browser is open; it is temporary in nature because it has to be reloaded each time you visit the browser.

Creating a custom annotation track

Genome Browser annotation tracks are based on files in line-oriented format. Each line in the file defines a display characteristic for the annotation track or defines a data item within the track. Annotation files contain 3 types of lines: browser lines, track lines, and data lines. Empty lines and lines starting with # in the annotation file are ignored.
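The three line types can be told apart mechanically. The following is a minimal Python sketch, not the browser's own parser; the function name classify_line and the rule of deciding by the first word on the line are assumptions made for illustration.

```python
def classify_line(line: str) -> str:
    """Return 'ignored', 'browser', 'track', or 'data' for one annotation-file line.

    Assumption: the first whitespace-separated word decides the line type.
    """
    stripped = line.strip()
    # Empty lines and lines starting with '#' are ignored.
    if not stripped or stripped.startswith("#"):
        return "ignored"
    first_word = stripped.split()[0]
    if first_word == "browser":
        return "browser"
    if first_word == "track":
        return "track"
    # Anything else is treated as a data line (BED, PSL, GFF, or GTF).
    return "data"


example = [
    "# my custom track",
    "browser position chr22:1000-6000",
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512",
    "",
]
kinds = [classify_line(line) for line in example]
print(kinds)
```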

To construct an annotation file and display it in the Genome Browser, follow these steps:

Step 1. Format the data set

Format your data set as a tab-separated file using one of the formats supported by the Genome Browser. Annotation data can be in standard GFF, GTF, PSL, or BED format. You may include more than one data set in your annotation file. However, all of the data lines for a given annotation track must be in the same format.

Step 2. Define the Genome Browser display characteristics

Add one or more optional browser lines to the beginning of your formatted data file to configure the overall display of the Genome Browser when it initially displays your annotation data. Browser lines allow you to configure such things as the genome position that the browser will initially open to, the width of the display, and the configuration of the other annotation tracks that are shown (or hidden) in the initial display. NOTE: If the browser position is not explicitly set in the annotation file, the initial display will default to the position setting most recently used by the user, which may not be an appropriate position for viewing the annotation track.

Step 3. Define the annotation track display characteristics

Following the browser lines - and immediately preceding the formatted data - add a track line to define the display attributes for your annotation data set.

Track lines allow you to define annotation track characteristics such as the name, description, colors, initial display mode, use score, etc. If you have included more than one data set in your annotation file, insert a track line at the beginning of each new set of data.
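Steps 1 through 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name build_annotation is hypothetical, only a single "browser position" line is written, and the data rows use the first five BED fields.

```python
def build_annotation(position, track_attrs, bed_rows):
    """Assemble browser line, track line, and tab-separated BED data lines."""
    lines = [f"browser position {position}"]

    # All track attributes go on ONE line; values with spaces are quoted.
    parts = []
    for key, value in track_attrs.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{key}={text}")
    lines.append("track " + " ".join(parts))

    # Data lines: one feature per line, fields separated by tabs.
    for row in bed_rows:
        lines.append("\t".join(str(field) for field in row))
    return "\n".join(lines) + "\n"


annotation = build_annotation(
    "chr22:1000-6000",
    {"name": "pairedReads", "description": "Clone Paired Reads", "useScore": 1},
    [("chr22", 1000, 5000, "cloneA", 960), ("chr22", 2000, 6000, "cloneB", 900)],
)
print(annotation, end="")
```

The resulting text can be saved to a file or pasted into the browser's edit box as described in Step 4.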

Step 4. View your annotation track in the Genome Browser

To view your annotation data in the Genome Browser, open up the UCSC Genome Bioinformatics home page (http://genome.ucsc.edu/), select the type of genome on which your annotation data is based, and then click on the Genome Browser link in the top menu bar. On the Genome Browser Gateway page that displays, select the assembly on which your annotation data is based, then scroll down to the Add Your Own Tracks section at the bottom of the page.

Upload your annotation file by entering the name of your file in the Annotation File box or by pasting the contents of your file into the large edit box. Scroll back to the top of the page and click the Submit button to display the Genome Browser track window with your annotation.

To upload a custom annotation track from another machine or web site, paste the URL of the track into the large edit box. Custom tracks can be displayed in conjunction with ordinary BLAT tracks.

Step 5. (Optional) Add details pages for individual track features

After you've constructed your track and have successfully displayed it in the Genome Browser, you may wish to customize the details pages for individual track features. The Genome Browser automatically creates a default details page for each feature in the track containing the feature's name, position information, and a link to the corresponding DNA sequence. To view the details page for a feature in your custom annotation track (in full display mode), click on the item's label in the annotation track window.

You can add a link from a details page to an external web page containing additional information about the feature by using the track line url attribute. In the annotation file, set the url attribute in the track line to point to a publicly available page on a web server. The url attribute substitutes each occurrence of '$$' in the URL string with the name defined by the name attribute. You can take advantage of this feature to provide individualized information for each feature in your track by creating HTML anchors that correspond to the feature names in your web page.
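The '$$' substitution amounts to a plain string replacement. A small Python sketch of the idea follows; the function name details_link and the example.org page are hypothetical and stand in for your own web page.

```python
def details_link(url_template: str, item_name: str) -> str:
    """Replace every '$$' in the track's url attribute with the feature name."""
    return url_template.replace("$$", item_name)


# Hypothetical page with one HTML anchor per feature name.
template = "http://www.example.org/clones.html#$$"
print(details_link(template, "cloneA"))
```

With one anchor per feature name in clones.html, each details page would link directly to the entry for its own feature.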

Step 6. (Optional) Sharing your annotation track with others

The previous steps showed you how to upload annotation data for your own use on your own machine. However, many users would like to share their annotation data with members of their research group on different machines or with colleagues at other sites. To make your Genome Browser annotation track viewable by others, follow the steps below. (Note that some of the URL examples in this section have been broken up into 2 lines for documentation display purposes).

Step 1. Put your formatted annotation file on your web site. Be sure that the file permissions allow it to be read by others.

Step 2. Construct a URL that will link this annotation file to the Genome Browser. The URL must contain 3 pieces of information specific to your annotation data:

1. The genome freeze on which your annotation data is based. This information is of the form db=database_name, where database_name is the UCSC code for the genome freeze. For a list of these codes, see the Genome Browser FAQ.

2. The genome position that the Genome Browser should initially open to.

3. The URL of the annotation file on your web site.

Step 3. Combine the above pieces of information into a URL of the following format (database_name, chr_position, and URL stand for the information specific to your annotation file):

http://genome.ucsc.edu/cgi-bin/hgTracks?db=database_name&position=chr_position&hgt.customText=URL
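The three pieces of information can be combined programmatically. The sketch below uses Python's standard urllib.parse.urlencode; the helper name share_url, the database code "hgXX", and the example.org file location are placeholders rather than real values, and the script path is written as cgi-bin.

```python
from urllib.parse import urlencode

# Genome Browser track display script (note the hyphen in "cgi-bin").
BASE = "http://genome.ucsc.edu/cgi-bin/hgTracks"


def share_url(db: str, position: str, annotation_url: str) -> str:
    """Build a link that opens the browser with a remote annotation file."""
    query = urlencode(
        {"db": db, "position": position, "hgt.customText": annotation_url},
        safe=":/",  # keep ':' and '/' readable instead of percent-encoding them
    )
    return f"{BASE}?{query}"


# "hgXX" is a placeholder; look up the real freeze code in the Genome Browser FAQ.
link = share_url("hgXX", "chr22:1000-6000", "http://www.example.org/tracks/test.bed")
print(link)
```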

E. What is a Permanent Track and How do I create them?

Permanent tracks are hard coded (programmed into the source code of the browser) and are static and stable. They allow a greater degree of data manipulation, because the source code can be modified to display various kinds of information about the region represented.

Permanent tracks of data are submitted to UCSC and incorporated by them into the code. Permanent tracks on the local CWRU version of this browser are added by the Computer Support Group.

Appendix A. Browser Lines

Browser lines configure the overall display of the Genome Browser window when your annotation file is uploaded. Each line defines one display attribute.

Browser lines consist of the format:

browser attribute_name attribute_value(s)

The following browser line attribute name/value options are available:

position <position> - Determines the part of the genome that the Genome Browser will initially open to, in chromosome:start-end format.

pix <width> - Sets the Genome Browser window to the specified width in pixels.

hide all - Hides all annotation tracks except for custom ones.

hide <track_name(s)> - Hides the listed tracks. Tracks must be referenced by their symbolic names. Multiple track names should be space-separated.

dense all - Displays all tracks in dense mode.

dense <track_name(s)> - Displays the specified tracks in dense mode. Symbolic names must be used. Multiple track names should be space-separated.

full all - Displays all tracks in full mode.

full <track_name(s)> - Displays the specified tracks in full mode. Symbolic names must be used. Multiple track names should be space-separated.

Note that the Genome Browser will open to the range defined in the Gateway page position box or the position saved as the default unless the browser line position attribute is defined in the annotation file. Although this attribute is optional, it's recommended that you set this value in your annotation file to ensure that the track will appear in the display range when it is uploaded into the Genome Browser.
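As an illustration of the browser line layout, here is a short Python sketch that splits a browser line into its attribute name and values and breaks a position into its parts. The function names are hypothetical, and the sketch does not check attribute names against the list above.

```python
def parse_browser_line(line: str):
    """Split 'browser attribute_name attribute_value(s)' into (name, values)."""
    words = line.split()
    if len(words) < 3 or words[0] != "browser":
        raise ValueError(f"not a browser line: {line!r}")
    return words[1], words[2:]


def parse_position(position: str):
    """Split a chromosome:start-end position into (chrom, start, end)."""
    chrom, _, span = position.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


name, values = parse_browser_line("browser position chr22:1000-6000")
print(name, parse_position(values[0]))
print(parse_browser_line("browser hide all"))
```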

Appendix B. Track Lines

Track lines define the display attributes for all lines in an annotation data set. If more than one data set is included in the annotation file, each group of data must be preceded by a track line that describes the display characteristics for that set of data. A track line begins with the word track, followed by one or more attribute=value pairs. Unlike browser lines - in which each attribute is defined on a separate line - all of the track attributes for a given set of data are listed on one line with no line breaks. The inadvertent insertion of a line break into a track line will generate an error when you attempt to upload the annotation track into the Genome Browser.

The following track line attribute=value pairs are defined in the Genome Browser:

name=<track_label> - Defines the track label that will be displayed to the left of the track in the Genome Browser window, and also the label of the track control at the bottom of the screen. The name can consist of up to 15 characters, and must be enclosed in quotes if the text contains spaces. The default value is User Track.

description=<center_label> - Defines the center label of the track in the Genome Browser window. The description can consist of up to 60 characters, and must be enclosed in quotes if the text contains spaces. The default value is User Supplied Track.

visibility=<display_mode> - Defines the initial display mode of the annotation track. Values for display_mode include: 0 for hide, 1 for dense display mode, and 2 for full display mode. The default is 1.

color=<RGB,RGB,RGB> - Defines the main color for the annotation track. The track color consists of three comma-separated RGB values from 0-255. The default value is 0,0,0 (black).

altColor=<RGB,RGB,RGB> - Defines the secondary color for the track. The alternate color consists of three comma-separated RGB values from 0-255. The default is a lighter shade of whatever the color attribute is set to.

useScore=<use_score> - If this attribute is present and is set to 1, the score field in each of the track's data lines will be used to determine the level of shading in which the data is displayed. The track will display in shades of gray unless the color attribute is set to 100,50,0 (shades of brown) or 0,60,120 (shades of blue). The default setting for useScore is 0.

priority=<priority> - Defines the display position of the track relative to other tracks in the Genome Browser window.

offset=<offset> - Defines a number to be added to all coordinates in the annotation track. The default is 0.

url=<external_url> - Defines a URL for an external link associated with this track. This URL will be used in the details page for the track. Any '$$' in this string will be substituted with the item name. There is no default for this attribute.
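A track line can be read back into its attribute=value pairs with Python's standard shlex module, which honors the quoting used for values that contain spaces. This is a sketch for checking your own files, not the browser's parser; parse_track_line is a hypothetical name, and the length checks simply restate the limits given above.

```python
import shlex


def parse_track_line(line: str) -> dict:
    """Turn 'track key=value key="quoted value" ...' into a dictionary."""
    tokens = shlex.split(line)  # keeps quoted text together, drops the quotes
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    attrs = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected attribute=value, got {token!r}")
        attrs[key] = value
    return attrs


attrs = parse_track_line(
    'track name=pairedReads description="Clone Paired Reads" useScore=1'
)
# Limits from the attribute descriptions: name <= 15, description <= 60 characters.
name_ok = len(attrs.get("name", "User Track")) <= 15
description_ok = len(attrs.get("description", "User Supplied Track")) <= 60
# Default color is 0,0,0 (black) when the attribute is absent.
color = tuple(int(c) for c in attrs.get("color", "0,0,0").split(","))
print(attrs, name_ok, description_ok, color)
```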

Appendix C. Dataset Formats

BED Lines

BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track.

The first three required BED fields are:

1.chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or contig (e.g. ctgY1).

2.chromStart - The starting position of the feature in the chromosome or contig. The first base in a chromosome is numbered 0.

3.chromEnd - The ending position of the feature in the chromosome or contig. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

The 9 additional optional BED fields are:

4.name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode.

5.score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).

6.strand - Defines the strand - either '+' or '-'.

7.thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).

8.thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

9.reserved - This should always be set to zero.

10.blockCount - The number of blocks (exons) in the BED line.

11.blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

12.blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

Example:

Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1

chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512

chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
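To see how blockSizes and blockStarts translate into genome coordinates, the following Python sketch works through the cloneA line from the example above. The function names are hypothetical, and fields are split on any whitespace for convenience, although real files should be tab-separated.

```python
def parse_int_list(text: str):
    """Parse a comma-separated list such as '567,488,' (trailing comma allowed)."""
    return [int(item) for item in text.split(",") if item]


def bed_blocks(line: str):
    """Return (chrom, start, end) for each block of a 12-field BED line.

    Coordinates are 0-based and the end is not included in the feature.
    """
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    sizes = parse_int_list(fields[10])
    starts = parse_int_list(fields[11])
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockSizes/blockStarts do not match blockCount")
    # blockStarts are relative to chromStart.
    return [
        (chrom, chrom_start + start, chrom_start + start + size)
        for start, size in zip(starts, sizes)
    ]


blocks = bed_blocks("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512")
print(blocks)
```

The second block ends at 5000, which is the chromEnd of the feature, as expected for a consistent BED line.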
PSL Lines

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. See the BLAT documentation for more details. All of the following fields are required on each data line within a PSL file:

1.matches - Number of bases that match that aren't repeats

2.misMatches - Number of bases that don't match

3.repMatches - Number of bases that match but are part of repeats

4.nCount - Number of 'N' bases

5.qNumInsert - Number of inserts in query

6.qBaseInsert - Number of bases inserted in query

7.tNumInsert - Number of inserts in target

8.tBaseInsert - Number of bases inserted in target

9.strand - '+' or '-' for query strand. In mouse, a second '+' or '-' gives the genomic strand

10.qName - Query sequence name

11.qSize - Query sequence size

12.qStart - Alignment start position in query

13.qEnd - Alignment end position in query

14.tName - Target sequence name

15.tSize - Target sequence size

16.tStart - Alignment start position in target

17.tEnd - Alignment end position in target

18.blockCount - Number of blocks in the alignment

19.blockSizes - Comma-separated list of sizes of each block

20.qStarts - Comma-separated list of starting positions of each block in query

21.tStarts - Comma-separated list of starting positions of each block in target

Example:

Here is an example of an annotation track in PSL:

track name=fishBlats description="Fish BLAT" useScore=1

59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22 47748585 13073589 13073753 2 48,20, 171,1042, 34674832,34674976,

59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr22 47748585 13073626 13073747 2 21,45, 2456,2532, 34674838,34674914,

59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2576 chr22 47748585 13073627 13073748 2 45,21, 249,349, 13073627,13073727,

Be aware that the coordinates for a negative strand in a PSL line are handled in a special way. In the qStart and qEnd fields, the coordinates indicate the position where the query matches from the point of view of the forward strand, even when the match is on the reverse strand. However, in the qStarts list, the coordinates are reversed.

Example:

Here is a 31-mer (positions 0-30 below) containing 2 blocks that align on the minus strand and 2 blocks that align on the plus strand (this can sometimes happen as a result of assembly errors):

0         1         2         3  tens position in query
0123456789012345678901234567890  ones position in query
            ++++          +++++  plus strand alignment on query
    --------    ----------       minus strand alignment on query

Plus strand:

qStart=12

qEnd=31

blockSizes=4,5

qStarts=12,26

Minus strand:

qStart=4

qEnd=26

blockSizes=10,8

qStarts=5,19

Essentially, the minus strand blockSizes and qStarts are what you would get if you reverse-complemented the query. However, the qStart and qEnd are not reversed. To convert one to the other:

qStart = qSize - revQEnd

qEnd = qSize - revQStart
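The conversion can be checked against the minus strand example above with a few lines of Python. This is a sketch: the function name is hypothetical, and the query size of 31 is an assumption inferred from the ruler, which numbers positions 0 through 30.

```python
def forward_query_range(q_size, block_sizes, rev_q_starts):
    """Convert reversed (minus strand) block coordinates to forward qStart/qEnd.

    rev_q_starts and block_sizes are the qStarts and blockSizes lists of a
    minus strand PSL line, which are given on the reverse-complemented query.
    """
    rev_q_start = rev_q_starts[0]                  # start of the first block
    rev_q_end = rev_q_starts[-1] + block_sizes[-1]  # end of the last block
    q_start = q_size - rev_q_end
    q_end = q_size - rev_q_start
    return q_start, q_end


# Minus strand example: blockSizes=10,8 and qStarts=5,19; qSize assumed to be 31.
q_start, q_end = forward_query_range(31, [10, 8], [5, 19])
print(q_start, q_end)
```

The result reproduces the qStart=4 and qEnd=26 listed for the minus strand.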

GFF Lines

GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. For more information on GFF format, refer to http://www.sanger.ac.uk/Software/formats/GFF.

Here is a brief description of the GFF fields:

1.seqname - The name of the sequence. Must be a chromosome or a contig.

2.source - The program that generated this feature.

3.feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".

4.start - The starting position of the feature in the sequence. The first base is numbered 1.

5.end - The ending position of the feature (inclusive).

6.score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".

7.strand - Valid entries include '+', '-', or '.' (for don't know/don't care).

8.frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.

9.group - All lines with the same group are linked together into a single item.

Example:

Here's an example of a GFF-based track. NOTE: If you paste this example into the Genome Browser, check that all fields remain tab-separated (some cut and paste operations may inadvertently replace the tabs with spaces).

track name=regulatory description="TeleGene(tm) Regulatory Regions"

chr22 TeleGene enhancer 1000000 1001000 500 + . touch1

chr22 TeleGene promoter 1010000 1010100 900 + . touch1

chr22 TeleGene promoter 1020000 1020000 800 - . touch2
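GFF and BED number positions differently: GFF starts counting at 1 and includes the end base, while BED starts at 0 and excludes chromEnd. The Python sketch below converts the first GFF line of the example to BED-style coordinates; the function name is hypothetical, and the line is built with explicit tabs because GFF fields must be tab-separated.

```python
def gff_to_bed_interval(line: str):
    """Return (chrom, chromStart, chromEnd) in BED coordinates for a GFF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("GFF lines need nine tab-separated fields")
    seqname, start, end = fields[0], int(fields[3]), int(fields[4])
    # 1-based inclusive start -> 0-based start; inclusive end -> exclusive end.
    return seqname, start - 1, end


gff_line = "\t".join(
    ["chr22", "TeleGene", "enhancer", "1000000", "1001000", "500", "+", ".", "touch1"]
)
interval = gff_to_bed_interval(gff_line)
print(interval)
```

A line whose tabs have been replaced by spaces fails the nine-field check, which is a quick way to catch the cut-and-paste problem mentioned above.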

GTF Lines

GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into an attribute field that includes a list of semicolon-separated attribute/value pairs. For more information on this format, see http://genes.cs.wustl.edu/GTF2.html.

Some examples of entries for the attribute field include:

gene_id value - A globally unique identifier for the genomic source of the sequence.

transcript_id value - A globally unique identifier for the predicted transcript.

Example:

Here is an example of the ninth field in a GTF data line:

gene_id Em:U62317.C22.6.mRNA; transcript_id Em:U62317.C22.6.mRNA;exon_number 1
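The attribute field can be split into its attribute/value pairs with a short Python sketch such as the one below. The function name is hypothetical, and stripping double quotes around values is an added convenience for files that quote them; the example above does not.

```python
def parse_gtf_attributes(field: str) -> dict:
    """Split a GTF attribute field into a dictionary of attribute -> value."""
    attrs = {}
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after a trailing semicolon
        key, _, value = chunk.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


gtf_attrs = parse_gtf_attributes(
    "gene_id Em:U62317.C22.6.mRNA; transcript_id Em:U62317.C22.6.mRNA;exon_number 1"
)
print(gtf_attrs["transcript_id"], gtf_attrs["exon_number"])
```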

   The Genome Browser groups together GTF lines that have the same transcript_id value. It only looks at EXON and CD