Annotation of BAC Sequences with SUGAR

The Sea Urchin Gene AnnotatoR (SUGAR) was originally designed by Alistair Rust and built on top of the publicly available Genotator annotation workbench. More recently, SUGAR has been rewritten as part of the Cartwheel and FamilyRelations package by Titus Brown and others. By summarizing all related search results in a single graphical window, SUGAR provides a convenient overview which can be interactively interrogated for more detail.

SUGAR typically displays one BAC at a time, usually 120 kb to 150 kb of concatenated contigs of different lengths. In the main SUGAR window, individual ordered contigs are shown in alternating colors on a wide horizontal bar labeled "ordered contigs". Data are stored in two formats: ACEdb and GFF (General Feature Format), a generic format useful for data exchange, for example with our FamilyRelations program.
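As an illustration of why GFF is convenient for data exchange, the sketch below reads one GFF record in Python. It assumes the usual tab-delimited GFF layout (sequence name, source, feature, start, end, score, strand, frame, then an optional attributes column); the function name `parse_gff_line` and the sample record are hypothetical and are not taken from SUGAR or FamilyRelations.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Parse one tab-delimited GFF line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    # The ninth (attributes/group) column is optional in older GFF files.
    attributes = fields[8] if len(fields) > 8 else ""
    return GffFeature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


# Hypothetical record: an exon prediction on the forward strand of a BAC.
record = parse_gff_line("BAC_example\tgenscan\texon\t1200\t1450\t0.95\t+\t0\tgene_1")
print(record.feature, record.start, record.end)  # prints: exon 1200 1450
```

Because every feature reduces to the same handful of columns, results from different search and prediction programs can be written to one file and read back by another tool without program-specific parsers.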

SUGAR presents the user with a great deal of useful information simultaneously. BLAST search results against a number of databases are displayed in graphical form aligned against the BAC sequence. These include SwissProt, GenBank, known cDNAs, sea urchin ESTs, S. purpuratus repeat sequences and BAC ends.

SwissProt hits are displayed at two different levels of significance for ease of visual analysis. Forward strand hits are colored red, while reverse strand hits are shown in green. Clicking the mouse on a colored bar representing a SwissProt hit opens up another browser window in which details of the sequence alignment and other information are displayed. In addition, the top 25 hits are listed in descending order of significance. Clicking on an item in this table takes the browser to the more detailed alignment view.
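The ranking and two-level display described above can be sketched as follows. This is only an illustration: it assumes that significance is measured by a BLAST E-value (lower is more significant), and the cutoffs `strong_cutoff` and `weak_cutoff`, the dictionary keys, and the function names are hypothetical rather than SUGAR's actual settings.

```python
def top_hits(hits, limit=25):
    """Return up to `limit` hits, most significant (lowest E-value) first."""
    return sorted(hits, key=lambda hit: hit["evalue"])[:limit]


def significance_level(evalue, strong_cutoff=1e-20, weak_cutoff=1e-5):
    """Assign a hit to one of two display levels, or None if it is too weak.

    Both cutoffs are placeholder values chosen for this example.
    """
    if evalue <= strong_cutoff:
        return "high"
    if evalue <= weak_cutoff:
        return "low"
    return None


# Hypothetical hits; the names are placeholders, not real SwissProt entries.
hits = [
    {"name": "protein_A", "evalue": 1e-8},
    {"name": "protein_B", "evalue": 1e-40},
    {"name": "protein_C", "evalue": 0.5},
]
for hit in top_hits(hits):
    print(hit["name"], significance_level(hit["evalue"]))
```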

The protein matches and exon detection markers are color-coded to distinguish between forward and reverse strand hits. The BAC-end hits (this database contains 76,000 BAC-end sequences; Cameron et al., 2000) are color-coded to distinguish between "unique" matches (colored orange; these are in fact matches to three or fewer BAC-ends) and "repeats" (colored black; these are sequences matching larger numbers of BACs). Again, clicking on a block opens a new browser display which shows the results in more detail. At the top of this display is a more detailed graphic of the BAC-end alignments. This is followed by a table of unique BAC-end accession names, the individual alignment information, and histograms of the number of BAC-ends in a repeat block.
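The "unique" versus "repeat" rule for BAC-end hits amounts to a simple threshold on the number of distinct BAC-ends matching a block. A minimal sketch follows; the threshold of three comes from the description above, while the function name, the input format, and the accession strings are hypothetical.

```python
UNIQUE_MAX_BAC_ENDS = 3  # "unique" means matches to three or fewer BAC-ends


def classify_bac_end_block(matching_bac_ends):
    """Return (label, display color) for a block, given the accessions that match it."""
    distinct = len(set(matching_bac_ends))  # count each BAC-end only once
    if distinct == 0:
        raise ValueError("a block must match at least one BAC-end")
    if distinct <= UNIQUE_MAX_BAC_ENDS:
        return ("unique", "orange")
    return ("repeat", "black")


# Placeholder accession names, for illustration only.
print(classify_bac_end_block(["end_001", "end_002"]))
print(classify_bac_end_block(["end_%03d" % i for i in range(12)]))
```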

Note that in the main SUGAR display window (see individual gene links), below the line representing the BAC sequence, SUGAR displays hits from several gene/exon identification programs. These are:

  • HMMGene is a program for prediction of genes in anonymous DNA. The methods used are described in the paper: A. Krogh: Two methods for improving performance of an HMM and their application for gene finding. In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.
  • GenScan is a general-purpose gene identification program which analyzes genomic DNA sequences from a variety of organisms including human, other vertebrates, invertebrates and plants. For each sequence, the program determines the most likely "parse" (gene structure) under a probabilistic model of the gene structural and compositional properties of the genomic DNA for the given organism.
  • GeneID is a program, designed with a hierarchical structure, to predict genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs). In the second step, exons are built from the sites.

Forward-strand hits are displayed in blue and reverse strand hits in green. Slider controls allow the user to zoom in and out and pan right and left as desired.

Identification of putative coding or regulatory regions using SUGAR is very much a matter of reviewing and weighing the evidence on a case-by-case basis. For example, a region marked with many "SU repeat table" hits and displaying black (repeat) BAC-end markings would probably not be considered a significant regulatory or coding segment, even if other searches such as SwissProt or any of the exon finders also show hits in that exact region (as might occur in the case of a recently inactivated pseudogene). Conversely, a region in which multiple independent results concur is more likely to be a true positive.
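The reasoning above can be written down as a rough triage rule. The sketch below is not part of SUGAR, and it does not replace case-by-case review; it only encodes the two heuristics just described. The class, the field names, and both thresholds (`many_repeats`, `min_concurring`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RegionEvidence:
    swissprot_hit: bool = False
    est_or_cdna_hit: bool = False
    exon_finder_hits: int = 0     # how many of HMMGene, GenScan, GeneID predict an exon here
    su_repeat_hits: int = 0       # hits against the "SU repeat table"
    repeat_bac_end: bool = False  # region carries a black (repeat) BAC-end marking


def triage_region(ev, many_repeats=3, min_concurring=2):
    """Give a first-pass label for a region; thresholds are illustrative assumptions."""
    # Repeat evidence takes precedence, even when protein or exon hits are present
    # (for example, a recently inactivated pseudogene inside a repeat).
    if ev.su_repeat_hits >= many_repeats and ev.repeat_bac_end:
        return "probable repeat"
    # Otherwise count how many independent lines of evidence concur.
    concurring = int(ev.swissprot_hit) + int(ev.est_or_cdna_hit) + ev.exon_finder_hits
    if concurring >= min_concurring:
        return "candidate"
    return "inconclusive"


print(triage_region(RegionEvidence(swissprot_hit=True, exon_finder_hits=3,
                                   su_repeat_hits=5, repeat_bac_end=True)))
# prints: probable repeat
print(triage_region(RegionEvidence(swissprot_hit=True, exon_finder_hits=2)))
# prints: candidate
```

A label such as "candidate" is only a prompt to inspect the underlying alignments; the final call still rests on the annotator's judgment.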