PAG-XIII  Plant & Animal Genomes XIII Conference

January 15-19, 2005
Town & Country Convention Center
San Diego, CA



P042 : Genome Sequencing & ESTs


Biologist-Friendly Pipeline For Processing And Masking Sequences

Marta Matvienko1, Alexander Kozik2

1  Allometra, 1950 Fifth Street, Davis, CA, 95616, USA
2  Genome Center at UC Davis, One Shields Ave, Davis, CA, 95616, USA

Publicly available and newly generated sequences are often contaminated with vector sequences, poly(A) tails, and linkers. Using such sets for BLAST comparisons may produce artifactual hits caused by the contamination, or no hits at all because of low information content. To address this concern, we developed a new function, Sequence Processor, for our program PyMood™. This tool removes and masks undesired sequences in FASTA files.
To demonstrate the utility of this tool, we analyzed the cocoa EST sequences from GenBank. The sequences were sorted according to:


  1. Their level of homology to the reference sequence file that contained all GenBank vector sequences and repeats.

  2. The sequence composition: length, percentage of "N" letters, and "GC" content.
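The composition measures in the second criterion can be illustrated with a minimal Python sketch. This is not PyMood code; the function name `composition` is hypothetical, and the choice to compute GC content over the full sequence length (with "N" letters counted in the denominator) is an assumption, since the abstract does not say how the tool handles this.

```python
def composition(seq):
    """Return (length, percent_n, percent_gc) for one nucleotide sequence.

    Assumption: percentages are taken over the full sequence length,
    so "N" letters are included in the denominator for GC content.
    """
    s = seq.upper()
    length = len(s)
    if length == 0:
        # Avoid division by zero for empty records.
        return 0, 0.0, 0.0
    n_count = s.count("N")
    gc_count = s.count("G") + s.count("C")
    return length, 100.0 * n_count / length, 100.0 * gc_count / length


if __name__ == "__main__":
    print(composition("ACGTNNGC"))
```

A sequence could then be ranked or filtered on any of the three returned values.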


In less than 10 minutes, the cocoa set was separated into a FASTA file of usable sequences and a file of unusable sequences. The undesired portions of the usable sequences were masked. We compared the result of our cocoa EST processing, followed by sequence assembly, with the existing cocoa gene index from TIGR.
The PyMood Sequence Processor lets users select the cutoff parameters for sorting sequences through a user-friendly graphical interface. Any lab biologist can easily run the program on a Windows or Mac desktop or laptop. The outcomes of selecting different cutoffs for data processing are presented. The same function can also be used to remove and/or mask particular protein domains or motifs.
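The general idea of sorting by cutoffs and masking can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PyMood implementation: the function names, the default cutoff values, and the use of "N" as the masking letter are all hypothetical, and the coordinates of a region to mask are assumed to come from a separate similarity search that is not shown.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mask_region(seq, start, end, mask_char="N"):
    """Replace seq[start:end] (0-based, end exclusive) with mask_char."""
    return seq[:start] + mask_char * (end - start) + seq[end:]


def split_by_cutoffs(records, min_length=100, max_percent_n=5.0):
    """Split records into (usable, unusable) lists.

    The cutoff values are hypothetical defaults chosen for illustration.
    """
    usable, unusable = [], []
    for header, seq in records:
        pct_n = 100.0 * seq.upper().count("N") / len(seq) if seq else 100.0
        if len(seq) >= min_length and pct_n <= max_percent_n:
            usable.append((header, seq))
        else:
            unusable.append((header, seq))
    return usable, unusable


if __name__ == "__main__":
    demo = ">a\nACGT\nACGT\n>b\nNNNN\n>c\nAC\n"
    good, bad = split_by_cutoffs(parse_fasta(demo), min_length=4, max_percent_n=50.0)
    print([h for h, _ in good], [h for h, _ in bad])
    print(mask_region("ACGTACGT", 2, 5))
```

Changing `min_length` or `max_percent_n` moves sequences between the two output lists, which is the kind of cutoff effect the abstract refers to.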
The resulting Theobroma cacao FASTA files are available for download from http://allometra.com