PAG-XVIII  Plant & Animal Genomes XVIII Conference

January 9-13, 2010
Town & Country Convention Center
San Diego, CA



W112 : Cacao Genome Sequencing


The Need For An Assembly Pilot Project

David N Kuhn1, Chris A Saski2, F. Alex Feltus2,3, Niina Haiminen4, Dorrie Main5, Greg D May6, Raymond J Schnell1, Juan C Motamayor1,7, Keithanne Mockaitis8, Brian Scheffler9, Howard Shapiro7

1  USDA-ARS Subtropical Horticulture Research Station (SHRS), Miami, FL 33158, USA.
2  Clemson University Genomics Institute, Clemson University, Clemson, SC 29634, USA.
3  Department of Genetics & Biochemistry, Clemson University, Clemson, SC 29634, USA.
4  IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA.
5  Washington State University, Department of Horticulture and Landscape Architecture, Pullman, WA 99164, USA.
6  National Center for Genome Resources, Santa Fe, NM 87505, USA.
7  Mars Incorporated, McLean, VA 22101, USA.
8  Indiana University Center for Genomics and Bioinformatics, Bloomington, IN 47405, USA.
9  USDA-ARS Genomics and Bioinformatics Research Unit, Stoneville, MS 38776, USA.

Progress has been rapid since the cacao genome sequencing project began in June 2008: the physical map is complete, and approximately 10x coverage of the genome has been accumulated as 454 Titanium sequence data from Matina1-6, the highly homozygous Amelonado tree chosen for the project. Our IBM collaborators have been evaluating the currently available sequence assembly software and benchmarking it with synthetic datasets of various sizes and error rates. Serious concerns have been raised about the ability to assemble a genome the size of cacao (n = 10, ~460 Mb) de novo from 454 sequence data. The current assembly of the 454 data (version 3) has 171,816 contigs (296 Mb), whereas the physical map produced at CUGI has only 295 contigs (representing >90% of the genome), 109 of which are anchored to the genetic recombination map. A pilot assembly project has been proposed to determine whether de novo assembly of a region of ~3 Mb is possible from 454 sequence data. The pilot will use pooled BACs from the minimum tiling path of a single physical-map contig (~3 Mb) that contains several disease resistance and horticultural QTLs. In addition, a subset of the BACs (~1 Mb) will be Sanger sequenced. To test the assembly pipeline, a synthetic dataset will be prepared with distributions of read lengths and errors that reflect those typically found in 454 sequence data. Successful assembly at the pilot scale will provide a strategy for completing the assembly of the genome sequence represented by the physical map.
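
As an illustration only, a synthetic read set of the kind described above could be generated along the following lines. This is a minimal sketch, not the pipeline used by the project: the function names, the normal read-length model (mean 400 bp, SD 100 bp), the 1% per-base indel rate, and the use of a random reference sequence are all assumptions chosen for the example. 454 errors are modeled here simply as single-base insertions and deletions, a rough stand-in for homopolymer miscalls.

```python
import random


def random_genome(length, rng):
    """Return a random A/C/G/T string standing in for a reference region."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def add_indel_errors(seq, error_rate, rng):
    """Introduce single-base indels at roughly `error_rate` per base.

    Half of the errors delete the base and half duplicate it. This is an
    assumed, simplified error model, not a calibrated 454 error profile.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < error_rate / 2:
            continue                 # deletion: drop the base
        if r < error_rate:
            out.append(base)         # insertion: emit the base twice
        out.append(base)
    return "".join(out)


def simulate_reads(genome, coverage, mean_len, sd_len, error_rate, rng):
    """Sample reads uniformly until the sampled bases reach the target coverage."""
    reads = []
    sampled = 0
    target = coverage * len(genome)
    while sampled < target:
        # Read length drawn from a normal distribution, clamped to [50, genome length].
        length = max(50, int(rng.gauss(mean_len, sd_len)))
        length = min(length, len(genome))
        start = rng.randrange(0, len(genome) - length + 1)
        fragment = genome[start:start + length]
        reads.append(add_indel_errors(fragment, error_rate, rng))
        sampled += length
    return reads


if __name__ == "__main__":
    rng = random.Random(42)                 # fixed seed for reproducibility
    region = random_genome(30_000, rng)     # small stand-in for a ~3 Mb region
    reads = simulate_reads(region, coverage=10, mean_len=400, sd_len=100,
                           error_rate=0.01, rng=rng)
    print(len(reads), "reads simulated")
```

Because the true sequence of the synthetic region is known, contigs produced by an assembler from such reads can be compared directly against it, which is what makes a dataset of this kind useful for benchmarking.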