January 9-13, 2010
Town & Country Convention Center
San Diego, CA
Niina Haiminen, Isidore Rigoutsos
With any genome project, assembling the genomic sequence from sequenced DNA and RNA fragments is a challenging task. In recent years, numerous algorithms and tools have been developed for assembling a genomic sequence from the output of next-generation short-read sequencing technologies. These tools are often reported to perform well on small, repeat-free genomes, such as bacterial genomes. As genome size and repeat content increase, and as sequencing errors are introduced into the reads, assembling a complete genomic sequence becomes much more challenging.
In conjunction with the T. cacao sequencing project, IBM is performing studies on synthetic data that help validate the output of the assembly tools applied. A region is chosen from a reference sequence (e.g., that of rice), sequencing reads originating from that region are simulated, and the synthetic reads and any other simulated data are run through the assembly algorithm or pipeline. The resulting assembly is then compared to the known reference to pinpoint any problems in the assembly process.
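The validation loop described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the uniform substitution-error model, and the exact-match coverage measure are all assumptions chosen for brevity, and a random string stands in for a real reference region. The assembler itself is left out; any list of contig strings can be passed to the comparison step.

```python
import random

BASES = "ACGT"


def random_reference(length, seed=0):
    """Hypothetical stand-in for a region cut from a real reference sequence."""
    rng = random.Random(seed)
    return "".join(rng.choice(BASES) for _ in range(length))


def simulate_reads(reference, read_length=50, coverage=10, error_rate=0.01, seed=0):
    """Sample fixed-length reads uniformly from the reference.

    Assumed error model: each base is independently replaced by a
    different base with probability error_rate (substitutions only).
    Returns (start_position, read_sequence) pairs.
    """
    rng = random.Random(seed)
    n_reads = coverage * len(reference) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append((start, "".join(read)))
    return reads


def covered_fraction(reference, contigs):
    """Fraction of reference positions covered by exactly matching contigs.

    A deliberately crude comparison: contigs containing any mismatch
    contribute nothing, so a low value flags a problem without locating it.
    """
    covered = [False] * len(reference)
    for contig in contigs:
        if not contig:
            continue
        pos = reference.find(contig)
        while pos != -1:
            for i in range(pos, pos + len(contig)):
                covered[i] = True
            pos = reference.find(contig, pos + 1)
    return sum(covered) / len(reference)


if __name__ == "__main__":
    reference = random_reference(2000, seed=1)
    reads = simulate_reads(reference, read_length=50, coverage=20, error_rate=0.01)
    print(len(reads), "simulated reads")
    # Contigs would normally come from the assembler under test;
    # here the reference itself is used as a trivially perfect assembly.
    print(covered_fraction(reference, [reference]))
```

In practice the comparison step would use an alignment tool rather than exact string matching, so that misassemblies, collapsed repeats, and base errors can be located and not merely detected.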