High-throughput DNA sequencing technologies are currently revolutionizing the fields of biology and medicine by elucidating the structure and function of the components of life. Modern DNA sequencing machines typically produce relatively short reads of DNA, which are then assembled by software in an attempt to produce a representation of the entire genome. Due to the complex structure of all but the smallest genomes, especially the abundant presence of exact or almost exact repeats, all genome assemblers introduce errors into the final sequence and output a relatively large set of contigs (DNA sequences built from the overlaps between many reads) instead of full-length chromosomes. These problems are dramatically worse when homologous copies of the same chromosome differ substantially. Currently such genomes are usually avoided as assembly targets and, when they are attempted, the resulting assemblies are generally of relatively low quality. An improved algorithm for the assembly of such data would dramatically improve our understanding of the genetics of a large class of organisms. We present a unique algorithm for the assembly of diploid genomes that have a high degree of variation between homologous chromosomes. The approach uses coverage, graph patterns, and machine-learning classification to identify haplotype-specific sequences in the input reads. It then uses these haplotype-specific markers to guide an improved assembly. We validate the approach with a large experiment that isolates and elucidates the effect of single nucleotide polymorphisms (SNPs) on genome assembly more clearly than any previous study. The experiment conclusively demonstrates that the Bioluminescence heterozygous genome assembler produces dramatically longer contigs with fewer haplotype-switch errors than competing algorithms under conditions of high heterozygosity.



College and Department

Physical and Mathematical Sciences; Computer Science



Keywords

genome, genome assembly, polymorphic, polymorphism, heterozygous, haplotype, algorithm