Virtually all genome assemblers to date are designed for use with data from haploid or homozygous diploid genomes. Their use on heterozygous genomic datasets generally results in highly-fragmented, error-prone assemblies, owing to the violation of assumptions during both the contigging and scaffolding phases. Of the two phases, scaffolding is more particularly impacted and algorithms to facilitate the scaffolding of heterozygous data are lacking. We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. A fundamental step in the scaffolding phase is the assignment of sequence orientations to contigs within scaffolds. Deciding such an assignment in the presence of ambiguous evidence is what is termed the contig orientation problem. We define this problem using bidirected graph theory and show that it is equivalent to the weighted MAX-CUT problem. We present a greedy heuristic solution which we comparatively assess with other solutions to the contig orientation problem, including an advanced MAX-CUT heuristic. We illustrate how a solution to this problem provides a simple means of simultaneously identifying inverted haplotypes, which are uniquely found in diploid genomes and which have been shown to be involved in the genetic mechanisms of several diseases. Ultimately our findings show that due to the inherent biases in the underlying biological model, a greedy heuristic algorithm performs very well in practice, retaining a higher total percent of edge weight than a branch-and-bound semidefinite programming heuristic. This application exemplifies how existing graph theory algorithms can be applied in the development of new algorithms for more accurate assembly of heterozygous diploid genomes.



College and Department

Physical and Mathematical Sciences; Computer Science



Date Submitted


Document Type





Genome Assembly, Scaffolding, Weighted MAX-CUT Algorithms, Bidirected Graph Theory