PgmNr 221: A robust and production-level approach to haplotype-resolved assembly of single individuals.
Authors:
S. Garg 1; C. Fungtammasan 5; A. Carroll 6; R. Hall 4; E. Hatas 4; M. Mahmoud 2; F. Sedlazeck 2; M. Chou 1; J. Aach 1; J. Zook 3; J. Chin 5; G. Church 1
1) Harvard Medical School, Boston, MA.; 2) Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston TX 77030; 3) Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899; 4) Pacific Biosciences, Menlo Park, California; 5) DNAnexus, Mountain View, California; 6) Google Genomics, Mountain View, California
Reconstructing the complete, phased sequence of every chromosome copy in a human individual is a high-priority goal for medical and population genetics. Most current approaches collapse both haplotypes into a single assembly, discarding phase information. Although efforts have been made to reconstruct phased sequences, they either require >200 CPU hours or fail to assemble continuous haplotype sequences. There is a pressing need for a streamlined, production-level approach that can reconstruct high-quality phased sequences and that can be applied to hundreds of human genomes.
Here, we propose an integrative de novo assembly and phasing strategy that leverages new forms of long-read and long-range connectivity data in a computationally efficient manner. Specifically, our approach combines complementary high-throughput sequencing and connectivity datasets such as PacBio CCS and Hi-C, constructs a preliminary high-quality haploid consensus, and then conducts an optimized partitioning of reads and a complete, separate assembly of each homolog, all within a single integrated algorithm. Our approach produces high-quality diploid assemblies (excluding centromeres), is highly scalable, and can be integrated into a cloud platform for production-level assembly of many individual genomes. An additional advantage is that it avoids reference sequence bias, which could interfere with the discovery of sequences unique to particular individuals or populations, and so allows for the detection of novel structural variants.
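The read-partitioning step can be illustrated with a deliberately simplified sketch. It is not the optimized partitioning described above: it assumes biallelic heterozygous sites encoded as 0/1, assumes a phased haplotype 1 is already available (haplotype 2 being its complement), and assigns each read by simple majority vote. The function name `assign_haplotype` and the dictionary layout are hypothetical choices made for illustration only.

```python
from typing import Dict


def assign_haplotype(read_alleles: Dict[int, int], hap1: Dict[int, int]) -> int:
    """Assign a read to haplotype 1 or 2 by majority vote.

    read_alleles maps a heterozygous site position to the allele (0 or 1)
    observed in the read. hap1 maps a site position to the allele carried
    by haplotype 1; haplotype 2 is assumed to carry the other allele.
    Returns 1 or 2, or 0 when the read is uninformative or tied.
    """
    votes1 = 0
    votes2 = 0
    for site, allele in read_alleles.items():
        if site not in hap1:
            continue  # site was not phased, so it carries no information
        if allele == hap1[site]:
            votes1 += 1
        else:
            votes2 += 1
    if votes1 == votes2:
        return 0
    return 1 if votes1 > votes2 else 2


# Hypothetical phasing of three heterozygous sites.
hap1 = {100: 0, 200: 1, 300: 0}
reads = {
    "read_a": {100: 0, 200: 1},
    "read_b": {100: 1, 300: 1},
    "read_c": {999: 1},
}
partition = {name: assign_haplotype(alleles, hap1) for name, alleles in reads.items()}
```

Reads assigned to 1 or 2 would go to the corresponding per-haplotype assembly; reads assigned 0 span no informative site and need separate handling.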
We demonstrate the feasibility of our approach on three genomes, one each from the Personal Genome Project (PGP-1), the Genome in a Bottle project (HG002) and the 1000 Genomes Project (NA12878). We produce highly continuous haplotype-resolved assemblies with an N50 of 15.4 Mb, and show that we require as little as 20x coverage of PacBio CCS and 30x of Hi-C to generate high-quality assemblies. We also discover novel phased sequences that are not included in GRCh38 and are private to each genome. We validate these novel phased sequences against BAC or trio data.
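For readers unfamiliar with the continuity metric, N50 is the length L such that contigs of length L or longer together contain at least half of the total assembled bases. The sketch below shows the standard calculation on made-up contig lengths; the function name `n50` and the example values are illustrative and are not taken from the assemblies reported here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total is 20, and 6 + 5 = 11 covers at least half of it.
example = n50([2, 3, 4, 5, 6])
```

For a haplotype-resolved assembly, the same calculation would be applied to the contigs of each haplotype separately.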
In summary, our novel computational approach efficiently and robustly combines data from new sequencing and genome connectivity mapping technologies to produce high-quality diploid assemblies. These assemblies will support the community research goal of producing accurate, end-to-end finished human genomes of individuals, and so lead to improvements in personalized medicine and an increased understanding of human genome sequence diversity.