PgmNr 1707: Comprehensive haplotype resolved MHC sequences from whole genome shotgun sequencing from single individual.
Authors:
J. Chin 1; A. Dilthey 2; A. Fungtammasan 1; S. Garg 3; E. Garrison 4; M. Rautiainen 5; M. Tobias 6; J. Wanger 7; Q. Zeng 8; J. Zook 7; The MHC team for Pan-genomics in the Cloud hackathon 2019
1) DNAnexus, Inc., Mountain View, California; 2) Institute of Medical Microbiology, University Hospital of Dusseldorf; 3) Department of Genetics, Harvard Medical School, Boston, MD; 4) Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK; 5) Center for Bioinformatics, Saarland University, Saarbrucken, Germany; 6) Max Planck Institute for Informatics, Saarbrucken, Germany; 7) Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD; 8) LabCorp, Inc., NC
During a three-day hackathon organized by UCSC in March 2019, which gathered researchers and developers to explore approaches for pan-genomic DNA sequencing analysis, we formed a team to resolve the MHC region of a single individual, HG002. Recent advances in long-read DNA sequencing technologies have made routine assembly of new human genomes a less daunting task than before, so a comprehensive view of each human genome should be obtainable in the near future. However, not all regions of a human genome will be resolved equally well, because sequence complexity varies across the genome. The Major Histocompatibility Complex (MHC) region is one such example. Given the medical importance of this region, various targeted approaches have previously been developed to obtain additional haplotype-resolved sequences, which are currently represented as ALT contigs in GRCh38. In this work, we explore the possibility of reconstructing haplotype-resolved contigs for the MHC using only whole genome shotgun sequencing data from a single individual.
Although many de novo human genome assemblies have been published recently, most do not have fully resolved contigs for the MHC region because of limitations in either accuracy or read length. Our work examines all currently available data for HG002 from multiple sequencing vendors (10x Genomics, Pacific Biosciences, Oxford Nanopore Technologies), collected by the Genome in a Bottle project. Using multiple independent sequencing datasets that complement each other, we derive a strategy to increase the accuracy of phasing SNPs and reads for a haplotype-resolved assembly. The multi-technology approach overcomes the accuracy and read-length limitations of each technology used alone. We build reproducible pipelines in Jupyter notebooks for assembling de novo contigs, one for each haplotype, spanning the whole MHC region. We construct variant graphs representing intricate large-scale differences from a reference. The SNP-phasing accuracy is validated against phased variants derived from the maternal and paternal genomes. The full spectrum of variation in the MHC region is identified from the variant graphs and contigs. We hope that such a comprehensive variation catalog of the MHC region will lead to insight into the associated biology. The pipelines make such complicated bioinformatics tasks reproducible and make the approach reusable for future work on building better MHC genomic sequences for pan-human genome projects.
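As an illustration of the validation step, the sketch below compares a read-based phasing with a trio-derived phasing of the same heterozygous SNPs by counting switch errors. The abstract does not say which metric the team used, so the switch error rate, the function names, and the 0/1 input encoding are all assumptions made for illustration only.

```python
def count_switch_errors(predicted, truth):
    """Count phase switches between two phasings of the same het sites.

    Assumed encoding (hypothetical, not from the abstract): each input is a
    list of 0/1 values giving the allele carried by haplotype 1 at each
    heterozygous site, in genomic order.
    """
    if len(predicted) != len(truth):
        raise ValueError("phasings must cover the same sites")
    # True where predicted haplotype 1 carries the same allele as truth haplotype 1.
    agree = [p == t for p, t in zip(predicted, truth)]
    # A switch is a change in agreement between adjacent sites, so a phasing
    # that is flipped everywhere (haplotype labels swapped) counts as zero switches.
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def switch_error_rate(predicted, truth):
    """Switch errors divided by the number of adjacent site pairs."""
    n_pairs = len(predicted) - 1
    if n_pairs < 1:
        return 0.0
    return count_switch_errors(predicted, truth) / n_pairs


# Toy data: six heterozygous SNPs, read-based phasing vs. trio-derived phasing.
read_based = [0, 1, 1, 0, 1, 0]
trio_based = [0, 1, 0, 1, 0, 0]
print(count_switch_errors(read_based, trio_based))
print(switch_error_rate(read_based, trio_based))
```

Counting changes in agreement rather than raw mismatches keeps the measure independent of which haplotype happens to be labeled first, which matters when the two phasings come from unrelated tools.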