PgmNr 1697: A scalable framework for identifying genetic variant set associated with polygenic-traits in UK Biobank.
Authors:
Y.-C. Hwang 1; P. Nguyen 2; B.T. Hannigan 1; J. Chin 2
1) Science, DNAnexus, Mountain View, CA.; 2) Deep Learning, DNAnexus, Mountain View, CA.
Genome-wide association studies (GWAS) have been instrumental in discovering disease- and trait-associated genetic variants, typically single-nucleotide polymorphisms (SNPs). GWAS use univariate tests to identify SNPs that are marginally associated with a trait, but most human traits are polygenic, that is, influenced by more than one gene. Recently, the lasso (least absolute shrinkage and selection operator) has been used as a multivariate prediction model to select a set of SNPs that together predict a phenotype of interest. This regression method performs variable selection and coefficient estimation simultaneously.
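For reference, the standard lasso objective for a quantitative trait can be written as below. The notation is our own assumption rather than taken from the abstract: $y_i$ is the phenotype of individual $i$, $x_i$ the vector of SNP genotypes, $n$ the number of individuals, and $\lambda \ge 0$ the penalty weight. The $\ell_1$ penalty drives many coefficients exactly to zero, which is how a single fit both selects SNPs and estimates their effects.

```latex
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
  \;+\; \lambda \,\lVert \beta \rVert_1
```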
UK Biobank, a large prospective population-based cohort study, collects extensive genotypic and phenotypic data on 500,000 individuals from the UK, aged 40-69 years. The study includes genome-wide genotyping data (805,426 measured variants per individual) and more than 2,000 health-related phenotypes. This large-scale, ultrahigh-dimensional cohort makes it statistically feasible to find subsets of genes associated with a polygenic trait in human populations. However, the data are too large to fit into limited storage and memory, which poses a computational challenge. Recently, Qian et al. proposed the batch screening iterative lasso (BASIL) algorithm, which reduces the problem to a manageable size by fitting the lasso iteratively and parallelizing the screening step in each iteration. Although BASIL works with subsets of predictors, it yields the exact lasso solution rather than an approximation. They implemented BASIL as a highly optimized R package, snpnet, and provided examples of finding SNP sets that predict two quantitative traits (height and BMI) and two qualitative traits (asthma and high cholesterol). They reported that BASIL computed results 20% faster than the alternatives, with better I/O efficiency.
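The screen-fit-check idea behind this kind of algorithm can be illustrated with a minimal Python sketch. This is not the snpnet implementation: it handles a single penalty value instead of a full regularization path, keeps all data in memory, omits the intercept, and uses a plain coordinate-descent solver. The function names (`lasso_cd`, `screened_lasso`) and parameters (`batch`, `tol`) are hypothetical, chosen only for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Plain coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1.

    No intercept and a fixed number of sweeps; an illustrative solver only.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]            # remove variable j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]            # add the updated contribution back
    return beta

def screened_lasso(X, y, lam, batch=50, tol=1e-6, max_rounds=100):
    """Iterative screening (hypothetical sketch, not the snpnet code).

    Each round: (1) screen inactive variables by |x_j^T r| / n,
    (2) add up to `batch` of the strongest KKT violators to the active set,
    (3) refit the lasso on the active set only.
    Stops when no inactive variable violates the KKT condition.
    """
    n, p = X.shape
    active = np.zeros(p, dtype=bool)
    beta = np.zeros(p)
    for _ in range(max_rounds):
        resid = y - X[:, active] @ beta[active]
        score = np.abs(X.T @ resid) / n
        # KKT condition for a zero coefficient: |x_j^T r| / n <= lam
        violators = np.where(~active & (score > lam + tol))[0]
        if violators.size == 0:
            break
        strongest = violators[np.argsort(score[violators])[::-1][:batch]]
        active[strongest] = True
        beta[:] = 0.0
        beta[active] = lasso_cd(X[:, active], y, lam)
    return beta
```

The final check against the KKT condition is what allows a fit on a subset of predictors to agree with a fit on all of them. In a real large-scale setting the screening pass would read genotype batches from disk rather than hold `X` in memory, which this sketch does not attempt.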
In this study, we packaged snpnet into a portable Docker image and deployed it in a cloud environment. After reproducing the discovery of SNP subsets associated with the original four phenotypes, we extended the analysis to another 1,000 phenotypes. By leveraging the flexible capacity of cloud computing, we were able to discover SNP set associations for all of these phenotypes efficiently, without the risk of overloading a local server. The snpnet Docker image and cloud app enable easy adoption of the tool and make use of massive cloud resources for analyzing high-dimensional datasets such as UK Biobank.