PgmNr 1535: High-level optimizations over query engines ensemble: Accelerating distributed genomic data science.

Authors:
A. Szmurlo; M. Wiewiórka; T. Gambin

Affiliation: Computer Science Institute, Warsaw University of Technology, Warszawa, Mazowieckie, Poland


Background: Nucleic acid sequencing is widely adopted in molecular diagnostics and lies at the core of large-scale research projects, which produce immense amounts of data that are meaningless until examined by algorithms. At present, computational analysis is the major bottleneck in both clinical settings and sequencing research projects, although there are several ideas for improving the performance of bioinformatics pipelines. Distributed computing is one reasonable option, but the available distributed software suffers from either limited functionality or poor overall performance, because it is usually optimized for only one of two kinds of analysis (single-sample analysis for clinical purposes vs. cohort/case-control research studies).

Methods: In recognition of the varied access patterns, instead of relying on a single query engine we plan to construct and use an ensemble of data stores, each optimized for specific conditions. We designed an efficient primary model that exposes different views of the data, together with helper structures and synchronization mechanisms. Our high-level optimizer, aware of these elements, routes each query to the component expected to perform best under the conditions encountered.
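
The routing idea can be illustrated with a minimal sketch. It is not the authors' optimizer: the engine names, the Query fields, and the cost formulas below are hypothetical placeholders chosen only to show how a high-level component might pick one engine from an ensemble by comparing estimated costs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Query:
    """Hypothetical query description; these fields are assumptions."""
    samples: int    # number of samples the query touches
    region_bp: int  # width of the genomic interval, in base pairs


@dataclass(frozen=True)
class Engine:
    """One member of the ensemble, with a made-up cost estimator."""
    name: str
    cost: Callable[[Query], float]  # lower estimated cost is better


def route(query: Query, engines: Sequence[Engine]) -> Engine:
    """Return the engine with the lowest estimated cost for this query."""
    return min(engines, key=lambda engine: engine.cost(query))


# Illustrative cost models only, not measured values:
# the single-sample store is cheap per region but expensive per sample,
# the cohort store pays a fixed start-up cost but scales well with samples.
ENGINES = [
    Engine("single_sample_store",
           lambda q: 1.0 * q.region_bp + 1000.0 * q.samples),
    Engine("cohort_store",
           lambda q: 50000.0 + 0.5 * q.region_bp + 1.0 * q.samples),
]

clinical = route(Query(samples=1, region_bp=1000), ENGINES)
cohort = route(Query(samples=2500, region_bp=1000), ENGINES)
print(clinical.name, cohort.name)
```

Under these assumed cost models, a one-sample query goes to the single-sample store and a 2500-sample query goes to the cohort store. A real optimizer would presumably derive its estimates from data statistics and the helper structures rather than from fixed formulas.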

Benchmarking: We ran tests on 1000 Genomes Project data on our Hadoop-based cluster. The tests confirmed that both an efficient implementation of appropriate distributed algorithms and routing each query to a suitable query engine significantly reduce computation time for computationally demanding bioinformatics operations.

Conclusions: We expect that optimized access to data will streamline tertiary analyses, resolve the challenge of manipulating heterogeneous genomic data sets, and lay the foundation for wider adoption of data-heavy machine learning methods, bringing new stimuli to personalized medicine.