103 - Automated Rapid Screening of DNA Sequences for Threat Identification
Author Block: J. M. Schuetter, M. A. Schwemmer, O. P. Tabbaa; Battelle Mem. Inst., Columbus, OH
As the technology for DNA synthesis and manipulation improves, the resources and expertise needed to synthetically create biological threats are decreasing. As a result, individuals can more easily acquire material that can be used to generate intentional or accidental biothreats, creating the potential for a biosecurity crisis. In anticipation, the biosecurity community has begun to develop procedures for screening and identifying potentially threatening DNA sequence data, but these methods typically suffer from a lack of standardization, high false positive rates, and time-consuming manual review. Through Battelle’s work on the IARPA Functional Genomic and Computational Assessment of Threats (Fun GCAT) program, several deep learning models have been developed to predict the threat level of a nucleotide or amino acid sequence using no information beyond the sequence itself. The models are trained and evaluated using both a curated Sequence of Concern (SoC) database and SwissProt. The curated SoC database includes high-quality metadata on properties of SoCs, which are used in conjunction with standardized threat logic to categorize known SoCs into four threat bins. The models are then trained to mimic the threat logic’s standardized categorization of SoCs, using only a sequence as input. Initial model results demonstrated 84% accuracy in predicting the binary threat status (Threat vs. Non-Threat) and roughly 77% accuracy in predicting the more granular four-category threat status, which ranges from negligible to severe. Furthermore, model predictions take less than two seconds per sequence, allowing this procedure to be used as a link in a decision-making pipeline (as on Fun GCAT) or as a screening or triage step in the manual review process.
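The relationship between the four-category prediction, the binary Threat vs. Non-Threat call, and the triage role described above can be illustrated with a minimal sketch. It is not the Fun GCAT implementation, and everything in it is an assumption made for illustration: the abstract names only the endpoints of the scale ("negligible" and "severe"), so the two middle bin labels are placeholders; treating every bin above "negligible" as a threat is a guess; and the `triage` function, its `review_margin` parameter, and the idea of flagging near-tied scores for manual review are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical bin labels: the abstract names only the endpoints
# ("negligible" and "severe"); the two middle labels are placeholders.
BINS = ("negligible", "low", "moderate", "severe")

# Assumption: every bin above "negligible" counts as "Threat".
THREAT_BINS = frozenset(BINS[1:])


@dataclass
class TriageResult:
    bin_label: str             # predicted four-category bin
    is_threat: bool            # binary Threat vs. Non-Threat call
    needs_manual_review: bool  # True when the top two bins are nearly tied


def triage(bin_scores: Sequence[float], review_margin: float = 0.2) -> TriageResult:
    """Collapse four per-bin scores to a binary call and flag close calls."""
    if len(bin_scores) != len(BINS):
        raise ValueError("expected one score per threat bin")
    # Rank bin indices from highest to lowest score.
    ranked = sorted(range(len(BINS)), key=lambda i: bin_scores[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    label = BINS[top]
    # Route to manual review when the winning margin is small.
    close_call = (bin_scores[top] - bin_scores[runner_up]) < review_margin
    return TriageResult(label, label in THREAT_BINS, close_call)


if __name__ == "__main__":
    print(triage([0.90, 0.05, 0.03, 0.02]))  # confident "negligible"
    print(triage([0.10, 0.10, 0.35, 0.45]))  # "severe", but a close call
```

In this sketch, a confident prediction is passed straight through, while a near-tie between two bins is flagged so that manual review effort goes only to ambiguous sequences.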