
In this post, we present results from our model trained to predict antiviral activity against COVID-19. The main technical challenge in designing and training such models is still the lack of sufficient data, that is, molecules labeled with their associated antiviral activity. Despite multiple ongoing experimental efforts, the data available in the public domain is very limited. For instance, this paper (Jeon et al., 2020) reports testing of 48 FDA-approved molecules, of which 24 are identified as potent. On its own, this dataset is clearly insufficient for training a neural model.

Therefore, we have to augment the data with other relevant sources that can further inform the model. We identified two such sources. The first consists of molecular fragments that bind to the SARS-CoV-2 main protease, obtained via crystallography screening by the Diamond Consortium. The second comes from readily available screening data for a close relative of the virus, SARS-CoV-1. This data is relevant because the SARS-CoV-1 and SARS-CoV-2 proteases are similar (79.5% sequence identity). While both sources are clearly pertinent, they also differ significantly from COVID-19 screens. For instance, the binding fragments in the Diamond dataset are much smaller than the molecules we are trying to classify. This prevents us from using the standard property prediction models that we and others have successfully used in the past for in-silico screening.

The key technical challenge is to estimate models that can extrapolate beyond their training data, e.g., to different chemical spaces. The ability to extrapolate implies a notion of invariance (being impervious) to the differences between the available training data and the setting where predictions are sought. In other words, the differences in chemical spaces can be thought of as "nuisance variation" that the predictor should be explicitly forced to ignore. To this end, we introduce a novel approach that builds on and extends the recently proposed invariant risk minimization framework, adaptively forcing the predictor to avoid nuisance variation. We achieve this by continually exercising and manipulating latent representations of molecules to highlight undesirable variation to the predictor. This method is specifically tailored to the rich, combinatorially defined environments typical of molecular contexts. For full details of the method, please see the paper.
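
To make the idea of invariance across environments a little more concrete, here is a minimal sketch of the standard invariant risk minimization penalty (the IRMv1 formulation of Arjovsky et al.), which our method builds on. It is not our extended method. The function names, the environment names, and the toy logits and labels are all hypothetical, chosen only for illustration. Each environment contributes its usual classification risk plus the squared gradient of that risk with respect to a dummy scale on the logits; the penalty is small when the same predictor is simultaneously near-optimal in every environment.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_risk(logits, labels, w=1.0):
    """Mean binary cross-entropy of the scaled logits w * z."""
    total = sum(softplus(w * z) - y * w * z for z, y in zip(logits, labels))
    return total / len(logits)


def irm_penalty(logits, labels):
    """Squared gradient of bce_risk with respect to the dummy scale w, at w = 1.

    For cross-entropy this gradient has the closed form
    mean((sigmoid(z) - y) * z), so no autodiff library is needed here.
    """
    grad = sum((sigmoid(z) - y) * z for z, y in zip(logits, labels)) / len(logits)
    return grad ** 2


def irm_objective(environments, lam=1.0):
    """Sum over environments of risk + lam * penalty.

    `environments` maps a (hypothetical) environment name to (logits, labels).
    """
    return sum(
        bce_risk(z, y) + lam * irm_penalty(z, y)
        for z, y in environments.values()
    )


# Toy, made-up predictor outputs for two illustrative environments.
toy_environments = {
    "fragment_screen": ([2.0, -1.0, 0.5], [1, 0, 0]),
    "related_virus_screen": ([1.5, 0.3, -2.0, -0.7], [1, 1, 0, 0]),
}
print(irm_objective(toy_environments, lam=10.0))
```

In a real training loop the logits would come from a neural model and the objective would be minimized with a gradient-based optimizer; the sketch only shows how the per-environment penalty is formed.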

How well does this method work? The real answer to this question requires experimental testing of candidate molecules. We are currently investigating this with our collaborating scientists from WRAIR, sponsored by DARPA. We ran our model on the Broad Repurposing Library (6246 molecules) and the WRAIR internal library (800K molecules), from which our WRAIR collaborators selected 20 molecules for further testing in the lab. For the predictions of our model on the Broad library, click here. Meanwhile, we can get an initial assessment of model quality from the efficacy data we already have. In a cross-validation setting, we obtain an AUROC of 0.893, compared to 0.820 for the best baseline. The caveats stated above call for caution in interpreting these results, but the new model consistently outperformed all the tested baselines. We will update you as new experimental data becomes available.
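
For readers less familiar with the AUROC metric quoted above, the following is a small illustrative sketch of how it can be computed: it is the fraction of (active, inactive) molecule pairs in which the active molecule receives the higher score, with ties counted as half. The function name and the toy scores are hypothetical and are not our evaluation code or data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for active molecules and 0 for inactive ones.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # active ranked above inactive
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up model scores for four molecules, two active and two inactive.
print(auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of molecules, which is fine for a small labeled set; a rank-based implementation would be preferable for large libraries.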