An application of machine learning for DNA-encoded library screening

A post in the Google AI Blog features a paper by McCloskey et al (Journal of Medicinal Chemistry 2020), reporting an application of machine learning to DNA-Encoded Library (DEL) screening results. Here are my learning notes.

Principles of DNA-encoded small-molecule libraries
Applying machine learning to DEL screening
Conclusions and thoughts

Principles of DNA-encoded small-molecule libraries

Screening with DNA-encoded small-molecule libraries, often abbreviated as DEL (DNA encoded library), is a method used in early drug discovery, particularly in the phase of screening and hit identification. It complements other screening approaches such as high-throughput screening.

Building a DEL library reassembles the process of Lego building. The possible objects that can be built with a few logo blocks seem almost unlimited. The exponential growth of possible configurations of pre-defined building blocks and connections between them is key to the construction of DEL library.

What are the building blocks in DEL and connections between them? A building block in DEL is a small molecule that can react with other molecules. It will become a fragment of synthesized compounds that will be finally screened. A connection is a pre-defined chemical reaction that brings building blocks together and transforms them into intermediate or final products of the synthesis. Suppose that we have 100 building blocks, and they undergo four sequential chemical reactions. If every compound can interact with all other compound and itself during the reactions, and the products of the reactions are all different, after four reactions we can have 100*100*100*100, or a billion compounds.

Every child and parent who played with Lego knows the pain to find a particular block. How is the problem solved in DEL? It uses DNA as a barcode - a genius idea in my opinion - because (1) DNA can store information, (2) a piece of DNA can be extended by another piece of synthesized DNA to append some new information, and (3) DNA sequencing has been a mature technology with ever-increasing capacity and dwindling cost. At each step of the reaction, we can append the DNA barcode of each molecule with the additional barcode of the new building block. Finally, each molecule has its specific DNA sequence, which allows us to recover it. I wish sometimes I have such a Lego-finder as well at home!

In classical high-throughput screening based on enzyme affinity, a huge amount of protein is needed to test the affinity of each compound in the library to it. In contrast, during a DEL screening, we do not need to isolate molecules, since they come with their own barcodes. We can modify our target protein so that it is bound to a solid surface, incubate the surface with the DEL, wash away molecules that do not bind, and change the washing condition so that we get only the high-affinity molecules. By PCR and next-generation sequencing, we can reconstruct the structure of these molecules.

This is an extremely simplified illustration of how a DEL is built and how we can do a screening with it. In reality, there are many, many more variants, limitations, and nuisances to the principles described above. The use of DNA molecules, for instance, is not unproblematic because they are negatively charged and therefore may interfere with certain structures of encoded small molecules. Another apparent limitation is that the reactions that are amenable to DEL are much more limited than the ones medicinal chemists can usually use. This leads to the consequence that the chemical space covered by a DEL can be distinct from the space covered by classical high-throughput screening libraries.

Applying machine learning to DEL screening

An important application of machine learning is to learn important features and prominent patterns from a small dataset and use them to make predictions in a larger dataset. An advantage is that the entity of the larger dataset may not even exist - they can be merely numbers. We may produce them and probe them only when it appears interesting to us - known as the principle of made-on-demand.

This pattern is observed in many applications of machine learning. Gmail Smart Compose and many other tools that predicts the word we write next as we write is an example. Such tools uses large collection of texts, known as corpora (plural of corpus), to predict the next word. However large the corpora are, they do not necessarily cover all human language, especially not necessarily the sentence that we are writing and for sure not the things that we will write in the future. In this sense, we build machine learning models in a small dataset and apply them into a large dataset, part of which may not exist yet. In a Gmail program, we may choose to use the suggested word or not, therefore produce them and probe them only when it appears fitting. It is another question whether it is a good idea to rely on machine learning to prompt us about what we say and what we write. I hope, nevertheless, the parallel to the pattern that I described above is clear.

In the paper, the authors implemented such an approach as well. First, they run DEL screening with three well studied therapeutic proteins: soluble epoxide hydrolase (EPHX2, epoxide hydrolase 2), tyrosine-protein kinase KIT, and estrogen receptor alpha ESR1. In the second step, they processed and aggregated the sequencing data, which correspond to the molecule structures. In the third step, they trained machine learning models based on aggregated selection data, during the process of which they used no prior off-DNA activity measurements, and applied the models to virtually screen large molecules of compounds that are easily synthesizable or inexpensive purchasable (which exist though at this stage only as numbers). Finally, they used automated diversity filters, reactive substructure filters, and relied on a chemist’ review to select promising compounds, and tested them experimentally.

Two machine learning approaches are used, one is the random forest, the other one is based on graph convolutional neural network (GCNN, Kearnes et al., Journal of Computer-Aided Molecular Design 2016). GCNN is an instance of graph networks methods that I learned and blogged about the other day. The authors found that the GCNN models worked better than the random forest. And they reported that the machine learning models enriched hits: up to 29% hits were verified at one micromolar. In contrast, the hit rate is about 1% in traditional high-throughput screening. I think the baseline is set a bit too low because the knowledge derived from a DEL screening can even without a machine learning model inform hit selection, and hence the comparison is not all fair. However, the message of the authors is clearly delivered and well-received: results of a DEL screening can be used for virtual screening to identify other hits.

The study suggests that an integrated application of machine learning, DEL screening, and virtual screening can identify structurally diverse starting points for tool compound discovery and lead generation. For these needs, the paper’s approach deserves evaluation and implementation.

Conclusions and thoughts

The paper was a good and interesting read. Though the authors, for the purpose of the study, did not use any prior information about the targets and ligands. It can be imagined that in real-life settings prior information will be used to inform decisions.

DEL selection generates large quantity and high-quality data that are amenable to machine learning, and make-on-demand small molecule libraries are a source of low-cost, structurally diverse compounds for virtual screening. I argue that increasing hit diversity with DEL, machine learning, and virtual screening is an example of problems that matter and can be computationally solved.

The approach of training machine learning models from a small-scale screening, and applying them for a large-scale, virtual screening was also used in the paper A Deep Learning Approach to Antibiotic Discovery by Stockes et al, Cell 2020. I believe the concept can be also applied in other areas of hit identification.

As a side note, I felt relieved that the authors from Google and X-Chem did not use the term artificial intelligence. Ironically, the only time it appeared was in the name of the special issue in which the article is included: Artificial Intelligence in Drug Discovery. As a friend of mine observes: “Those who do cool staff towards intelligence do not use the word”.

I thank Iakov Davydov who pointed me to the blog post and the paper and Manfred Kansy for teaching me a lot about physical chemistry of drug discovery.

P.S.: Derek Lowe also shared his impressions and insights on the paper from the perspective of a medicinal chemist in a post in his blog In the pipeline, which I recommend reading.