Let’s make some molecules with machine learning! ⚛️

M.L in medicine and material science

Flawnson Tong
Towards Data Science

--

It takes 10 years on average for a new molecule to reach the mass market.

In April 1912, the RMS Titanic collided with an iceberg in the Atlantic while on her maiden voyage, claiming roughly 1,500 lives and making history. Decades later, the wreck was located, still rich with 20th century history. A couple of months after the wreck was found, the Challenger space shuttle was set to launch with the eager eyes of millions watching from screens across the nation. 73 seconds after launch, the shuttle broke apart in midair.

Historically relevant material failures

The fundamental problem with both disasters lies beyond human error and design. The problem is in the material. The hull of the Titanic failed, just as the right solid rocket booster of the Challenger did. We remember these events as tragedies not because they happened, but because they could have been prevented. Better materials were known while the Titanic and the Challenger were being built. But simply knowing that better materials exist is not enough; they have to be available and applicable.

The problem is the immense time it takes for a molecule to get from knowledge to application. Today, the average time is 10 years.

The average time for a drug to complete this process is 12 years, at a cost of 1.5 billion dollars (courtesy of Scientifist)

10 years is enough time for trillions of dollars to be spent, half a billion people to die, and millions of hours to be wasted trying to solve problems that could have been made far less severe if only we had the right material or drug.

This isn’t new. Historically, scientific progress has been slow. Tools made of stone remained the standard for centuries, and medical treatments that were usually more harmful than helpful were prescribed regularly. It wasn’t until humanity placed emphasis on the advancement of science that we began creating better technology, kicking off a positive feedback loop:

For a while, this feedback loop resulted in stable linear growth. But as our mastery of science grew, so too did our technological advancements. With the advent of technology like A.I, this growth is bound to accelerate exponentially.

A.I is making waves in the domain of science by giving researchers more control over molecular space.

In my previous article, A.I enhanced molecular discovery and optimization, current methods of accelerating science with A.I were discussed at a high level. Here I’ll walk through a recent project of mine that follows the most common workflow/pipeline employed in current research on ML applications in science.

Project Novel

The goal of this project is to generate novel molecules with a Recurrent Neural Network. These molecules might not be useful, perhaps not even valid, but the idea is to train the model to learn the patterns in SMILES strings such that the output resembles valid molecules. SMILES (Simplified Molecular-Input Line-Entry System) is a string representation of a molecule, based on its structure and components. For example, Ciprofloxacin looks a bit like this:

Colors correspond to the characters (courtesy of Wikipedia)

The neural network of choice depends on the type of data we’ll be feeding it. In this case, we’ll be feeding it SMILES strings, so a Recurrent Neural Network (RNN) is best suited for the job. It’s customary to enhance the inner architecture of an RNN with its more efficient and effective relative, the LSTM cell. We’ll be using a couple of layers consisting of LSTMs, trained on a dataset of over 200,000 SMILES strings.

Step 1 — Mapping:

The first step, as is in most language processing RNNs, is to create a mapping of characters to integers (to allow the neural network to process the data) and vice versa (to translate results back to characters).

The easiest way to do this is to create a set of unique characters and enumerate each item. In a natural language like English, there are 26 letters (double that for capital letters), along with a whole host of punctuation and syntax characters. In SMILES strings, there are two types of characters:

  1. Special characters: “/”, “(“, “)”, “=”, etc.
  2. Elemental symbols: “C”, “O”, “Si”, “Co”, etc.

This list of unique characters is enumerated and conveniently placed into a dictionary.
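
To make this concrete, here’s a rough sketch of what the mapping might look like (the tiny `smiles_data` list below is just a stand-in for the real dataset, and the variable names are my own, not necessarily the project’s):

```python
# Toy stand-in for the real dataset of 200,000+ SMILES strings
smiles_data = ["CC(=O)Oc1ccccc1C(=O)O", "C1CCCCC1", "O=C(O)c1ccccc1"]

raw_text = "\n".join(smiles_data)                  # one long training string
chars = sorted(set(raw_text))                      # unique characters (incl. "\n")
char_to_int = {c: i for i, c in enumerate(chars)}  # character -> integer
int_to_char = {i: c for i, c in enumerate(chars)}  # integer -> character
n_vocab = len(chars)                               # size of the character "vocabulary"
```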

It’s worth noting that the dictionary does not treat elemental symbols consisting of 2 characters as a single element. For example, the “S” and “i” of the elemental symbol for silicon count as 2 independent characters. This means the model will have to learn the difference between “Si” and “S”, which are entirely different elements. It’s entirely possible to hard-code these 2-character symbols into the dictionary, but just for fun, I left them separate to see how well the model would perform.

Step 2 — Data preprocessing:

Once you have your unique character mapping, you can translate every character in the SMILES string dataset into integers. A simple lookup in the dictionary we built in Step 1 does the trick.

At the same time, you can normalize all the integers by dividing each by the number of unique characters in the dataset. Finally, reshape the resulting integer-encoded and normalized dataset into a format that fits the neural network model.
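
Here’s a hedged sketch of that preprocessing, continuing from the mapping above; the window length of 100 characters is an assumption on my part, not a value from the original project:

```python
import numpy as np
from keras.utils import to_categorical

seq_length = 100                                    # assumed input window length
dataX, dataY = [], []
for i in range(len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]             # window of characters
    seq_out = raw_text[i + seq_length]              # the character to predict next
    dataX.append([char_to_int[c] for c in seq_in])
    dataY.append(char_to_int[seq_out])

# Reshape to (samples, timesteps, features) and normalize by the vocabulary size
X = np.reshape(dataX, (len(dataX), seq_length, 1)) / float(n_vocab)
y = to_categorical(dataY)                           # one-hot encoded targets
```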

Step 3 — Model architecture:

Using Keras, building the model is super easy.
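
The exact layer sizes below are my own illustrative choices rather than the project’s originals, but the overall shape (stacked LSTMs with dropout, ending in a softmax over the character vocabulary) is the idea:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))                                 # fewer neurons in the next layer
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation="softmax"))   # one output unit per character
```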

Play around with the numbers! Try more layers, more neurons, more dropout, whatever you fancy!

Our model has multiple layers, with a decreasing number of neurons per hidden layer; feel free to play around with these numbers. It’s a big generalization, but the rule of thumb is: the more layers and neurons in the neural network, the more computationally intensive it is, but also the better its performance.

Step 4 — Checkpoints:

We’ve all been there. You’ve made lots of progress on that essay or report you’ve been writing when all of a sudden your computer glitches out — and all your work disappears into the abyss. Incidentally, training neural networks has the same problem. Thankfully, there’s a solution.

Remember in the new Mario games, once you got past the halfway mark, the game would save your progress?

For nostalgia, and to emphasize how important it is to SAVE!!! (courtesy of TrustedReviews)

Checkpointing is a built-in feature of the Keras library that allows us to save our training progress (the weights of the model at any given epoch) to a file, which can then be transferred to another device or saved for later. Checkpointing is probably one of the least appreciated yet most useful machine learning tricks to have up your sleeve.
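
A minimal sketch of setting one up; the filename pattern and monitored metric here are my own choices:

```python
from keras.callbacks import ModelCheckpoint

# Save the weights to disk whenever the training loss improves
filepath = "weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor="loss", save_best_only=True,
                             mode="min", verbose=1)
callbacks_list = [checkpoint]
```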

Don’t forget to pass the callbacks parameter when fitting your model!

Checkpoints are especially useful for separating training from the predicting, classifying, or generating steps. By training on a GPU or a cloud service and then loading the saved weights on a CPU, you can cut down the time needed to complete a project. Checkpoints are therefore useful for transfer learning, or simply for pausing and resuming training. They can also be used to sample the output of your model at each improving epoch, giving a little more transparency into the network.

Step 5 — Training:

I used categorical cross-entropy as the loss function with the Adam optimizer (which combines the adaptive learning rates of RMSprop and AdaGrad with a built-in momentum term). To take advantage of as much of the dataset as possible, the model was trained for 19 epochs with a batch size of 512. Generally, more epochs paired with smaller batch sizes let the network learn from the data better, but at the cost of a much longer training time.
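
Roughly, the compile-and-fit step looks like this (the epoch count and batch size are the ones mentioned above; everything else is a reasonable default rather than the project’s exact code):

```python
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=19, batch_size=512, callbacks=callbacks_list)
```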

Step 6 — Generating:

Generating is relatively straightforward. First, we load the checkpoints that we saved during training (so we don’t have to retrain the model every time we want to generate a new molecule). The next step is to select a random SMILES string from the dataset as a seed, and finally generate a specified number of characters.

The exact same code can be used to generate any kind of text; in this case, the text happens to be a molecule.
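
Here’s a hedged sketch of that loop, reusing the variables from the earlier steps; the checkpoint filename is hypothetical, and I use greedy argmax sampling for simplicity (sampling from the predicted distribution usually gives more varied output):

```python
import random
import numpy as np

# Load the saved weights instead of retraining
model.load_weights("weights-19-1.2345.hdf5")          # hypothetical checkpoint file
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Seed with a random window from the dataset, then generate character by character
start = random.randint(0, len(dataX) - 1)
pattern = list(dataX[start])
generated = ""
for _ in range(200):                                   # number of characters to generate
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))
    generated += int_to_char[index]
    pattern.append(index)                              # slide the window forward
    pattern = pattern[1:]
print(generated)
```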

Results will vary according to the dataset size, learning rate, and other adjustments made to the model, but ideally you should get an output that resembles a valid molecule. If you plug in a checkpoint file from an earlier, lower-accuracy epoch, you can actually see how the neural network learns and gets better over time. Usually, it starts with a sequence of just one character:

CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Over time it gets better, learning to alternate characters:

C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1C1

Eventually the output becomes even more diverse as it learns substructures:

C1C1C1C1C1C1((((((((((/////////cccccccccHcHcHcHcHcHcHcHc)))))))

The goal is to get as close to

O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5

as possible.
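
If you want a quick way to check whether a generated string is actually a valid molecule, RDKit (not part of this project, just a handy cheminformatics library) returns None when it can’t parse a SMILES string:

```python
from rdkit import Chem

samples = [
    "CCCCCCCCCC",                                          # valid but boring
    "C1C1C1C1C1C1((((((/////ccccHcHc)))",                  # early-epoch gibberish
    "O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5",  # a real molecule
]
for s in samples:
    mol = Chem.MolFromSmiles(s)                            # None if the SMILES is invalid
    print("valid" if mol is not None else "invalid", "->", s)
```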

Possible Improvements

This project could be enhanced in a couple different ways, some centered around the data and others around the architecture of the model itself.

The Dataset

The dataset of 200k molecules is sizeable, but a bigger one wouldn’t hurt. Apart from scouring the internet for more SMILES strings, one technique that could be implemented is data augmentation.

Data augmentation as explained using Andy Warhol paintings

Data augmentation is actually a bit like an Andy Warhol painting: take one image, change it a bit, and add it to your dataset. That, in essence, is data augmentation, albeit oversimplified. The same can be done with SMILES strings: take a string, find its permutations, and add them to your dataset.

In this case, the goal of the project is to generate molecules that are as close to valid as possible, not molecules with a specified or desired property. This eliminates the risk of biasing results towards one specific type of molecule, so it is safe to augment our existing dataset of 200k molecules. We can augment the data by enumerating each SMILES string beyond its canonical form, which basically means writing the same SMILES string in other ways, all of which ultimately represent the same molecule. This gives the model significantly more strings to learn from, since each molecule has many different SMILES strings, and their number grows with the size and complexity of the molecule.
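
Here’s a sketch of one way to do that enumeration with RDKit (again, RDKit isn’t part of the project itself; this is just an assumption about how the augmentation could be implemented): randomly renumber the atoms and write out a non-canonical SMILES for the same molecule.

```python
import random
from rdkit import Chem

def enumerate_smiles(smiles, n=10):
    """Return up to n alternative SMILES strings for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(n):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)                              # random atom ordering
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return variants

print(enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O"))           # several strings, one aspirin
```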

The Architecture

In this project, an LSTM was used as the main node in each layer. There exist, however, plenty of other powerful options. The Neural Turing Machine (NTM) and its bigger brother, the Differentiable Neural Computer (DNC), for example, are state-of-the-art architectures created by researchers at Google. Both are impressive due to an external component that allows them to “remember” things better than LSTMs; this external component acts as a sort of memory bank.

The DNC architecture (courtesy of Nature)

You can read all about them in these blog posts (NTMs, DNCs) and research papers (NTMs, DNCs), or get a high-level overview in my article about the members of the RNN architecture family. Essentially, these two powerful alternatives both have memory banks with built-in attention mechanisms, which allow them to selectively memorize the portions of the data (text in this case, but it could also be images) that they learn are most important. This could be really useful for generating molecules with desirable properties, since some properties, like toxicity, are determined by a substructure within the molecule, which, translated into SMILES, is a specific portion of the string.

There are other factors to consider, such as the abundance of other possible datasets and formats (e.g. a different type of molecular representation, such as SMARTS, instead of SMILES) and architectural improvements (e.g. the use of adversarial training or reinforcement learning). There are already some noteworthy implementations of both.

Specialized tools

There are some pretty awesome tools out there for large-scale, high-quality work that not many people are aware of. Computational material science, biology, chemistry, and physics projects use packages like:

  • Matminer — A Python library for data mining in material science
  • Magpie — A Java based library for ML in material property prediction
  • PyMKS — A Python library that specializes in structure-property relations
  • Deepchem — A Python library that democratizes ML in science
  • Openbabel — A Python and C++ library for bio- and cheminformatics

These specialized packages play nicely with Python’s built-in and third-party libraries, meaning they can be used together with NumPy, SciPy, Matplotlib, etc. Future work will explore what these toolkits can do!

What’s next + Key Takeaways

Project Novel is an example of the most rudimentary way molecules can be generated with basic M.L tools and a solid dataset. Most research combines techniques like adversarial training or reinforcement learning to boost the validity of generated molecules or to bias the model towards molecules with desired properties.

Even at their best, LSTMs have a limit to how convincing they can make a sample of generated text look. The reason lies in SMILES strings themselves, which are not the optimal way to represent molecules computationally. In the near future there will likely be better frameworks designed specifically for cheminformatics and related scientific fields. For now, however, the standard remains the dynamic duo of recurrent neural nets and molecular string representations.

In my next post, we’ll dive into where the intersection of science and A.I is heading and what will change as M.L accelerates science.

Key Takeaways

  • Material and medicinal failures are the result of the slow pace of research and development
  • The slow pace of research and development can be improved with exponential technologies like M.L
  • The most popular way A.I is currently applied to science is through recurrent neural networks and molecular string representations
  • Recurrent neural networks and molecular string representations are not the optimal way to apply M.L to science
  • The intersection of A.I and science has only scratched the surface; projects are still rudimentary, and there is lots to look forward to!

Need to see more content like this?

Follow me on LinkedIn, Facebook, Instagram, and of course, Medium for more content.

All my content is on my website and all my projects are on GitHub

I’m always looking to meet new people, collaborate, or learn something new so feel free to reach out to flawnsontong1@gmail.com

Upwards and onwards, always and only 🚀

