Building a molecular charge classifier

The intersection of Chemistry and A.I

Flawnson Tong
Towards Data Science

--

A.I has seen unprecedented growth in the past couple of years. Although machine learning architectures like neural networks (NNs) have been known for a long time thanks to breakthroughs from top researchers like Geoffrey Hinton, only recently have NNs become powerful tools in an A.I specialist’s toolbox. This is credited mainly to 3 changes brought about by the 21st century:

  1. Increased computing power
  2. Increased available data
  3. Increased interest in automation

Most data currently floating around on the internet, in government systems, and on personal computers is unlabeled. Labeled data (i.e. data annotated with known target values) is much more difficult to procure. Data types range from numeric values to images, audio, and text.

Code, Math, and Molecules; the trinity of difficult things to use in A.I

The monumental challenge to overcome was the fragility of discrete molecular data.

Consider this:

  • In a math equation, if one number or operation is changed, the equation is invalid.
  • In a block of code, if any variable, method, function, etc. is changed, the code will execute differently.
  • In a molecule, if a bond, atom, or electron is changed, the molecule is entirely different.

Neural networks have a very unique process that makes them different from other algorithms. Data that goes in is bound to be manipulated within the NN; that is to say, it will be influenced by weights, biases, activation functions, and a whole host of other adjustable, learnable, and specified parameters.

Data that goes into a NN will be changed.

Manipulating fragile discrete data is like walking on thin ice. Currently, there are 4 main methods to represent a molecule, listed in order of lossiness:

  1. Molecular Fingerprints
  2. String Representations
  3. Molecular Graphs
  4. Simulation
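To make the fingerprint idea concrete, here is a toy sketch of my own (not the author's code, and far simpler than real chemical fingerprints like ECFP, which hash actual substructures with a toolkit such as RDKit): it folds hashed character bigrams of a SMILES string into a fixed-length bit vector.

```python
import hashlib

def bit_fingerprint(smiles: str, n_bits: int = 256) -> list:
    """Toy fingerprint: hash each character bigram of a SMILES
    string and fold it into a fixed-length bit vector.
    Illustrative only -- real fingerprints hash substructures."""
    bits = [0] * n_bits
    for i in range(len(smiles) - 1):
        bigram = smiles[i:i + 2]
        # Stable hash of the bigram, folded into the vector
        h = int(hashlib.md5(bigram.encode()).hexdigest(), 16)
        bits[h % n_bits] = 1
    return bits

fp = bit_fingerprint("Cc1ccccc1")  # toluene
print(len(fp))  # 256
```

Notice the lossiness: two different bigrams can land on the same bit, so the mapping from molecule to fingerprint is not reversible.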

Each has its own unique process, but the ones I use to build the molecular charge classifier are string representations and molecular fingerprints. Most cheminformatic and molecular data is available in the form of InChI strings or, my preferred format, SMILES strings. Most of the time, SMILES strings look like a form of chemical gibberish:

Molecules with their respective SMILES strings

These strings are generated with the help of a grammar. Just as in English one word can have many definitions, or one idea can be communicated with different sentences, a single molecule can have many different SMILES strings.

The many faces of toluene; all the strings represent the same molecule

Diagrams are provided by Dr. Esben Bjerrum, a leading expert and researcher in the journey to conquer chemistry with A.I. To learn more about canonical SMILES enumeration, take a look at his blog post.

To prevent duplicates, only one SMILES string variation was used per molecule. It is worth noting that every molecule has a primary string representation, also known as the canonical SMILES. Armed with a clean, labeled dataset of molecules in SMILES format, I set out to build a molecular charge classifier.
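The deduplication step above can be sketched like this. The toluene variants and the lookup table mapping each variant to its canonical form are illustrative; in practice a cheminformatics toolkit such as RDKit computes the canonical SMILES for you.

```python
# Hypothetical lookup: SMILES variant -> canonical form.
# In practice a toolkit like RDKit canonicalizes each string.
CANONICAL = {
    "Cc1ccccc1": "Cc1ccccc1",
    "c1ccccc1C": "Cc1ccccc1",
    "c1ccc(C)cc1": "Cc1ccccc1",
}

def deduplicate(smiles_list):
    """Keep exactly one SMILES string per unique molecule."""
    seen, unique = set(), []
    for s in smiles_list:
        canon = CANONICAL.get(s, s)  # fall back to the string itself
        if canon not in seen:
            seen.add(canon)
            unique.append(canon)
    return unique

# All three strings are toluene -> one canonical string survives
print(deduplicate(["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"]))
# ['Cc1ccccc1']
```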

Project Charged

I used SMILES strings and a neural network to build a molecular charge classifier in Python with the Keras library. Every molecule has a certain charge of 0, +1, -1, or -2 (other charges exist, but all the molecules in my dataset carried one of these 4).

The project was broken down into 4 steps:

  1. Importing SMILES data (with labels, i.e. the charges)
  2. Converting each string to a 256-bit binary code
  3. Building the deep NN
  4. Compiling and fitting the model

For this project, the model is trained on 1,000 SMILES strings from a dataset of 250,000. I turned each SMILES string into a 256-bit binary code so that I could build the NN with a matching number of input neurons: 256. The model halves the number of neurons at each successive layer, from 256 → 128 → 64 → 32 → 16, and finally to 4 output neurons, one for each possible charge in the dataset (0, +1, -1, -2).

10% of the data (100 SMILES strings) was taken out of the training set to be used as validation data. The model uses relu as its input layer activation function and sigmoid for the hidden layers, before finally outputting results in the form of probabilities between 0 and 1 with the softmax function. Classification is therefore presented as a probabilistic decimal that the NN calculates based on its training.
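The 90/10 holdout split can be sketched as follows; the dataset here is randomly generated stand-in data, not the project's real molecules.

```python
import random

random.seed(42)

# Stand-in dataset: 1,000 encoded molecules with their charge labels
data = [
    ([random.randint(0, 1) for _ in range(256)],
     random.choice([0, 1, -1, -2]))
    for _ in range(1000)
]

# Shuffle, then hold out the last 10% for validation
random.shuffle(data)
split = int(0.9 * len(data))
train, validation = data[:split], data[split:]
print(len(train), len(validation))  # 900 100
```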

The model was tested on 10 new molecules and successfully predicted the charges of all ten, with some predictions yielding higher confidence (above 80%) and others lower (around 40%).

All the code and datasets for Project Charged can be found on my GitHub, in this repository.

The challenge with chemical data is finding the method to represent the molecules you are working with. The ideal representation accounts for the fragility of molecular data, while still keeping inherent features and patterns intact. The representation should strike the right balance between accuracy and computational efficiency.

Key Takeaways

  1. Molecules, like code and math, are prone to fragility when being manipulated
  2. There are 4 methods of representing molecules in computational terms: fingerprints, string representations, molecular graphs, or simulation
  3. SMILES is a popular string representation; many strings can represent one molecule, but there is only one canonical SMILES
  4. Converting a molecule into binary fingerprints and inputting them into a neural network is one method to leverage A.I in chemistry
  5. Molecular representation and cheminformatics as a whole is a balance of accuracy and computational power

Still reading this? Want more? Not sure what to do next?

  • Share it on LinkedIn, Facebook, or Twitter! (And add me while you’re at it)
  • Check out my portfolio for more cool projects, content, and updates!
  • Reach out on any platform for questions, collaborations, and ideas!

Onwards!
