
Molecular Weight Gap in ZINC Database

[Figure: MW distribution of the ZINC database clustered at a 70% Tanimoto cutoff]

ZINC is a popular database of purchasable compounds, widely used for screening for new compounds or as a representation of chemical space. I was recently using it to build some models, and since it contains a lot of molecules (~20M), I opted for the clustered datasets that ZINC provides. Unfortunately, these clustered sets only contain molecules below 370 Da. This may be caused by the way the clustering is performed, that is, by sorting the molecules by increasing molecular weight and selecting as representatives those that differ from the previously accepted ones by the Tanimoto cutoff (60% to 90%). As the images show, this clustering prioritizes small molecules, but maybe a bit too much?
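To make the suspected bias concrete, here is a minimal sketch of the selection scheme as described above: sort by molecular weight, then accept a molecule only if it is below the Tanimoto cutoff against every representative already accepted. This is not ZINC's actual code. The fingerprints are plain sets of integers standing in for real fingerprint bits, and the names `tanimoto` and `select_representatives` as well as the toy molecules are made up for illustration.

```python
from typing import FrozenSet, List, Tuple

# (name, molecular weight in Da, fingerprint bits); all values are toy data
Molecule = Tuple[str, float, FrozenSet[int]]


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_representatives(molecules: List[Molecule], cutoff: float) -> List[Molecule]:
    """Greedy selection: lightest molecules are considered first."""
    representatives: List[Molecule] = []
    for mol in sorted(molecules, key=lambda m: m[1]):
        # Accept only if the molecule is less similar than the cutoff
        # to every representative accepted so far.
        if all(tanimoto(mol[2], rep[2]) < cutoff for rep in representatives):
            representatives.append(mol)
    return representatives


toy = [
    ("small", 150.0, frozenset({1, 2, 3, 4})),
    ("medium", 300.0, frozenset({1, 2, 3, 4, 5})),
    ("large", 450.0, frozenset({1, 2, 3, 4, 5, 6})),
    ("distinct", 500.0, frozenset({10, 11, 12})),
]
picked = [name for name, _, _ in select_representatives(toy, cutoff=0.6)]
print(picked)  # ['small', 'distinct']
```

In this toy case the heavier analogues of "small" never become representatives, because the lightest member of a similar group is always seen first. A heavy molecule only survives if nothing lighter resembles it.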

If we have a look at the molecular weight of molecules from DrugBank, we can see that approved drugs populate the molecular weight range from 100 to 700 Da. It seems that if you are trying to use the clustered sets from ZINC you might be missing an important part of the chemical space.

[Figure: MW distribution in the DrugBank database]

What I ended up doing was clustering the whole ZINC database myself, but clustering 20M molecules at once was a bit too demanding for any clustering method. I used Pipeline Pilot to randomly sample ZINC into 10 smaller datasets. Next, I clustered them using ECFP_4 fingerprints and 60% Maximum Dissimilarity (that is, clusters contain molecules that are 40% similar or more). The last step was to build the final dataset using the cluster centers as representatives. By doing so, I managed to produce a reduced dataset, much easier to handle, that still keeps the chemical diversity of the large one.
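The split, cluster, and pool workflow can be sketched in plain Python. This is only an illustration under loud assumptions: a simple leader-style clustering stands in for Pipeline Pilot's maximum dissimilarity clustering (which is not reproduced here), sets of integers stand in for ECFP_4 fingerprints, and every function name is hypothetical.

```python
import random
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_random(items: Sequence[Fingerprint], n_parts: int, seed: int = 0) -> List[List[Fingerprint]]:
    """Shuffle the items and deal them into n_parts roughly equal subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def leader_cluster_centers(fps: Sequence[Fingerprint], min_similarity: float) -> List[Fingerprint]:
    """Stand-in clustering: a fingerprint starts a new cluster only if it is
    less similar than min_similarity to every existing cluster center."""
    centers: List[Fingerprint] = []
    for fp in fps:
        if all(tanimoto(fp, c) < min_similarity for c in centers):
            centers.append(fp)
    return centers


def reduce_dataset(fps: Sequence[Fingerprint], n_parts: int = 10,
                   min_similarity: float = 0.4, seed: int = 0) -> List[Fingerprint]:
    """Cluster each random subset separately and pool the cluster centers."""
    reduced: List[Fingerprint] = []
    for part in split_random(fps, n_parts, seed):
        reduced.extend(leader_cluster_centers(part, min_similarity))
    return reduced


# Toy data: 5 unrelated "families" of 20 near-identical fingerprints each.
toy_fps = [
    frozenset(range(f * 100, f * 100 + 10)) | {f * 100 + 10 + k}
    for f in range(5)
    for k in range(20)
]
reduced = reduce_dataset(toy_fps)
print(len(toy_fps), "->", len(reduced))
```

Because each subset is clustered independently, similar molecules that land in different subsets can both end up as centers, so the pooled set is not as tight as a single global clustering would be. That is the price of making the problem tractable.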

[Figure: MW distribution of the ZINC database after clustering with Pipeline Pilot at 60% dissimilarity]

Remember to always check the datasets you download from public databases. Despite the great work behind them, they sometimes include a feature that you would not want and that could spoil your experiments and waste your time.
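A quick sanity check of this kind takes only a few lines. The sketch below assumes you have already extracted the molecular weights into a list of floats. The helper names and the toy values are hypothetical, and the 370 Da threshold simply mirrors the limit observed above.

```python
from collections import Counter
from typing import Dict, Iterable


def mw_histogram(weights: Iterable[float], bin_width: float = 50.0) -> Dict[float, int]:
    """Count molecules per molecular weight bin, keyed by the bin's lower edge."""
    counts = Counter(int(w // bin_width) for w in weights)
    return {b * bin_width: counts[b] for b in sorted(counts)}


def fraction_above(weights: Iterable[float], threshold: float) -> float:
    """Fraction of molecules heavier than the threshold (0.0 for an empty set)."""
    weights = list(weights)
    if not weights:
        return 0.0
    return sum(w > threshold for w in weights) / len(weights)


# Toy weights in Da, chosen to mimic a set that stops short of 370 Da.
toy_weights = [120.5, 180.2, 250.0, 310.7, 365.9, 369.0]
print(mw_histogram(toy_weights))
print(fraction_above(toy_weights, 370.0))
```

An empty tail where you expected molecules, as in the toy data here, is the kind of feature worth spotting before building models on the set.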

  • J Overington

    Interesting post and analysis, but I think there are pretty good reasons to focus on smaller molecules. 1) Ligand efficiency. 2) Oral drugs tend to be around 320 MWt, and there are some outliers at higher MWt, but they are often idiosyncratic wrt their ADMET properties. 3) Frequent hitters/Promiscuous/Non-specific compounds have higher MWt on average. 4) During optimisation, invariably MWt/logP is increased, tending to make things worse. 5) My feeling is that docking (as one method of Virtual Screening) should work better for small molecules (due to sampling/conformational space issues).

    Also, why do you think there is the spike in your sampling at around 150 MWt? (of course it may not be statistically significant, but you have processed a lot of compounds).

    Finally, I think that the GDB databases are really cool for potentially analysing some of these chemical space issues.
