ZINC is a popular database of purchasable compounds, widely used for screening for new compounds or as a representation of chemical space. I was recently using it to build some models, and since it contains quite a lot of molecules (~20M) I opted for the clustered datasets that ZINC provides. Unfortunately, these clustered sets only contain molecules below 370 Da. This may be caused by the way the clustering is performed, that is, by sorting the molecules by increasing molecular weight and selecting as representatives those that differ from the previously accepted ones by the Tanimoto cutoff (60% to 90%). As the images show, this clustering prioritizes small molecules, but maybe a bit too much?
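To see why a weight-sorted selection favors light molecules, here is a minimal sketch of that kind of procedure. It is not ZINC's actual code: the function names, the toy molecules, and the representation of fingerprints as sets of on-bits are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical record: (name, molecular weight in Da, fingerprint on-bits)
Molecule = Tuple[str, float, FrozenSet[int]]


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_representatives(molecules: List[Molecule], cutoff: float = 0.6) -> List[Molecule]:
    """Walk the molecules by increasing weight and keep those that are
    less similar than `cutoff` to every representative accepted so far."""
    representatives: List[Molecule] = []
    for mol in sorted(molecules, key=lambda m: m[1]):
        if all(tanimoto(mol[2], rep[2]) < cutoff for rep in representatives):
            representatives.append(mol)
    return representatives


# Toy data (made up): two close analogs and one unrelated scaffold.
toy = [
    ("heavy_analog", 480.0, frozenset({1, 2, 3, 4, 5})),
    ("light_analog", 310.0, frozenset({1, 2, 3, 4})),
    ("other_scaffold", 420.0, frozenset({7, 8, 9})),
]
reps = pick_representatives(toy, cutoff=0.6)
print([name for name, _, _ in reps])  # ['light_analog', 'other_scaffold']
```

In this toy case the heavier analog is discarded because its lighter neighbor was seen first, so within every group of similar molecules the lightest one ends up as the representative.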
If we have a look at the molecular weight of molecules from DrugBank, we can see that approved drugs populate the molecular weight range from 100 to 700 Da. It seems that if you are trying to use the clustered sets from ZINC you might be missing an important part of the chemical space.
What I ended up doing was clustering the whole ZINC database myself, but clustering 20M molecules at once was a bit too demanding for any clustering method. I used Pipeline Pilot to randomly sample ZINC and generate 10 smaller datasets. Next, I clustered them using ECFP_4 fingerprints and 60% maximum dissimilarity (that is, clusters contain molecules that are 40% similar or more). The last step is to build the final dataset using the cluster centers as representatives. By doing so, I managed to produce a reduced dataset, much easier to handle, that still keeps the chemical diversity of the large one.
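For readers without Pipeline Pilot, the workflow above can be approximated in plain Python. The sketch below is not the Pipeline Pilot component: it swaps in a simple leader-style clustering with the same 0.6 maximum-distance threshold, and it assumes the fingerprints (for example ECFP_4) have already been computed and stored as sets of on-bits. All names and the toy data are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List

Fingerprint = FrozenSet[int]  # assumed: precomputed on-bits, e.g. from ECFP_4


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """1 - Tanimoto similarity; two empty fingerprints count as identical."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def random_subsets(ids: List[str], n_subsets: int, seed: int = 0) -> List[List[str]]:
    """Shuffle the identifiers and deal them into n_subsets groups."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def cluster_centers(ids: List[str], fps: Dict[str, Fingerprint],
                    max_dissimilarity: float = 0.6) -> List[str]:
    """Leader-style clustering: a molecule starts a new cluster only if it is
    farther than max_dissimilarity from every existing center."""
    centers: List[str] = []
    for mol_id in ids:
        if all(tanimoto_distance(fps[mol_id], fps[c]) > max_dissimilarity for c in centers):
            centers.append(mol_id)
    return centers


def reduce_library(fps: Dict[str, Fingerprint], n_subsets: int = 10,
                   max_dissimilarity: float = 0.6, seed: int = 0) -> List[str]:
    """Cluster each random subset separately and pool the cluster centers."""
    reduced: List[str] = []
    for subset in random_subsets(sorted(fps), n_subsets, seed):
        reduced.extend(cluster_centers(subset, fps, max_dissimilarity))
    return reduced


# Toy fingerprints (made up): two pairs of similar molecules and one singleton.
toy_fps = {
    "a": frozenset({1, 2, 3, 4}), "b": frozenset({1, 2, 3, 5}),
    "c": frozenset({10, 11, 12}), "d": frozenset({10, 11, 13}),
    "e": frozenset({20, 21}),
}
print(len(reduce_library(toy_fps, n_subsets=1)))  # 3
```

Because each subset is clustered on its own, centers coming from different subsets can still be similar to each other, so the pooled set is a reduction rather than a strict clustering of the whole library.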
Remember to always check the datasets you download from public databases. Despite the great work behind them, they sometimes include a feature that you would not want and that could spoil your experiments and waste your time.