April 2010

Getting Real

I have finished reading the book Getting Real by Jason Fried and David Heinemeier Hansson, the guys from 37signals. This book is just awesome, really, just get it. It proposes ideas on how to create web apps and make them viable: fewer features, less code, fewer employees, and less documentation, but more satisfaction as an entrepreneur. All these ideas worked nicely for them when they were developing Basecamp and their other apps. These suggestions do not need to apply to every endeavor in your business life, so give them a try and use what suits you best.

Highly recommended.

Metabolomics Conference 2010

Every year there is a Metabolomics Conference organized by the Metabolomics Society, and this year it will be held in Amsterdam. Lots of cool people in the field will be attending, and so will I. If anyone is planning to come to the conference and would like to invite me for a beer, please leave a comment. Inquiries about work, possible collaborations, business and so on are also welcome, but do not forget about the beer, or that the real conference begins after the presentations are over.

Here is the event on LinkedIn, in case you want to see if somebody from your network is coming to Amsterdam.

Freelancing and Startups in Chemoinformatics

In the Internet world, it seems that if you are not an entrepreneur or you do not develop cool Web 2.0 apps, you are a nobody. Lots of people make a living from and inside the Internet: freelance designers, web developers, full-time bloggers, or even ebook writers. You have the people from 37signals, or Seth Godin, or Colin Wright (the latest addition to my RSS feed).

But what is happening with science, researchers, PhDs, etc.? Where are the startups in chemoinformatics and the guru bloggers? You have Metamolecular, from blogger and entrepreneur Rich Apodaca, and freelancer John Van Drie. Egon should also be mentioned as a hyperactive blogger and open source developer, as well as Rajarshi. Another example of an alternative career after a PhD in chemoinformatics could be Jeroen Kazius, who extended his substructure mining tools and created a company called Curios-IT in order to keep working in science in a way that fitted his lifestyle.

It might seem that scientists are on the right track for the new Internet era. But I do not think so. If you google for “chemoinformatics or bioinformatics freelance” you do not get many results. The same happens for searches related to startups in these fields. Many factors might contribute to this reality:

  • Universities and companies are not used to externalizing and outsourcing part of their research.
  • There is quite some secrecy about projects, due to IP protection.
  • Prejudice: a freelancer might be seen as somebody who did not manage to get a career in academia.
  • Chemo/bioinformaticians, like biostatisticians, often depend on data generated by other people. If you want to do research with a certain impact, you need real-life experimental data, unless you are happy providing yet another mashup of fingerprints for QSAR.
  • Archaic and rigid systems are still applied in science, such as the “publish or perish” approach to validating and promoting your research, the set path of PhD -> postdoc (1, 2, 3…) -> tenure track -> assistant professor, and the fact that choosing between industry and academia is a choice for life.

In an ideal world, people like me would feel free to experiment, not only with natural phenomena or hypotheses, but also with their own careers, ideas, and personal challenges.

Molecular Weight Gap in ZINC Database

ZINC database clustering 70% Tanimoto Cutoff

MW distribution ZINC database 70% Tanimoto Cutoff

ZINC is a popular database of purchasable compounds, widely used for screening for new compounds or as a representation of chemical space. I was recently using it for building some models, and since it contains quite a lot of molecules (~20M), I opted for the clustered datasets ZINC provides. Unfortunately, these clustered sets only contain molecules below 370 Da. This may be caused by the way the clustering is performed, that is, by sorting the molecules by increasing molecular weight and selecting as representatives those that differ from the previously accepted ones by the Tanimoto cutoff (60% to 90%). As the images show, this clustering prioritizes small molecules, but maybe a bit too much.
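To make the effect of the sorting step concrete, here is a minimal sketch of such a greedy selection. It is my reading of the procedure described above, not ZINC's actual code: I assume a molecule is accepted only if its Tanimoto similarity to every previously accepted representative is below the cutoff, and I use small sets of integers as toy stand-ins for real fingerprints. All names and values are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# (identifier, molecular weight in Da, fingerprint as a set of "on" bits)
Molecule = Tuple[str, float, FrozenSet[int]]


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as bit sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_representatives(molecules: List[Molecule], cutoff: float) -> List[Molecule]:
    """Greedy selection: visit molecules by increasing molecular weight and
    keep one only if it is less similar than `cutoff` to every kept molecule.
    This acceptance rule is an assumption, not ZINC's documented method."""
    representatives: List[Molecule] = []
    for mol in sorted(molecules, key=lambda m: m[1]):
        if all(tanimoto(mol[2], rep[2]) < cutoff for rep in representatives):
            representatives.append(mol)
    return representatives


# Toy data: a heavy analog that is 80% similar to a lighter parent.
toy = [
    ("big_analog", 450.0, frozenset({1, 2, 3, 4, 5})),
    ("small_parent", 300.0, frozenset({1, 2, 3, 4})),
    ("unrelated", 380.0, frozenset({7, 8, 9})),
]
picked = select_representatives(toy, cutoff=0.7)
picked_ids = [mol[0] for mol in picked]
```

In this toy case the lighter parent is visited first and becomes the representative, so the heavier analog is discarded. Whenever a heavy molecule has a lighter neighbor within the cutoff, the lighter one wins, which is consistent with the shift towards low molecular weight seen in the plots.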

If we have a look at the molecular weight of molecules from DrugBank, we can see that approved drugs populate the molecular weight range from 100 to 700 Da. It seems that if you are trying to use the clustered sets from ZINC you might be missing an important part of the chemical space.

Molecular Weight Distribution in DrugBank database

MW distribution in DrugBank

What I ended up doing was clustering the whole ZINC database myself, but clustering 20M molecules in one go was a bit too demanding for any clustering method. I used Pipeline Pilot to randomly sample ZINC and generate 10 smaller datasets. Next, I clustered each of them using ECFP_4 fingerprints and 60% Maximum Dissimilarity (that is, clusters contain molecules that are 40% similar or more). The last step was to build the final dataset using the cluster centers as representatives. By doing so, I managed to produce a reduced dataset, much easier to handle, that still keeps the chemical diversity of the large one.
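For readers without Pipeline Pilot, the workflow can be sketched roughly as follows. This is not the Pipeline Pilot component: I swap in a simple leader-style clustering as a stand-in, use sets of integers instead of ECFP_4 fingerprints, and every function name here is hypothetical. Only the overall shape (random subsets, cluster each one, pool the centers) follows what I did.

```python
import random
from typing import Dict, FrozenSet, List

Fingerprints = Dict[str, FrozenSet[int]]  # molecule id -> set of "on" bits


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - Tanimoto similarity, for fingerprints stored as bit sets."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def random_subsets(ids: List[str], n_subsets: int, seed: int) -> List[List[str]]:
    """Shuffle the ids and deal them into n_subsets roughly equal groups."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def cluster_centers(ids: List[str], fps: Fingerprints, max_distance: float) -> List[str]:
    """Leader-style stand-in for the real clustering: a molecule starts a new
    cluster only if it is farther than max_distance from every center so far."""
    centers: List[str] = []
    for mol_id in ids:
        if all(tanimoto_distance(fps[mol_id], fps[c]) > max_distance for c in centers):
            centers.append(mol_id)
    return centers


def reduce_dataset(fps: Fingerprints, n_subsets: int = 10,
                   max_distance: float = 0.6, seed: int = 0) -> List[str]:
    """Cluster each random subset separately and pool the cluster centers."""
    reduced: List[str] = []
    for subset in random_subsets(sorted(fps), n_subsets, seed):
        reduced.extend(cluster_centers(subset, fps, max_distance))
    return reduced


# Toy data: two families of close analogs that share no bits with each other.
toy_fps: Fingerprints = {
    "a1": frozenset({1, 2, 3, 4}),
    "a2": frozenset({1, 2, 3, 5}),
    "b1": frozenset({10, 11, 12}),
    "b2": frozenset({10, 11, 13}),
}
centers_single_pass = reduce_dataset(toy_fps, n_subsets=1)
```

Because each subset is clustered on its own, centers coming from different subsets can still be close to each other. If that matters, a second clustering pass over the pooled centers would be one way to remove them.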

MW Distribution of ZINC database after Clustering with Pipeline Pilot

MW Distribution after Clustering 60% dissimilarity

Remember to always check the datasets you download from public databases. Despite the great work behind them, they sometimes include a feature that you would not like to have, one that could spoil your experiments and make you waste some time.
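A quick check of this kind does not need much code. The sketch below summarizes the molecular weights of a downloaded set and measures how much of a reference set lies above its heaviest molecule. The weights are made-up placeholder values, not numbers from ZINC or DrugBank, and the function names are my own.

```python
from statistics import quantiles
from typing import Dict, List


def mw_summary(weights: List[float]) -> Dict[str, float]:
    """Five-number summary of a list of molecular weights (in Da)."""
    ordered = sorted(weights)
    q1, median, q3 = quantiles(ordered, n=4)
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}


def fraction_above(weights: List[float], threshold: float) -> float:
    """Fraction of molecules heavier than the given threshold."""
    return sum(w > threshold for w in weights) / len(weights)


# Hypothetical placeholder values, for illustration only.
downloaded_mw = [180.2, 250.3, 310.4, 345.0, 368.9]
reference_mw = [151.2, 294.3, 420.5, 515.6, 698.0]

cap = mw_summary(downloaded_mw)["max"]
missing = fraction_above(reference_mw, cap)
print(f"Heaviest downloaded molecule: {cap} Da")
print(f"Share of reference set above that weight: {missing:.0%}")
```

If a large share of the reference set sits above the heaviest molecule in your download, that is a hint the dataset was filtered or clustered in a way you should understand before building models on it.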