pdf: Poster Julio E. Peironcely for the ICCS2011
png: Interested in becoming a Scientist 2.0? Then visit my blog
Nice poster, and hopefully a successful attack on the overall problem!
Some general comments, the known public and commercial implementations
were surely successful for NMR (StrucEluc from ACD and Ensemble from Pretsch)
and of course SMOG and MOLGEN as general purpose structure generators.
I exclude the others because they were not commercially or
publicly available or did not have practical interfaces: formula in – SDF out.
Unfortunately, even for the other isomer generators the real issue
when coupled to mass spectrometry (or GC, LC) was always the mass-spec
part and the final filtering step needed to exclude the high number of false
positives. That means that, coupled to NMR with an exact number of hybridizations
and an exact number of substructures, the problem is not that difficult. But coupled to
mass spectrometry using EI or ESI spectra or MS/MS data, the
whole setup was not very successful in the past, and the number of false positive results
was usually too high, either because multiple formulae were possible or because no
correct substructure could be determined. That can change with data from
the new high-resolution hybrid mass spectrometers.
Using compounds only from the space of natural products would be one
possible way (but there are lots of weird sub-structures anyway) and
of course an orthogonal GC or LC retention filter. But with the recent
interest in the topic and combined efforts that whole problem surely could
The final litmus test for this program will be
a) comparison against MOLGEN in terms of isomer numbers
based on a large set of diverse molecular formulas that
cover all hybridization states of CHNSOP.
b) validation that the same structures are generated
using a hashkey such as INCHIKEY, SIGNATURE or any other canonical
descriptor (compared to MOLGEN or SMOG).
c) speed, I think speed is a major issue, 10-fold slower
than MOLGEN (gold standard) is probably OK, but 100-1000 fold
slower will be a major issue.
For example how long does it take to count/generate all
988,838,502 C7H8O9 isomers? How long does it take to
generate all C20 isomers or other small molecules
such as 113,511,827 C5H6O14?
If it's less than 5 minutes, perfect; if it's an hour, not acceptable,
because these are the very small molecules. See failed examples
from the old deterministic CDK generator (deprecated) via GOOGLE search:
GENMDeterministicGenerator fails on 30% of mol formulas – ID: 1743861
But there is a way out: parallelization. I am not sure what the current
bottleneck is (I guess the nauty isomorphism check, or the
DFS or backtracking), but there will be a level where the problem
can be split and distributed to n CPUs. That will be very(!)
interesting because potentially the code can be partially exported
to an AMD or NVIDIA streaming processor via CUDA or OpenCL.
d) memory and I/O issues: if the algorithm is inefficient it will
eat a lot of memory and fail when no memory is available.
I consider 32 GByte ($320 DDR3 RAM) as a current borderline, and
I/O can be tackled with SSDs.
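The validation in a) and b) boils down to comparing two lists of canonical keys, one per generator. Here is a minimal sketch, assuming each program's output has already been converted to one key string per structure (InChIKey, signature, or similar). The function name `compare_key_sets` and the placeholder keys are made up for illustration; no real generator output is shown.

```python
from collections import Counter


def compare_key_sets(keys_a, keys_b):
    """Compare two collections of canonical structure keys (e.g. InChIKeys).

    Reports the number of unique structures per generator, keys that
    appear more than once (duplicates), and keys found in only one set.
    """
    count_a, count_b = Counter(keys_a), Counter(keys_b)
    return {
        "n_unique_a": len(count_a),
        "n_unique_b": len(count_b),
        "duplicates_in_a": sorted(k for k, n in count_a.items() if n > 1),
        "duplicates_in_b": sorted(k for k, n in count_b.items() if n > 1),
        "only_in_a": sorted(set(count_a) - set(count_b)),
        "only_in_b": sorted(set(count_b) - set(count_a)),
    }


# Placeholder keys, not real InChIKeys: generator A emitted K2 twice and
# missed K4; generator B missed K3.
report = compare_key_sets(["K1", "K2", "K2", "K3"], ["K1", "K2", "K4"])
for name, value in report.items():
    print(name, value)
```

If both generators agree, the two "only_in" lists and the two duplicate lists are empty and the unique counts match the isomer count.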
Some errors on your poster:
C7H8O3S1-2 has 10203389 isomers (plus isomorphs, plus aromatic doublettes)
C7H8O3S1-6 has 110449674 isomers (plus isomorphs, plus aromatic doublettes)
The sulfate structure is not included in the S1-2 case (sulfur valence=2)
but only in the C7H8O3S1-6 case. The problem is that there are many
molecules with mixed valencies within one molecule: for sulfur
(thiols, sulfoxides and sulfones) and for phosphorus
(phosphines and phosphonates) (see Seven Golden Rules).
In order to tackle that problem all possible combinations have to
be considered, unless a hybridization state or substructure is given.
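To make "all possible combinations" concrete, here is a small sketch that lists every valence assignment a generator would have to try for the multivalent atoms in a formula. The valence sets in `ALLOWED_VALENCES` are assumptions for illustration (2/4/6 for sulfur, 3/5 for phosphorus); they are not taken from MOLGEN or from the CDK atom dictionary.

```python
from itertools import combinations_with_replacement, product

# Assumed valence sets, for illustration only.
ALLOWED_VALENCES = {"S": (2, 4, 6), "P": (3, 5)}


def valence_assignments(atom_counts):
    """Yield one dict per valence combination: element -> tuple of valences.

    Atoms of the same element are interchangeable at this stage, so each
    element contributes multisets (combinations with replacement), not
    ordered tuples.
    """
    elements = [e for e in atom_counts if e in ALLOWED_VALENCES]
    per_element = [
        list(combinations_with_replacement(ALLOWED_VALENCES[e], atom_counts[e]))
        for e in elements
    ]
    for combo in product(*per_element):
        yield dict(zip(elements, combo))


# One sulfur atom: three separate generator runs (valence 2, 4 or 6).
one_s = list(valence_assignments({"C": 7, "H": 8, "O": 3, "S": 1}))
# Two sulfur atoms and one phosphorus atom: mixed cases such as (2, 6) appear.
mixed = list(valence_assignments({"S": 2, "P": 1}))
print(one_s)
print(len(mixed))
```

The number of runs grows quickly with the number of multivalent atoms, which is why a given hybridization state or substructure helps so much.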
You can use the MOLGEN demo for C7H8O3S1-6: with the good substructure
of benzene and one -OH it will generate only 343 isomers, with the correct
solution among the first six visible. That is even fewer than the 948 solutions
from your poster. The MOLGEN editor crashes under XP, but with benzene and
sulfate I guess the solution will be obtained in seconds, and only a handful of
structures will survive the isomorphism and aromaticity checks.
Anyway I like the approach and once it is correctly validated,
all the fine tuning can start and all the other modules can be “patched”
together. Very exciting.
Awesome feedback Tobias!!
Concerning how to test the program:
a) I obtain the same number of results as MOLGEN when aromaticity is disabled. My code does not include aromaticity detection, MOLGEN does.
b) I do not produce duplicates, although I did not perform thorough tests on this. I need to check with InChIs and with the canonical code I generate with nauty.
c) I am definitely slower than MOLGEN, maybe because of using Java and CDK, or because of having to call nauty to canonize. The way the algorithm is designed allows parallelization, although its implementation is a task for later on.
d) The algorithm does not use much RAM; I haven't seen it going above 300 MB.
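On the parallelization point in c): the idea of splitting a depth-first search at some level and handing each subtree to a separate worker can be sketched on a toy problem. The code below is not the poster's algorithm. It only counts labeled bond-order assignments that saturate a list of atom valences, with no connectivity or isomorphism check, and all function names are made up for illustration.

```python
from itertools import combinations

MAX_ORDER = 3  # assumption: bond orders 0 (no bond) up to triple


def partial_states(pairs, split_depth, remaining, start=0):
    """Yield leftover valences after each valid choice for the first split_depth pairs."""
    if start == split_depth:
        yield tuple(remaining)
        return
    i, j = pairs[start]
    for order in range(min(MAX_ORDER, remaining[i], remaining[j]) + 1):
        remaining[i] -= order
        remaining[j] -= order
        yield from partial_states(pairs, split_depth, remaining, start + 1)
        remaining[i] += order
        remaining[j] += order


def count_completions(pairs, start, remaining):
    """Depth-first count of assignments for pairs[start:] that use up every valence."""
    if start == len(pairs):
        return 1 if not any(remaining) else 0
    i, j = pairs[start]
    total = 0
    for order in range(min(MAX_ORDER, remaining[i], remaining[j]) + 1):
        remaining[i] -= order
        remaining[j] -= order
        total += count_completions(pairs, start + 1, remaining)
        remaining[i] += order
        remaining[j] += order
    return total


def count_subtree(task):
    """Worker: finish one independent subtree of the search."""
    pairs, start, state = task
    return count_completions(pairs, start, list(state))


def count_split(valences, split_depth, map_fn=map):
    """Split the search at split_depth and sum the subtree counts.

    map_fn defaults to the serial built-in map; a process pool's map
    can be passed instead to spread the subtrees over several CPUs.
    """
    pairs = list(combinations(range(len(valences)), 2))
    split_depth = min(split_depth, len(pairs))
    tasks = [(pairs, split_depth, state)
             for state in partial_states(pairs, split_depth, list(valences))]
    return sum(map_fn(count_subtree, tasks))


# Four monovalent atoms: the three ways to pair them up.
print(count_split([1, 1, 1, 1], split_depth=2))
```

Because the subtrees share no state, `map_fn` could be `multiprocessing.Pool().map` (called under an `if __name__ == "__main__":` guard). How well the work balances depends on the split depth, since some subtrees die early and others are large.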
About the errors:
1) Hybridizations of the different atom types are defined by the atom dictionary of CDK, which is not perfect. Not all configurations for N, S, and P are included and therefore I can only generate what CDK allows me. Obviously I can fix these issues in CDK, but it is not a trivial task. Besides, I am wondering if this will be a major issue for having the proof of concept of this tool published in a journal, what do you think?
2) I will have a look.
Cheers and if you have any more feedback let me know.
“Hybridizations of the different atom types are defined by the atom dictionary of CDK, which is not perfect. Not all configurations for N, S, and P are included and therefore I can only generate what CDK allows me.”
That’s something we can work on. I’ll probably be in NL for a week or two this July, and maybe we should meet up and discuss those issues.
Otherwise, Gillieain has a student working on adding 90 atom types, so there is light at the end of the tunnel.
Where can people download your software? Tobias' comment nicely shows the advantages of 'release soon, release often'. I'm personally not afraid that someone would run off with your tool, particularly not now that it's on the record via your ICCS poster and talk.
Like Tobias, I think you should report a performance comparison.
Fantastic news! I've been reading your poster (which is, btw, quite difficult on a computer: you have to zoom in and out to get the right bits).
I read about CPA, and sortof kindof understood it. I think. However, I never implemented it. When you say “an adaptation of the nauty canonizer” what do you mean from a technical perspective?
Good work, in any case.
Nauty is written in C, I call it from Java using JNI. Maybe there is a more elegant way of doing it or even better, a good canonizer implemented in Java or CDK.
I wish! I have a canonizer in java, except that to call it ‘good’ would be generous. Some recent improvements mean that it now generates better automorphism graphs for vertex/edge colored graphs – as in, element symbols and bond orders.
A few weekends ago I was testing it against nauty, but calling that with System.exec, which doesn’t work so well.
My problem has always been the lack of a good canonical checking method in java. I can canonically label molecules just fine (with signatures), but checking a molecule for canonicity I cannot do.
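For readers following along, the difference between canonical labeling and a canonicity check can be shown with a brute-force sketch on a tiny vertex/edge colored graph. This tries all n! relabelings, so it is only usable for a handful of atoms and is in no way how nauty works; the function names and the encoding are invented for illustration.

```python
from itertools import permutations


def encode(n, colors, edges, perm):
    """Encode the colored graph after relabeling vertex v as perm[v].

    colors[v] is the element symbol of vertex v; edges are (a, b, order).
    """
    new_colors = [None] * n
    for old, new in enumerate(perm):
        new_colors[new] = colors[old]
    new_edges = sorted(
        (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
        for a, b, order in edges
    )
    return tuple(new_colors), tuple(new_edges)


def canonical_form(n, colors, edges):
    """Canonical labeling by brute force: the smallest encoding over all relabelings."""
    return min(encode(n, colors, edges, p) for p in permutations(range(n)))


def is_canonical(n, colors, edges):
    """Canonicity check: is the labeling as given already the smallest one?"""
    identity = tuple(range(n))
    return encode(n, colors, edges, identity) == canonical_form(n, colors, edges)


# Heavy atoms of ethanol, numbered two different ways, and of dimethyl ether.
ethanol_1 = (3, ["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_2 = (3, ["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
ether = (3, ["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(canonical_form(*ethanol_1) == canonical_form(*ethanol_2))
print(canonical_form(*ethanol_1) == canonical_form(*ether))
print(is_canonical(*ethanol_1))
```

Both numberings of ethanol map to the same canonical form and the ether does not, yet `ethanol_1` as written is not itself the canonical labeling. A generator that rejects non-canonical intermediates needs the second kind of test, and needs it to be fast.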
How about compiling nauty into Java bytecode via NestedVM?
Similar to that:
Also, the old ESESOC implementation GENMDeterministicGenerator in the CDK worked, just not for all cases. And it was only 2000 lines of Java code…
See the related publication “Principles for structure generation of organic isomers from molecular formula”
Markus Meringer wrote a nice chapter in Handbook of Chemoinformatics Algorithms, see Structure Enumeration and Sampling
But maybe abstraction is not what is needed for the CDK, but rather a real implementation. The question is just how; with all the experts combined, it must be doable….