
Poster for the 9th International Conference on Chemical Structures

pdf: Poster Julio E. Peironcely for the ICCS2011

png: Interested in becoming a Scientist 2.0? Then visit my blog

  • Tobias Kind

    Hello Julio,
    nice poster and hopefully a successful attack on the overall problem!

    Some general comments, the known public and commercial implementations
    were surely successful for NMR (StrucEluc from ACD and Ensemble from Pretsch)
    and of course SMOG and MOLGEN as general purpose structure generators.
    I exclude the others because they were not commercially or
    publicly available or did not have practical interfaces: formula in – SDF out.

    Unfortunately even for the other isomer generators the real issue
    when coupled to mass spectrometry or (GC, LC) was always the mass-spec
    part and the final filtering part to exclude the high number of false
    positives. That means that, coupled to NMR with an exact number of hybridizations
    and an exact number of substructures, it is not that difficult. But coupled to
    mass spectrometry using EI or ESI spectra or MS/MS data in the past the
    whole setup was not very successful and the number of false positive results
    was usually too high, either because multiple formulae were possible or no
    correct substructure could be determined. That can change with data from
    the new high-resolution hybrid mass spectrometers.

    Using compounds only from the space of natural products would be one
    possible way (but there are lots of weird sub-structures anyway) and
    of course an orthogonal GC or LC retention filter. But with the recent
    interest in the topic and combined efforts that whole problem surely could
    be tackled.

    The final litmus test for this program will be
    a) comparison against MOLGEN in terms of isomer numbers
    based on a large set of diverse molecular formulas that
    cover all hybridization states of CHNSOP.

    b) validation that the same structures are generated
    using a hashkey such as INCHIKEY, SIGNATURE or any other canonical
    descriptor (compared to MOLGEN or SMOG).

    c) speed, I think speed is a major issue, 10-fold slower
    than MOLGEN (gold standard) is probably OK, but 100-1000 fold
    slower will be a major issue.

    For example how long does it take to count/generate all
    988,838,502 C7H8O9 isomers? How long does it take to
    generate all C20 isomers or other small molecules
    such as 113,511,827 C5H6O14?

    If it's less than 5 minutes, perfect; if it's an hour, not acceptable,
    because these are the very small molecules. See failed examples
    from the old deterministic CDK generator (deprecated) via GOOGLE search:
    GENMDeterministicGenerator fails on 30% of mol formulas – ID: 1743861

    But there is a way out: parallelization. I am not sure what the current
    bottleneck is (I guess the nauty isomorphism check, or the
    DFS or backtracking), but there will be a level where the problem
    can be split and distributed to n CPUs. That will be very(!)
    interesting because potentially the code can be partially exported
    to an AMD or NVIDIA streaming processor via CUDA or OpenCL.

    d) memory and I/O issues: if the algorithm is inefficient it will
    eat a lot of memory and fail if no memory is available.
    I consider 32 Gbyte ($320 DDR3 RAM) as a current borderline and
    I/O can be tackled with SSDs.

    Some errors on your poster:
    C7H8O3S1-2 has 10,203,389 isomers (plus isomorphs, plus aromatic doublettes)
    C7H8O3S1-6 has 110,449,674 isomers (plus isomorphs, plus aromatic doublettes)

    The sulfate structure is not included in the S1-2 (sulfur valence=2)
    but only in the C7H8O3S1-6 case. The problem is that some elements
    can occur with mixed valencies in one molecule: sulfur
    (thiols, sulfoxides and sulfones) and phosphorus
    (phosphines and phosphonates) (see Seven Golden Rules).
    In order to tackle that problem all possible combinations have to
    be considered, unless a hybridization state or substructure is given.

    You can use MOLGEN demo for C7H8O3S1-6 and with the good substructure
    of benzene and one -OH it will only generate 343 isomers with the correct
    solution among the first six visible. That is even less than the 948 solutions
    from your poster. The MOLGEN editor crashes under XP, but with benzene and
    sulfate the solution will be obtained in seconds I guess and only a handful of
    structures survive after the isomorphism and aromaticity checks.

    Anyway I like the approach and once it is correctly validated,
    all the fine tuning can start and all the other modules can be “patched”
    together. Very exciting.

    Tobias Kind
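The validation in point (b) above can be scripted once each generator's output has been reduced to one canonical key per structure. The following is only a minimal sketch under that assumption: `compare_key_sets` is a hypothetical helper name, and the short strings are toy placeholders standing in for real InChIKeys, signatures, or nauty canonical codes.

```python
from collections import Counter


def compare_key_sets(reference_keys, candidate_keys):
    """Compare two collections of canonical structure keys.

    reference_keys: keys from the trusted generator (e.g. MOLGEN output).
    candidate_keys: keys from the generator under test.
    """
    ref = Counter(reference_keys)
    cand = Counter(candidate_keys)
    return {
        # structures the candidate failed to generate
        "missing": sorted(set(ref) - set(cand)),
        # structures only the candidate generated
        "extra": sorted(set(cand) - set(ref)),
        # keys the candidate emitted more than once
        "duplicates": sorted(k for k, n in cand.items() if n > 1),
        # equal isomer counts alone do not prove equal sets
        "same_count": len(set(ref)) == len(set(cand)),
    }


# Toy keys, not real InChIKeys.
report = compare_key_sets(["AAA", "BBB", "CCC"], ["BBB", "CCC", "CCC", "DDD"])
print(report)
```

In the toy run the two sets have the same size, yet one structure is missing and one is extra, which is why comparing isomer counts (point a) and comparing keys (point b) are separate tests.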

  • Julio E. Peironcely

    Awesome feedback Tobias!!

    Concerning how to test the program:
    a) I obtain the same number of results as MOLGEN when aromaticity is disabled. My code does not include aromaticity detection; MOLGEN does.
    b) I do not produce duplicates, although I did not perform thorough tests on this. I need to check with InChIs and with the canonical code I generate with Nauty.
    c) I am definitely slower than MOLGEN, maybe from using Java and CDK or from having to call Nauty to canonize. The way the algorithm is designed allows parallelization, although implementing it is a task for later on.
    d) The algorithm does not use much RAM; I haven’t seen it go above 300 MB.

    About the errors:
    1) Hybridizations of the different atom types are defined by the atom dictionary of CDK, which is not perfect. Not all configurations for N, S, and P are included and therefore I can only generate what CDK allows me. Obviously I can fix these issues in CDK, but it is not a trivial task. Besides, I am wondering if this will be a major issue for having the proof of concept of this tool published in a journal, what do you think?
    2) I will have a look.

    Cheers and if you have any more feedback let me know.
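The effect of the sulfur valence on the search space can be seen from simple arithmetic. As a rough sketch: the number of rings plus double bonds for a formula is 1 + Σ nᵢ(vᵢ − 2)/2, where nᵢ is the count and vᵢ the assumed valence of each element. The function name and the valence table below are illustrative assumptions (sulfur valences 2, 4 and 6 chosen to match the thiol, sulfoxide and sulfone cases mentioned above), not the CDK atom type dictionary.

```python
def ring_double_bond_equivalents(counts, valences):
    """Rings plus double bonds for a formula under a fixed valence per element.

    counts:   element symbol -> number of atoms, e.g. {"C": 7, "H": 8}
    valences: element symbol -> assumed valence for every atom of that element
    """
    total = sum(n * (valences[el] - 2) for el, n in counts.items())
    if total % 2:
        # an odd total valence cannot be satisfied by whole bonds
        raise ValueError("odd valence sum: no closed-shell structure")
    return total // 2 + 1


formula = {"C": 7, "H": 8, "O": 3, "S": 1}  # C7H8O3S from the poster

# Assumed sulfur valences: 2 (thiol-like), 4 (sulfoxide-like), 6 (sulfone-like).
rdbe_by_sulfur_valence = {
    v: ring_double_bond_equivalents(formula, {"C": 4, "H": 1, "O": 2, "S": v})
    for v in (2, 4, 6)
}
print(rdbe_by_sulfur_valence)  # {2: 4, 4: 5, 6: 6}
```

Each step up in sulfur valence adds one unsaturation to distribute, which is consistent with the S1-6 case having far more isomers than the S1-2 case. With more than one S or P atom, every combination of per-atom valences would need the same treatment.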

  • Egon Willighagen

    Hi Julio,

    “Hybridizations of the different atom types are defined by the atom dictionary of CDK, which is not perfect. Not all configurations for N, S, and P are included and therefore I can only generate what CDK allows me.”

    That’s something we can work on. I’ll probably be in NL for a week or two this July, and maybe we should meet up and discuss those issues.

    Otherwise, Gilleain has a student working on adding 90 atom types, so there is light at the end of the tunnel.

    Where can people download your software? Tobias’ comment nicely shows the advantages of ‘release soon, release often’. I’m personally not afraid that someone would run off with your tool, particularly not now that it’s on the record via your ICCS poster and talk.

    Like Tobias, I think you should report a performance comparison.


  • Gilleain Torrance

    Hi Julio,

    Fantastic news! I’ve been reading your poster (which is, btw, quite difficult on a computer: you have to zoom in and out to get the right bits).

    I read about CPA, and sortof kindof understood it. I think. However, I never implemented it. When you say “an adaptation of the nauty canonizer” what do you mean from a technical perspective?

    Good work, in any case.


  • Julio E. Peironcely

    Hi Gilleain,

    Nauty is written in C, I call it from Java using JNI. Maybe there is a more elegant way of doing it or even better, a good canonizer implemented in Java or CDK.


  • Gilleain Torrance

    I wish! I have a canonizer in Java, except that to call it ‘good’ would be generous. Some recent improvements mean that it now generates better automorphism graphs for vertex/edge colored graphs – as in, element symbols and bond orders.

    A few weekends ago I was testing it against nauty, but calling that with Runtime.exec, which doesn’t work so well.

    My problem has always been the lack of a good canonical checking method in Java. I can canonically label molecules just fine (with signatures), but checking a molecule for canonicity I cannot do.
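The difference between canonical labelling and a canonicity check can be shown on a toy scale. The sketch below is a brute-force illustration only, not nauty's algorithm and not the signature method: it tries every vertex permutation, so it is usable for a handful of atoms at most. It treats bond orders as matrix entries and ignores element symbols; all function names are made up for this example.

```python
from itertools import permutations


def code(adj, order):
    """Upper-triangle entries of adj after relabelling vertices by order."""
    n = len(order)
    return tuple(adj[order[i]][order[j]] for i in range(n) for j in range(i + 1, n))


def canonical_code(adj):
    """Canonical labelling: the maximal code over all vertex orderings."""
    return max(code(adj, p) for p in permutations(range(len(adj))))


def is_canonical(adj):
    """Canonicity check: does the labelling as given already attain the maximum?"""
    return code(adj, tuple(range(len(adj)))) == canonical_code(adj)


# The same 3-atom chain with two different labellings.
chain_end_first = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]     # middle atom is vertex 1
chain_centre_first = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # middle atom is vertex 0

same_graph = canonical_code(chain_end_first) == canonical_code(chain_centre_first)
print(same_graph, is_canonical(chain_end_first), is_canonical(chain_centre_first))
```

Both labellings yield the same canonical code, so duplicates can be detected, but only the second labelling is itself canonical. An orderly generator would keep the second and discard the first without storing previously seen structures.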


  • Tobias Kind

    How about compiling nauty into Java bytecode via NestedVM?

    Similar to that:

    Also the old ESESOC implementation GENMDeterministicGenerator in the CDK worked, just not for all cases :-) And it was only 2000 lines of JAVA code…

    See the related publication “Principles for structure generation of organic isomers from molecular formula”

    Markus Meringer wrote a nice chapter in Handbook of Chemoinformatics Algorithms, see “Structure Enumeration and Sampling”.

    But maybe abstraction is not what is needed for the CDK, but rather a real implementation. Just how? With all the experts combined, it must be doable…

