J. Peironcely (1,2,3), M. Rojas-Cherto (2,3), P. Kasper (2,3), L. Coulier (1,3), R. Vreeken (2,3), T. Reijmers (2,3), A. Bender (4), JL. Faulon (5), T. Hankemeier (2,3)
1 TNO Quality of Life, Zeist, The Netherlands
2 Leiden University, Leiden, The Netherlands
3 Netherlands Metabolomics Centre, Leiden, The Netherlands
4 University of Cambridge, Cambridge, United Kingdom
5 University of Evry, Evry, France
To provide detailed information on biological phenotypes on a chemical basis, metabolomics aims to profile all kinds of metabolites, a chemically diverse group of substrates and products involved in enzymatic pathways. Current analytical platforms used in metabolomics produce large amounts of complex data, which require chemoinformatics tools to process and transform them into meaningful information. For biological interpretation, identification of metabolites (elucidating the chemical structure of the metabolites of interest) is essential. New analytical platforms and better software tools are required to advance metabolite identification. Here we present a pipeline of software tools developed to facilitate the identification of metabolites measured with Liquid Chromatography-Mass Spectrometry (LC-MS).
High-resolution multi-stage MS spectra (MSn data) were acquired for metabolite standards listed in the Human Metabolome Database (HMDB). Because no existing tool captures all relevant information present in MSn data, a software tool integrating the Chemistry Development Kit (CDK) and XCMS was developed to preprocess the spectral data. The Multi-stage Elemental Formula (MEF) tool automatically resolves the elemental composition of the parent compound, the fragment ions, and the neutral losses. This process of elemental formula assignment and fitting also removes artifacts from the spectra. The resulting enriched MSn data of many metabolite standards are stored in XML format in an MSn database, which allows structural elucidation of unknown metabolites by comparing their MSn data with the MSn data in the database. The database also enables the characterization of substructures of an unknown compound by querying and matching subsets of the MSn data. A fingerprint-based similarity search for MSn data was developed to find the trees in the database that are most similar to experimentally acquired MSn data.
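The two ideas in this step, deriving a neutral loss from the elemental formulas of a parent and a fragment ion, and ranking database entries by fingerprint similarity, can be illustrated with a minimal Python sketch. It is not the MEF tool or the actual similarity search: the function names are hypothetical, the fingerprint is assumed here to be a plain set of fragment and neutral-loss formulas, and the Tanimoto coefficient is an assumed choice of similarity measure.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Parse a flat elemental formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def neutral_loss(parent, fragment):
    """Return the element counts lost when going from parent to fragment."""
    parent_counts = parse_formula(parent)
    fragment_counts = parse_formula(fragment)
    loss = {}
    for element in set(parent_counts) | set(fragment_counts):
        difference = parent_counts[element] - fragment_counts[element]
        if difference < 0:
            # A fragment cannot contain more atoms of an element than its parent.
            raise ValueError(f"{fragment} is not a sub-formula of {parent}")
        if difference > 0:
            loss[element] = difference
    return loss


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_trees(query, database):
    """Sort database entries (name -> feature set) by similarity to the query."""
    scored = [(name, tanimoto(query, features)) for name, features in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, `neutral_loss("C6H12O6", "C6H10O5")` yields a loss of two hydrogens and one oxygen, that is, water. Rejecting fragments that are not sub-formulas of their parent mirrors, in a very simplified way, how formula fitting can flag spectral artifacts.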
An open-source chemical structure generator was implemented to generate candidate structures from the elemental formula and substructure information obtained with the preceding tools. This structure generator combines concepts from graph theory with a chemistry library, the CDK, to exhaustively generate all non-isomorphic chemical structures for the input data. The input is an elemental formula and, optionally, one or more non-overlapping prescribed substructures. The output is a usually large list of structures that needs to be reduced further. Therefore, models of metabolite-likeness were built to reject structures that do not resemble metabolites. Different molecular descriptors, fingerprints, and classifiers were evaluated, and the best combination was employed to build a final model. Only candidate structures with a high metabolite-likeness are kept in our metabolite identification pipeline.
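The final filtering step can be sketched as follows in Python. This is an illustration only: `filter_candidates`, the `score` callable, and the 0.5 threshold are hypothetical stand-ins, since the descriptors, classifier, and cut-off of the final model are not specified here.

```python
def filter_candidates(candidates, score, threshold=0.5):
    """Keep candidates whose metabolite-likeness score reaches the threshold.

    `score` is any callable mapping a candidate structure to a number;
    it stands in for a trained metabolite-likeness model (assumption).
    Returns (candidate, score) pairs sorted from most to least metabolite-like.
    """
    scored = [(candidate, score(candidate)) for candidate in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy usage with made-up scores keyed by structure identifiers.
toy_scores = {"cand_A": 0.91, "cand_B": 0.12, "cand_C": 0.67}
shortlist = filter_candidates(toy_scores, toy_scores.get)
```

Returning the scores alongside the structures keeps the shortlist rankable, so a later step can inspect the most metabolite-like candidates first.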
In this work we demonstrate, using real-life samples, how this workflow of chemoinformatics tools improves the state of the art in metabolite identification and how it helps to translate experimental data into chemical data.