
Archives for March 2012

Metabolite Retention Time Prediction, Help Needed

This is a call for all the experts in HPLC metabolite retention time prediction, or QSRR. Would you build a model using the following data or just give up? Please share your two cents.

Would you build a metabolite retention time prediction model with these data?

I got a dataset from some HPLC experiments with the mission: can you build a Retention Time (RT) prediction model with these data?

I shared a Google Spreadsheet with the Retention Time of the metabolites.

We are doing metabolite identification, and the plan is to use such a model to reject candidate structures for unknown metabolites. When we want to identify a metabolite, we will have LC-MSn data, that is, retention time, mass and maybe known substructures for the unknown metabolite.

I would propose candidate structures either by mining databases like HMDB or PubChem, or via computer-assisted structure generation. Next I would use my model to predict the RT and reject those structures whose predicted RT is way off from the experimental value.
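
A minimal sketch of that rejection step might look like the following. The candidate names, the predicted RT values and the 1.5 minute tolerance are all made up for illustration; none of them come from the dataset, and the prediction model itself is assumed to exist already.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted_rt: float  # minutes, as returned by some RT model (assumed)


def filter_candidates(candidates, experimental_rt, tolerance):
    """Keep candidates whose predicted RT lies within +/- tolerance minutes."""
    return [
        c for c in candidates
        if abs(c.predicted_rt - experimental_rt) <= tolerance
    ]


# Hypothetical candidates for one unknown metabolite (illustrative values only).
candidates = [
    Candidate("candidate_A", 2.4),
    Candidate("candidate_B", 15.8),
    Candidate("candidate_C", 3.9),
]

kept = filter_candidates(candidates, experimental_rt=3.0, tolerance=1.5)
print([c.name for c in kept])
```

With these invented numbers, candidate_B is the only structure rejected. The hard part is choosing the tolerance, which depends entirely on how well the model predicts.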

My concerns about the data

  1. We have 161 metabolites with an HMDB_Id and RT (which was measured twice). Notice that 118 of these have RT between 1.1 and 10 minutes (most of them between 1.1 and 3), and only 43 metabolites have RT between 10 and 40 min. This doesn’t look well distributed. That’s the way it is.
  2. Seven standards were added (Tyrosine, Adenosine, Tryptophan, Phenylalanine, Biotin, LPC-17, LPC-19), which I could use to correct the experimental RT, as is done with Kovats indices in gas chromatography (GC). But these standards only show RT for 77 of the 161 metabolites. What to do with this? Building a model with only 77 RTs sounds like too few data points, which could lead to overfitting the model.
  3.  How to use the standards to generate indices?
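
One possible way to turn the standards into indices, sketched here only as an assumption, is linear interpolation between the two standards that bracket a compound, in the spirit of the linear retention indices used in temperature-programmed GC. Each standard gets an arbitrary index value in elution order (100, 200, ...), and a metabolite's index follows from its position between its neighbours. The standard RTs below are placeholders, not measured values; in a real run they would be the RTs of the seven standards measured in the same run as the metabolite.

```python
from bisect import bisect_right


def retention_index(rt, standards):
    """Linearly interpolate an index from the bracketing standards.

    standards: list of (rt_minutes, index_value) tuples sorted by RT,
    with at least two entries and distinct RTs.
    Returns None when rt lies outside the range covered by the standards,
    because this sketch does not extrapolate.
    """
    rts = [s[0] for s in standards]
    if not rts[0] <= rt <= rts[-1]:
        return None
    # Position of the upper bracketing standard (clamped for rt == last RT).
    hi = min(bisect_right(rts, rt), len(rts) - 1)
    lo = hi - 1
    (rt_lo, i_lo), (rt_hi, i_hi) = standards[lo], standards[hi]
    return i_lo + (i_hi - i_lo) * (rt - rt_lo) / (rt_hi - rt_lo)


# Placeholder standards: (RT in minutes, assigned index). Illustrative only.
STANDARDS = [(2.0, 100), (5.0, 200), (12.0, 300), (30.0, 400)]

print(retention_index(3.5, STANDARDS))   # halfway between the first two standards
print(retention_index(40.0, STANDARDS))  # outside the calibrated range
```

Because the index is computed per run, it only works for metabolites measured in runs where the bracketing standards were also detected, which is exactly the limitation raised in point 2.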

What kind of data is this? 

This is what I know so far about the experimental setup.
All reagents used were of HPLC-grade purity or higher and were purchased from Sigma-Aldrich (Gillingham, UK).

Preparation of urine samples
Urine samples were collected in the morning from healthy volunteers, 3 males and 2 females in total. The samples were diluted with water in a ratio of 1:1 (v/v), giving 2 ml in total. This was centrifuged at 16.1 krpm at 10 °C, and the supernatant was collected afterwards. 375 µl of the supernatant was transferred to a tube, and 75 µl of the academic mixture from 2.1 was added. One pooled urine sample was made from all volunteers by combining 75 µl from each volunteer to a volume of 375 µl.

Reproducibility study
Two kinds of reproducibility were checked in positive ion mode: that of the chromatography and that of the fragmentation. The internal standards were used to test the reproducibility of the LC by checking them in the sample of each volunteer and in the pooled sample. The total length of the study was 54 runs (9 runs for each sample). For the fragmentation reproducibility, tyrosine (0.01 mg/ml) was injected 40 times from 40 different wells, and the differences in the mass-to-charge ratio were studied.

HPLC/LTQ Orbitrap XL operation
Samples were analyzed in positive ion mode and in a randomized order, using an Agilent 1200 with a flow of 250 µl coupled to a reversed-phase Atlantis C18 T3 column (ID 2.1 × 100, particle size 3 µm) linked to the nano ESI (TriVersa NanoMate, Advion) and LTQ-Orbitrap XL (Thermo Finnigan). The column was eluted with 2 solvents to create a gradient. Solvent A consisted of 98% H2O + 2% acetonitrile + 0.1% formic acid (v/v); solvent B consisted of 98% acetonitrile + 2% H2O + 0.1% formic acid (v/v). To provide better reproducibility, a thermostat was placed over the column in order to minimize the effects of room temperature changes during the day. 5 µl of sample was injected in each run. The injection loop of the LC was 40 cm. Centroided mass spectra were acquired in the range of 60-1000 m/z using the LTQ-Orbitrap at a resolution of 60,000. All samples not in use were stored at -20 °C.

If your answer is doing the HPLC experiment again

My first goal would be to use the current data to build a model and test it in metabolite identification, keeping in mind the statisticians' maxim concerning data quality: “crap in, crap out”. If the data doesn’t allow me to build a model, so be it.
In any case, I might have some student doing similar experiments again, so I could redo the experiments.
  1. How would you collect enough data to make a reasonable model?
  2. Add known compounds to the urine to have more data points?
  3. Use the same standards we have used and make sure they are measured for every data point?
  4. What deviation of predicted RT from experimental is acceptable to reject candidate structures?
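
On point 4, one common heuristic, offered here as an assumption rather than an established rule for this kind of data, is to derive the rejection window from the model's own errors on compounds it was not trained on, for example as a multiple k of the root-mean-square error (RMSE). The RT values and the choice k = 2 below are purely illustrative.

```python
import math


def rejection_window(experimental, predicted, k=2.0):
    """Return k * RMSE of held-out predictions, in the same units as the RTs.

    experimental, predicted: equal-length sequences of RTs for compounds
    that were NOT used to fit the model (e.g. from cross-validation).
    k: width multiplier; 2.0 is an arbitrary illustrative choice.
    """
    residuals = [p - e for e, p in zip(experimental, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return k * rmse


# Made-up held-out RTs in minutes (not from the spreadsheet).
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

window = rejection_window(experimental, predicted)
print(f"reject candidates more than {window:.2f} min from the measured RT")
```

With RTs as unevenly distributed as in this dataset, a single global window may be too wide for early eluters and too narrow for late ones, so computing it separately per RT range could be worth trying.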


If you have any useful tip, please leave a comment or send me an email at julio{at}

and I will be forever grateful.