
Metabolite Retention Time Prediction, Help Needed

This is a call for all the experts in HPLC metabolite retention time prediction, or quantitative structure-retention relationships (QSRR). Would you build a model using the following data, or just give up? Please share your two cents.

Would you build a metabolite retention time prediction model with these data?

I got a dataset from some HPLC experiments, with the mission: can you build a retention time (RT) prediction model with these data?

I shared a Google Spreadsheet with the Retention Time of the metabolites.

We are doing metabolite identification, and the plan is to use such a model to reject candidate structures for unknown metabolites. When we want to identify a metabolite, we will have LC-MSn data, that is, retention time, mass and maybe known substructures for the unknown metabolite.

I would propose candidate structures either by mining databases like HMDB or PubChem, or via computer assisted structure generation.  Next I would use my model to predict the RT and reject those structures whose predicted RT is way off from the experimental.
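The rejection step described above can be sketched as follows. This is only a minimal sketch: `predict_rt` is a hypothetical stand-in for whatever model is eventually built, the candidate names and predicted RTs are made up, and the 3-minute tolerance is an assumption, not a value derived from the data.

```python
from typing import Callable, Iterable, List


def filter_candidates(
    candidates: Iterable[str],
    experimental_rt: float,
    predict_rt: Callable[[str], float],
    tolerance_min: float,
) -> List[str]:
    """Keep candidates whose predicted RT is within tolerance of the measured RT."""
    kept = []
    for structure in candidates:
        if abs(predict_rt(structure) - experimental_rt) <= tolerance_min:
            kept.append(structure)
    return kept


# Hypothetical predicted RTs (minutes) for three made-up candidates.
toy_predictions = {"candidate_A": 2.1, "candidate_B": 9.8, "candidate_C": 4.0}

kept = filter_candidates(
    candidates=toy_predictions,
    experimental_rt=3.0,
    predict_rt=lambda name: toy_predictions[name],
    tolerance_min=3.0,  # assumed tolerance, to be replaced by a validated value
)
```

The tolerance is the critical parameter here: it should come from the model's error on compounds it has never seen, not from its fit to the training data.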

My concerns about the data

  1. We have 161 metabolites with an HMDB_Id and RT (which was measured twice). Notice that 118 of these have RT between 1.1 and 10 minutes (most of them between 1.1 and 3), and only 43 metabolites have RT between 10 and 40 min. This doesn’t look well distributed. That’s the way it is.
  2. Seven standards were added (Tyrosine, Adenosine, Tryptophan, Phenylalanine, Biotin, LPC-17, LPC-19), which I could use to correct the experimental RT, like they do with Kovats indices in gas chromatography (GC). But these standards only show RT for 77 of the 161 metabolites. What to do with this? Building a model with only 77 RTs sounds like too few data points, which could lead to overfitting the model.
  3.  How to use the standards to generate indices?
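One possible way to turn the standards into indices is linear interpolation between the two standards that bracket a compound, analogous to the linear retention indices used in temperature-programmed GC. The sketch below is an assumption about how this could be done, not an established protocol for this dataset: the index scale (100 per standard) is arbitrary, and the marker RTs are the averages quoted in the comments, used only as illustrative values. In practice the marker RTs measured in the same run would be used.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (name, marker RT in minutes, assigned index). The index scale is arbitrary.
STANDARDS: List[Tuple[str, float, float]] = [
    ("Tyrosine", 2.74, 100.0),
    ("Adenosine", 3.56, 200.0),
    ("Tryptophan", 5.41, 300.0),
    ("Phenylalanine", 9.65, 400.0),
    ("Biotin", 13.19, 500.0),
    ("LPC-17", 33.60, 600.0),
    ("LPC-19", 37.19, 700.0),
]


def retention_index(
    rt: float, standards: List[Tuple[str, float, float]] = STANDARDS
) -> Optional[float]:
    """Interpolate an index from the two standards bracketing rt.

    Returns None when rt falls outside the range covered by the standards.
    """
    rts = [s[1] for s in standards]
    if rt < rts[0] or rt > rts[-1]:
        return None
    pos = bisect_right(rts, rt)
    if pos == len(rts):  # rt equals the RT of the last standard
        return standards[-1][2]
    _, lo_rt, lo_idx = standards[pos - 1]
    _, hi_rt, hi_idx = standards[pos]
    return lo_idx + (hi_idx - lo_idx) * (rt - lo_rt) / (hi_rt - lo_rt)


midpoint_index = retention_index(11.42)  # halfway between Phenylalanine and Biotin
too_early = retention_index(1.5)         # elutes before the first standard
```

Note the limitation this exposes: compounds eluting before tyrosine get no index without extrapolation, and a large share of the metabolites here elute between 1.1 and 3 minutes.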

What kind of data is this? 

This is what I know so far about the experimental setup.
All reagents used were of HPLC grade purity or higher, purchased from Sigma-Aldrich (Gillingham, UK).

Preparation of urine samples
Urine samples were collected in the morning from healthy volunteers, 3 males and 2 females in total. The samples were diluted with water in a ratio of 1:1 (v/v), giving 2 ml in total. The diluted sample was centrifuged at 16.1 krpm at 10 °C, and the supernatant was collected afterwards. 375 µl of the supernatant was transferred to a tube, and 75 µl of the academic mixture from 2.1 was added. One pooled urine sample was made from all volunteers by combining 75 µl from each volunteer to a volume of 375 µl.

Reproducibility study
Two kinds of reproducibility were checked in positive ion mode: that of the chromatography and that of the fragmentation. The internal standards were used to test the reproducibility of the LC by checking the internal standards in each of the volunteer samples and in the pooled sample. The total length of the study was 54 runs (9 runs for each sample). For the fragmentation reproducibility, tyrosine (0.01 mg/ml) was injected 40 times from 40 different wells and the differences in the mass-to-charge ratio were studied.

 HPLC/LTQ Orbitrap XL operation
Samples were analyzed in positive ion mode, in randomized order, using an Agilent 1200 with a flow of 250 µl coupled to a reversed-phase Atlantis C18 T3 column (ID 2.1×100, particle size 3 µm), linked to the nano ESI (TriVersa NanoMate, Advion) and LTQ-Orbitrap XL (Thermo Finnigan). The column was eluted with 2 solvents to create a gradient. Solvent A consisted of 98% H2O + 2% acetonitrile + 0.1% formic acid (v/v); solvent B consisted of 98% acetonitrile + 2% H2O + 0.1% formic acid (v/v). To provide better reproducibility, a thermostat was placed over the column in order to minimize the effect of room temperature changes during the day. 5 µl of sample was injected in each run. The injection loop of the LC was 40 cm. Centroided mass spectra were acquired over the range 60-1000 m/z using the LTQ-Orbitrap at a resolution of 60,000. All samples not in use were stored at -20 °C.

If your answer is doing the HPLC experiment again

My first goal would be to use the current data to build a model and test it in metabolite identification, keeping in mind the statisticians' motto concerning data quality: “crap in, crap out”. If the data don’t allow me to build a model, so be it.
In any case, I might have some student doing similar experiments again, so I could redo the experiments.
  1. How would you collect enough data to make a reasonable model?
  2. Add known compounds to the urine to have more data points?
  3. Use the same standards we have used and make sure they are measured for every data point?
  4. What deviation of predicted RT from experimental is acceptable to reject candidate structures?


If you have any useful tip, please leave a comment or send me an email at julio{at}

and I will be forever grateful.

Interested in becoming a Scientist 2.0? Then visit my blog

  • Tobias Kind

    I did a quick check on your RT marker stability; some of them differ by up to 2.5 minutes.
    That is quite a lot of deviation. 20 seconds, ok 40 seconds mhhh. 

    Name      Tyr    Ade    Tryp   Phen   Biot   LPC-17  LPC-19
    min       2.36   2.72   4.34   8.80   12.85  33.48   33.49
    max       3.20   5.26   6.60   10.32  13.59  33.82   37.52
    max-min   0.84   2.54   2.26   1.52   0.74   0.34    4.03
    Average   2.74   3.56   5.41   9.65   13.19  33.60   37.19
    STDEV     0.24   0.52   0.58   0.37   0.22   0.10    0.46
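The stability check on the markers can be reproduced with a few lines. This is a minimal sketch using the min and max values from the table above; the 40-second limit is an assumption taken from the remark about acceptable deviation, not a formal acceptance criterion.

```python
# Marker name -> (min RT, max RT) in minutes, copied from the table above.
MARKERS = {
    "Tyr": (2.36, 3.20),
    "Ade": (2.72, 5.26),
    "Tryp": (4.34, 6.60),
    "Phen": (8.80, 10.32),
    "Biot": (12.85, 13.59),
    "LPC-17": (33.48, 33.82),
    "LPC-19": (33.49, 37.52),
}

LIMIT_SECONDS = 40  # assumed limit, based on the "40 seconds" remark

# Range (max - min) per marker, in minutes.
ranges = {name: round(hi - lo, 2) for name, (lo, hi) in MARKERS.items()}

# Markers whose RT range exceeds the assumed limit.
unstable = [name for name, r in ranges.items() if r * 60 > LIMIT_SECONDS]
```

With these values, LPC-17 is the only marker whose range stays under 40 seconds.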

    Some comments that might be of general interest: the model I sent to you with CDK descriptors and Eureqa/MARS currently does no better than 3 min in prediction accuracy. The reason is that for such a small but diverse set, the external validation set needs to be very large. I chose a split of 70 development, 30 test and 40 external validation compounds, excluding some compounds due to potential issues. You can get an R^2 of 0.99 with a prediction accuracy of 0.2 min on the test and training sets, but then overfitting occurs and the prediction for the external set is very bad, with maximum errors of 10-20 minutes.

    Because many compounds elute below 10 min, the development set should adjust for that non-normality; therefore random sampling is not advised. Also, because the compound classes themselves are not equally represented, the compounds should be assigned to general classes and then divided among training, testing and validation.
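A stratified split along these lines could look like the sketch below. It is an illustration under stated assumptions: the strata are just two RT bands (below and above 10 min) rather than real compound classes, the 70:30:40 weights mirror the split mentioned above, and the records are placeholders that only reproduce the 118/43 counts from the post.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Record = Tuple[str, float]  # (compound name, RT in minutes)


def stratified_split(
    records: Sequence[Record],
    stratum_of: Callable[[Record], Hashable],
    weights: Sequence[int] = (70, 30, 40),
    seed: int = 0,
) -> List[List[Record]]:
    """Split records into len(weights) sets, keeping stratum proportions similar."""
    rng = random.Random(seed)
    by_stratum: Dict[Hashable, List[Record]] = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    total = sum(weights)
    splits: List[List[Record]] = [[] for _ in weights]
    for members in by_stratum.values():
        rng.shuffle(members)
        start, cumulative = 0, 0
        for i, w in enumerate(weights):
            cumulative += w
            end = round(len(members) * cumulative / total)
            splits[i].extend(members[start:end])
            start = end
    return splits


# Placeholder records: 118 early-eluting and 43 late-eluting compounds.
records = [(f"early_{i}", 2.0) for i in range(118)] + [
    (f"late_{i}", 20.0) for i in range(43)
]


def rt_band(rec: Record) -> str:
    return "early" if rec[1] < 10 else "late"


development, test, external = stratified_split(records, rt_band)
```

Replacing `rt_band` with a function that returns a compound class (or a class and RT band pair) would stratify on chemistry as well as on elution time.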

    There are also several trends: for example, sugars all elute around 1-2 minutes, while other compounds elute according to their logP or logD. The pH dependencies are quite large, between 3 and 5 units for logD (the pH-dependent logP). That behavior could be modeled using a tree that uses different regression models for those compound classes or elution bands (M5P or M5Rules).
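The idea of separate regression models per compound class can be illustrated with a much simpler stand-in than M5P: one ordinary least-squares line of RT against logD for each predefined class. This is not M5P or M5Rules (those learn the splits themselves), and the classes, logD values and RTs below are invented purely to make the example run.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, float, float]  # (compound class, logD, RT in minutes)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_class(rows: Sequence[Row]) -> Dict[str, Tuple[float, float]]:
    """Fit one line per compound class."""
    xs: Dict[str, List[float]] = defaultdict(list)
    ys: Dict[str, List[float]] = defaultdict(list)
    for cls, logd, rt in rows:
        xs[cls].append(logd)
        ys[cls].append(rt)
    return {cls: fit_line(xs[cls], ys[cls]) for cls in xs}


def predict(models: Dict[str, Tuple[float, float]], cls: str, logd: float) -> float:
    slope, intercept = models[cls]
    return slope * logd + intercept


# Invented toy data, for illustration only.
toy_rows: List[Row] = [
    ("sugar", -3.0, 1.2), ("sugar", -2.5, 1.5), ("sugar", -2.0, 1.8),
    ("lipid", 2.0, 30.0), ("lipid", 3.0, 34.0), ("lipid", 4.0, 38.0),
]

models = fit_per_class(toy_rows)
lipid_rt = predict(models, "lipid", 3.5)
```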

    Increasing the diversity and number of compounds will surely lead to more robust models. For exclusion filters, not only the RMS error of prediction should be minimized but also the maximum error, so that no strong outliers are allowed.
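The difference between the two error measures is easy to see on a small example. The sketch below uses invented predicted and observed RTs; a single strong outlier dominates the maximum error while the RMS error looks far more moderate.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error of prediction."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def max_abs_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Largest absolute deviation between prediction and observation."""
    return max(abs(p - o) for p, o in zip(predicted, observed))


# Invented RTs in minutes: three good predictions and one strong outlier.
observed = [2.0, 5.0, 12.0, 30.0]
predicted = [2.5, 4.5, 12.5, 20.0]

rms = rmse(predicted, observed)
worst = max_abs_error(predicted, observed)
```

For a rejection filter, the tolerance has to cover `worst` rather than `rms`, otherwise the correct structure can be thrown out when it happens to be the outlier.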


  • Lochana Menikarachchi

    I guess I am late to reply to this thread. The HPLC retention indices and predictive models we’ve built in our lab might be the best in the field. We continue to improve our models by adding thousands of new compounds to training data. 
