In this tutorial I will show you how you can use the PredRet database to annotate your LC-MS metabolomics data directly in R. The annotation function I will be using is general-purpose, though, so you can use it to annotate any data where you have a list of compounds with known m/z’s and retention times (RTs).


In short PredRet is a user-driven database of compound retention times. The purpose of PredRet is to be able to predict the RT of a compound in one (your!) chromatographic system if it has been experimentally determined in another chromatographic system by someone, somewhere in the world. You can download the paper here and visit the project’s home page at PredRet.org.



Pulling data from PredRet

So lets get going.
First we will download the PredRet database with the PredRetR package. We will get both the experimental and the predicted values for the chromatographic system “LIFE_old”.

library(PredRetR)

predret_db <-    PredRet_get_db(exp_pred = c("exp","pred"),
                                systems = "LIFE_old"
                               )


Then lets take a look at the structure of the data.frame we have retrieved. If the recorded_rt column has data we have an experimentally determined RT, if the predicted_rt column has data we have a RT predicted with the PredRet systems. The ci_lower and ci_upper columns show the prediction interval for the predictions.

dim(predret_db)
head(predret_db)
## [1] 573  12
system name pubchem inchi date added username predicted suspect recorded_rt predicted_rt ci_lower ci_upper
LIFE_old (DL)-p-hydroxyphenyllactic acid 9378 InChI=1S/C9H10O4/c10-7-3-1-6(2-4-7)5-8(11)9(12)13/h1-4,8,10-11H,5H2,(H,12,13) 2014-09-02 15:20:11 jan FALSE FALSE 1.544350 NA NA NA
LIFE_old (R)-2-hydroxybutyric acid 11266 InChI=1S/C4H8O3/c1-2-3(5)4(6)7/h3,5H,2H2,1H3,(H,6,7) 2014-09-02 15:20:11 jan FALSE FALSE 1.399400 NA NA NA
LIFE_old 1-O-1'-(Z)-octadecenyl-2-hydroxy-sn-glycero-3-phosphocholine (LysoPC(P-18:0)) 24779527 InChI=1S/C26H54NO6P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-22-31-24-26(28)25-33-34(29,30)32-23-21-27(2,3)4/h20,22,26,28H,5-19,21,23-25H2,1-4H3 2014-09-02 15:20:11 jan FALSE FALSE 4.947717 NA NA NA
LIFE_old 2-Hydroxy-2-methylbutyric acid 95433 InChI=1S/C5H10O3/c1-3-5(2,8)4(6)7/h8H,3H2,1-2H3,(H,6,7) 2015-06-08 17:19:00 jan FALSE TRUE 1.625783 NA NA NA
LIFE_old 2-Hydroxy-3-methylbutyric acid 99823 InChI=1S/C5H10O3/c1-3(2)4(6)5(7)8/h3-4,6H,1-2H3,(H,7,8) 2015-06-08 17:19:00 jan FALSE TRUE 1.685467 NA NA NA
LIFE_old 2-methylacetoacetic acid 150996 InChI=1S/C5H8O3/c1-3(4(2)6)5(7)8/h3H,1-2H3,(H,7,8) 2014-09-02 15:20:11 jan FALSE FALSE 1.612850 NA NA NA


We have the RT directly in the database above but we do not have the m/z. Since we have the InChI the easiest way to get the m/z is to extract the molecular formula and then use the Rdisop package to get the mass.

library(Rdisop)
library(stringr)

formulas   <- str_split_fixed(predret_db[,"inchi"],"/",3)[,2]
masses     <- sapply(formulas, function(x) getMass(getMolecule(x))   )

predret_db <- cbind.data.frame(predret_db, mz_pos = masses + 1.0073)


We can then split the database in two. One for the experimental RTs and one for the predicted. Here I have used some pipe/dplyr style code and even a forward assign. If you are not familiar with this type of R code I urge you to look into it. It really makes code much more readable.

library(dplyr)
predret_db %>% filter(!is.na(recorded_rt)) -> predret_db_exp
predret_db %>% filter(is.na(recorded_rt))  -> predret_db_pred


Annotating a dataset

We now have the database ready for annotation.
So we can load a dataset/peaklist. This peaklist was previously created with XCMS and fragments/adducts annotated with CAMERA. But again any peaklist will do.

library(readxl)

data <- read_excel("../material/data/peaklist_pos.xlsx") %>% 
        replace(is.na(.), "") %>% 
        as.data.frame


Lets take a look at the interesting columns we have in the dataset.

info_cols <- c("mz","rt","isotopes","adduct","pcgroup")
dim(data)
## [1] 4232  470
head(    data[,info_cols]    )
##          mz       rt   isotopes              adduct pcgroup
## 1  62.06013 273.8827            [M+H-C5H8O]+ 145.11       1
## 2  95.08600 271.5567                                      1
## 3  98.09642 268.1695                                      1
## 4 104.10687 271.2400   [12][M]+ [M+H-COCH2]+ 145.11       1
## 5 105.11030 271.2415 [12][M+1]+                           1
## 6 114.09119 268.9771            [M+H-CH3OH]+ 145.11       1


Now we can use db.comp.assign from my chemhelper package to annotate the dataset.
We would probably want one column in our peaklist for the RTs we have determined experimentally and one for the predicted RTs. So we do the annotation twice.
The first two arguments to the function are the m/z and RT of the dataset. The next three are the name, m/z and RT of the database of known compounds. Lastly we give the tolerance for the m/z and RT for a database match.

library(chemhelper)

anno_exp <- db.comp.assign(mz           = data[,"mz"],
                           rt           = data[,"rt"],
                           comp_name_db = predret_db_exp[,"name"],
                           mz_db        = predret_db_exp[,"mz_pos"],
                           rt_db        = predret_db_exp[,"recorded_rt"] *60, # *60 since XCMS works in seconds but PredRet is in minutes.
                           mzabs        = 0.01,
                           ppm          = 15,
                           ret_tol      = 10
                           )
##                                         [,1]
## Unique hits                               86
## Non-unique hits                           10
## Compounds not found                      207
## Markers had multiple compounds assigned   50
anno_pred <- db.comp.assign(mz          = data[,"mz"],
                           rt           = data[,"rt"],
                           comp_name_db = predret_db_pred[,"name"],
                           mz_db        = predret_db_pred[,"mz_pos"],
                           rt_db        = predret_db_pred[,"predicted_rt"] *60,
                           mzabs        = 0.01,
                           ppm          = 15,
                           ret_tol      = 10
                           )
##                                         [,1]
## Unique hits                               34
## Non-unique hits                            2
## Compounds not found                      234
## Markers had multiple compounds assigned    5


Now lets put the annotations together with our dataset.

data_annotated <- cbind.data.frame(data,anno_exp,anno_pred)


The result

Then lets take a look at one of the feature groups (from CAMERA) where we got an annotation. In this example we have a feature that was annotated as tryptophan. The “OR” is because tryptophan was in the database several times. If multiple compounds would fit the m/z and RT they would be written with “OR” between them.
There is also a fragment annotated as adipic acid and 2-methylglutaric acid using a hit from the predicted RTs. In this case we have almost perfect CAMERA annotation suggesting the pseudo-molecular ion is the m/z = 205.0979 feature and the compound is very likely tryptophan.

data_annotated %>% filter(pcgroup==16) %>% select(one_of(info_cols,"anno_exp","anno_pred"))
mz rt isotopes adduct pcgroup anno_exp anno_pred
118.0657 91.31296 [M+H-NH3-CO-COCH2]+ 204.09 16
130.0655 91.40512 16
132.0808 91.27887 [44][M]+ [M+H-NH3-CO-CO]+ 204.09 16
133.0845 91.19346 [44][M+1]+ 16
142.0647 91.20588 [M+H-NH3-HCOOH]+ 204.09 16
144.0416 91.25665 16
144.0814 91.34230 [55][M]+ [M+H-NH3-CO2]+ 204.09 16
145.0846 91.28341 [55][M+1]+ 16
146.0599 91.33676 [58][M]+ [M+H-NH3-COCH2]+ 204.09 16
147.0647 91.36894 [58][M+1]+ 16 Adipic acid OR Adipic acid 2-Methylglutaric Acid
159.0922 91.28882 [71][M]+ [M+H-HCOOH]+ 204.09 16
160.0851 91.35138 [71][M+1]+ 16
170.0608 91.34747 [M+H-NH3-H2O]+ 204.09 16
188.0711 91.34805 [93][M]+ [M+H-NH3]+ 204.09 16
189.0743 91.28935 [93][M+1]+ 16
205.0979 91.32364 [102][M]+ [M+H]+ 204.09 16 tryptophan OR Tryptophan OR Tryptophan
206.1034 91.31068 [102][M+1]+ 16
245.1301 91.91434 [M+H+(CH3)2CO-H2O]+ (acetone cond.) 204.09 16
276.1823 91.41425 16
409.1873 91.28252 [265][M]+ [2M+H]+ 204.09 16
410.1906 91.26287 [265][M+1]+ 16
447.1336 91.29100 [2M+K]+ 204.09 [4M+2K]2+ 204.09 16


Lets take a look at another feature. This time we have experimental RTs that say the feature is either 1,7-dimethylxanthine OR theobromine but a prediction from the PredRet database also suggest that the feature could be theophylline as well.

data_annotated %>% filter(pcgroup==400) %>% select(one_of(info_cols,"anno_exp","anno_pred"))
mz rt isotopes adduct pcgroup anno_exp anno_pred
124.0508 89.67717 400
181.0725 89.56150 [86][M]+ [M+H-NH3-CO2-NH3-H2O]+ 276.119 [M+H-C4H6-COCH2]+ 276.119 400 1,7-dimethylxanthine OR theobromine theobromine OR theophylline OR 1,7-Dimethylxanthine
182.0761 89.59507 [86][M+1]+ 400
315.0806 89.70856 [M+K]+ 276.119 400


That’s it!