In this tutorial I will show you how to use the PredRet database to annotate your LC-MS metabolomics data directly in R. The annotation function I will be using is general-purpose, though, so you can use it to annotate any data for which you have a list of compounds with known m/z values and retention times (RTs).
In short, PredRet is a user-driven database of compound retention times. The purpose of PredRet is to predict the RT of a compound in one (your!) chromatographic system if it has been experimentally determined in another chromatographic system by someone, somewhere in the world. You can download the paper here and visit the project’s home page at PredRet.org.
Pulling data from PredRet
So let's get going.
First we will download the PredRet database with the PredRetR package. We will get both the experimental and the predicted values for the chromatographic system “LIFE_old”.
Then let's take a look at the structure of the data.frame we have retrieved. If the recorded_rt column has data, we have an experimentally determined RT; if the predicted_rt column has data, we have an RT predicted by the PredRet system. The prediction-interval columns (such as ci_upper) give the interval around each prediction.
|LIFE_old||(DL)-p-hydroxyphenyllactic acid||9378||InChI=1S/C9H10O4/c10-7-3-1-6(2-4-7)5-8(11)9(12)13/h1-4,8,10-11H,5H2,(H,12,13)||2014-09-02 15:20:11||jan||FALSE||FALSE||1.544350||NA||NA||NA|
|LIFE_old||(R)-2-hydroxybutyric acid||11266||InChI=1S/C4H8O3/c1-2-3(5)4(6)7/h3,5H,2H2,1H3,(H,6,7)||2014-09-02 15:20:11||jan||FALSE||FALSE||1.399400||NA||NA||NA|
|LIFE_old||1-O-1'-(Z)-octadecenyl-2-hydroxy-sn-glycero-3-phosphocholine (LysoPC(P-18:0))||24779527||InChI=1S/C26H54NO6P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-22-31-24-26(28)25-33-34(29,30)32-23-21-27(2,3)4/h20,22,26,28H,5-19,21,23-25H2,1-4H3||2014-09-02 15:20:11||jan||FALSE||FALSE||4.947717||NA||NA||NA|
|LIFE_old||2-Hydroxy-2-methylbutyric acid||95433||InChI=1S/C5H10O3/c1-3-5(2,8)4(6)7/h8H,3H2,1-2H3,(H,6,7)||2015-06-08 17:19:00||jan||FALSE||TRUE||1.625783||NA||NA||NA|
|LIFE_old||2-Hydroxy-3-methylbutyric acid||99823||InChI=1S/C5H10O3/c1-3(2)4(6)5(7)8/h3-4,6H,1-2H3,(H,7,8)||2015-06-08 17:19:00||jan||FALSE||TRUE||1.685467||NA||NA||NA|
|LIFE_old||2-methylacetoacetic acid||150996||InChI=1S/C5H8O3/c1-3(4(2)6)5(7)8/h3H,1-2H3,(H,7,8)||2014-09-02 15:20:11||jan||FALSE||FALSE||1.612850||NA||NA||NA|
We have the RT directly in the database above, but we do not have the m/z. Since we have the InChI, the easiest way to get the m/z is to extract the molecular formula from it and then use the Rdisop package to calculate the mass.
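To make the idea concrete, here is a minimal base R sketch of the two steps: pull the formula layer out of the InChI, then turn the formula into a mass. It is not the Rdisop-based code; it uses a small hand-written table of monoisotopic masses so that it runs without extra packages. The helper names `inchi_to_formula` and `formula_to_mass` are made up for this illustration, and the m/z assumes positive mode with a simple [M+H]+ ion.

```r
# Monoisotopic masses (u) for the elements in the example rows above.
monoisotopic <- c(C = 12.0, H = 1.00782503, N = 14.0030740,
                  O = 15.9949146, P = 30.9737620, S = 31.9720707)
proton <- 1.00727646  # added to the neutral mass to form [M+H]+

# Hypothetical helper: the formula is the layer between the first
# and second "/" of a standard InChI string.
inchi_to_formula <- function(inchi) {
  sub("^InChI=[^/]*/([^/]*).*$", "\\1", inchi)
}

# Hypothetical helper: sum element masses for a simple, single-component
# formula such as "C9H10O4". Multi-component formulas (containing ".")
# and charged species are not handled in this sketch.
formula_to_mass <- function(formula) {
  tokens   <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  elements <- sub("[0-9]+$", "", tokens)
  counts   <- sub("^[A-Za-z]+", "", tokens)
  counts[counts == ""] <- "1"          # no number means one atom
  counts   <- as.numeric(counts)
  unknown  <- setdiff(elements, names(monoisotopic))
  if (length(unknown) > 0) {
    stop("No mass available for: ", paste(unknown, collapse = ", "))
  }
  sum(monoisotopic[elements] * counts)
}

inchis <- c(
  "InChI=1S/C9H10O4/c10-7-3-1-6(2-4-7)5-8(11)9(12)13/h1-4,8,10-11H,5H2,(H,12,13)",
  "InChI=1S/C4H8O3/c1-2-3(5)4(6)7/h3,5H,2H2,1H3,(H,6,7)"
)

formulas     <- inchi_to_formula(inchis)
neutral_mass <- vapply(formulas, formula_to_mass, numeric(1), USE.NAMES = FALSE)
mz_pos       <- neutral_mass + proton  # assumption: [M+H]+ in positive mode

print(data.frame(formula = formulas, neutral_mass, mz_pos))
```

For real data a dedicated package is the better choice, since it covers all elements and handles isotopes properly.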
We can then split the database in two: one part for the experimental RTs and one for the predicted RTs. Here I have used some pipe/dplyr-style code and even a forward assign. If you are not familiar with this type of R code, I urge you to look into it. It really makes code much more readable.
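If the pipe and forward assign are new to you, this small sketch shows the pattern on a toy table. It is an illustration only: the rows are invented, and I use base R's native pipe (`|>`, which needs R 4.1 or newer) with `subset()` so that it runs without dplyr. The column names recorded_rt and predicted_rt are the ones from the database.

```r
# Toy stand-in for the downloaded table; the values are made up.
db <- data.frame(
  name         = c("compound A", "compound B", "compound C"),
  recorded_rt  = c(1.54, NA, 4.95),
  predicted_rt = c(NA, 1.70, NA)
)

# Read left to right: take db, keep the rows of interest, store the result.
# The "->" at the end is the forward assign.
db |> subset(!is.na(recorded_rt))  -> db_exp
db |> subset(!is.na(predicted_rt)) -> db_pred

# With dplyr loaded, the same step would read:
#   db %>% filter(!is.na(recorded_rt)) -> db_exp

print(db_exp)
print(db_pred)
```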
Annotating a dataset
We now have the database ready for annotation.
So we can load a dataset/peaklist. This peaklist was previously created with XCMS, and fragments/adducts were annotated with CAMERA. But again, any peaklist will do.
Let's take a look at the interesting columns we have in the dataset.
Now we can use db.comp.assign from my chemhelper package to annotate the dataset.
We would probably want one column in our peaklist for the RTs we have determined experimentally and one for the predicted RTs. So we do the annotation twice.
The first two arguments to the function are the m/z and RT of the dataset. The next three are the name, m/z and RT of the database of known compounds. Lastly we give the tolerance for the m/z and RT for a database match.
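To show what such a match involves, here is a rough sketch of tolerance-based matching in base R. This is not the chemhelper implementation; `annotate_by_mz_rt` and its argument names are hypothetical, it only supports an absolute m/z tolerance, and the example values are invented. The argument order mirrors the description above: dataset m/z and RT first, then database name, m/z and RT, then the tolerances.

```r
# Hypothetical helper: for each feature, list every database compound
# whose m/z and RT both fall within the given tolerances.
annotate_by_mz_rt <- function(mz, rt, db_name, db_mz, db_rt, mz_tol, rt_tol) {
  vapply(seq_along(mz), function(i) {
    hit <- abs(db_mz - mz[i]) <= mz_tol & abs(db_rt - rt[i]) <= rt_tol
    hit[is.na(hit)] <- FALSE  # missing database values never match
    if (any(hit)) paste(db_name[hit], collapse = " OR ") else NA_character_
  }, character(1))
}

# Toy peaklist (RT in minutes here, to match the toy database).
feat_mz <- c(205.0979, 181.0725, 300.0000)
feat_rt <- c(1.52, 1.49, 2.00)

# Toy database of known compounds; the RTs are made up.
db_name <- c("Tryptophan", "theobromine", "1,7-dimethylxanthine")
db_mz   <- c(205.0972, 181.0720, 181.0720)
db_rt   <- c(1.53, 1.50, 1.48)

anno <- annotate_by_mz_rt(feat_mz, feat_rt, db_name, db_mz, db_rt,
                          mz_tol = 0.01, rt_tol = 0.1)
print(anno)
```

Note that the dataset and the database must use the same RT unit. If one is in seconds and the other in minutes, convert before matching.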
Now let's put the annotations together with our dataset.
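Combining them can be as simple as binding the annotation vectors to the peaklist as new columns. In the sketch below the peaklist and annotations are toy values, and the column names `anno_exp` and `anno_pred` are my own choice for this illustration. It relies on the annotation vectors being in the same row order as the peaklist.

```r
# Toy peaklist and toy annotation results, one entry per feature.
peaks     <- data.frame(mz = c(205.0979, 181.0725), rt = c(91.3, 89.6))
anno_exp  <- c("Tryptophan", "theobromine")
anno_pred <- c(NA, "theophylline")

# Both vectors must have one element per row of the peaklist.
stopifnot(length(anno_exp) == nrow(peaks), length(anno_pred) == nrow(peaks))

peaks_annotated <- cbind(peaks, anno_exp = anno_exp, anno_pred = anno_pred)
print(peaks_annotated)
```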
Then let's take a look at one of the feature groups (from CAMERA) where we got an annotation. In this example we have a feature that was annotated as tryptophan. The “OR” is there because tryptophan was in the database several times. If multiple compounds fit the m/z and RT, they are written with “OR” between them.
There is also a fragment annotated as adipic acid and 2-methylglutaric acid using a hit from the predicted RTs. In this case we have an almost perfect CAMERA annotation, suggesting that the pseudo-molecular ion is the m/z = 205.0979 feature and that the compound is very likely tryptophan.
|147.0647||91.36894||[M+1]+||16||Adipic acid OR Adipic acid||2-Methylglutaric Acid|
|205.0979||91.32364||[M]+||[M+H]+ 204.09||16||tryptophan OR Tryptophan OR Tryptophan|
|245.1301||91.91434||[M+H+(CH3)2CO-H2O]+ (acetone cond.) 204.09||16|
|447.1336||91.29100||[2M+K]+ 204.09 [4M+2K]2+ 204.09||16|
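As a quick sanity check on these adduct labels, you can compute the expected m/z of each ion from the formula of tryptophan (C11H12N2O2) and compare it with the observed features. The sketch below does this with standard monoisotopic masses. The variable names are mine, and how large an error you accept depends on your instrument and on the tolerances you gave CAMERA.

```r
# Monoisotopic masses (u) and the electron mass.
mass     <- c(C = 12.0, H = 1.00782503, N = 14.0030740,
              O = 15.9949146, K = 38.9637064)
electron <- 0.00054858

# Neutral tryptophan, C11H12N2O2, plus the groups used by the adduct labels.
M       <- 11 * mass[["C"]] + 12 * mass[["H"]] + 2 * mass[["N"]] + 2 * mass[["O"]]
acetone <- 3 * mass[["C"]] + 6 * mass[["H"]] + mass[["O"]]
water   <- 2 * mass[["H"]] + mass[["O"]]
MH      <- M + mass[["H"]] - electron

expected <- c(
  "[M+H]+"              = MH,
  "[M+H+(CH3)2CO-H2O]+" = MH + acetone - water,
  "[2M+K]+"             = 2 * M + mass[["K"]] - electron
)

# Observed m/z values of the three features in the table above.
observed <- c(205.0979, 245.1301, 447.1336)

check <- data.frame(
  adduct    = names(expected),
  expected  = round(unname(expected), 4),
  observed  = observed,
  ppm_error = round((observed - unname(expected)) / unname(expected) * 1e6, 1)
)
print(check)
```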
Let's take a look at another feature. This time we have experimental RTs that say the feature is either 1,7-dimethylxanthine or theobromine, but a prediction from the PredRet database also suggests that the feature could be theophylline.
|181.0725||89.56150||[M]+||[M+H-NH3-CO2-NH3-H2O]+ 276.119 [M+H-C4H6-COCH2]+ 276.119||400||1,7-dimethylxanthine OR theobromine||theobromine OR theophylline OR 1,7-Dimethylxanthine|