In this post I will go through some of the pitfalls of heatmap generation in R. I found that at least for me the heatmap functions do not always do what you think they do. When working with my own data I got convinced that something was wrong and investigated. I found that scaling was not done the way it would seem from the documentation of several functions. I eventually ended in on stackoverflow where Thomas W. Leja had explained what was going on.
Asking several colleagues no-one was aware of the issues I had found. I have therefore written up my findings here and in this post I will show in detail what to me doesn’t work as I expected and how do gain control of what precisely is done when you make a heatmap.
If you are surprised as well by the below please pass this on to everybody you know that might not be making heatmaps the way they think they are.
heatplot from made4
Lets start with heatplot from the made4 package just using the default settings.
Figure 1. Using heatplot defaults.
Yikes! This is not good for this data. This is because
heatplot by default does row scaling only. Fair enough. Lets use column scaling instead.
Figure 2. Heatplot with column scaling for the heat data only.
Much better! But wait! Did you see it? The colors in the plot changed. But the dendrogram did not! It turns out that the
scale argument only refers to the scaling of the heat data, NOT what happens to the scaling before calculation of the dendrograms.
So how is the data for the dendrogram actually scaled now??? It turns out that if you look at the code for the function
dualScale it does NOT do what it says.
dualScale applies row scaling before dendrogram calculations… And that is it. So now we have row scaling for the dendrograms and have added additional column scaling for the heat data. Hmmmm… not what we want either.
So lets turn off the row scaling and we do the column scaling before we give the data to
Figure 3. Column scaled data and then heatplot. No row scaling in heatplot.
Good then. That seems better, all is well. Well… There is a thing more that probably did not work as you had thought but is not very visible in this case.
heatplot has the argument zlim which by default reassigns all values smaller than -3 and all values above 3 to -3 and 3 respectively. This makes a lot of sense when you use unit variance scaling. If you don’t do that most of your values will be in the middle of the distribution which gives you very pale colors since only the very extreme values will use the full range of the color gradient.
The problem here is that
zlim is actually turned off when
dualScale=FALSE which was probably not obvious to you either. In this case it doesn’t do much damage but if the destribution of you data is tailed this won’t look nice.
Pheww… Now we understand
heatmap from stats and heatmap.2 from gplots
Now lets see if we can do the same plot with
heatmap uses different defaults for distance calculation and clustering so lets change
heatmap to use the same calculations and also make the color the same.
Figure 4. Column scaled data and then heatmap. No other scaling.
Oh no! Both dendrogram and the heat data is different. It turns out there are two reasons for this. For the dendrogram it turns out that
heatplot uses the dendrogram directly whereas
heatmap reorders the dendrogram based on row means. As far as I could find you cannot change this without calculating the dendrograms seperately.
For the colors it turns out that heatplot uses so-called symmetric breaks which means that 0 is put in the middle of the distribution of the data.
Btw.: And this is very important,
heatmap.2 (we will get to that one) has the same “feature” as
scale refers ONLY to the heat data, NOT the dendrogram calculation.
Lets jump right to
heatmap.2 which is basically a version of
heatmap with more options.
Figure 5. Column scaled data and then heatmap.2. No other scaling.
This looks similar to
heatmap.2 reorders as above while
heatplot does not.
Gaining full controls
If you are not busy hyperventilating at this point I will offer here a solution that will give the control we seek over when exactly what happens to the data.
I have created (well actually I have stolen and expanded the code provided by Thomas W. Leja) a function,
heat.clust, that will generate scaled data and dendrograms that can be passed to
heatmap.2so that it can be controlled what happens at which point. You can find the function in my package massageR.
Lets jump into an example.
Figure 6. Preparing data using heat.clust and then heatplot.2.
The parameters for
heat.clust do the following:
- scaledim: Selects to scale “column” or “row” or both (
scaledim = c("row","column")).
- zlim: Selects the limits of the data after scaling. Note that if you use anything else than unit variance scaling you should consider these limits. -3 to 3 probably doesn’t make sense in any other case. You can disable the limits using
- zlim_select: Selects when to apply
zlim. For the dendrogram and/or for the outputted (scaled) data that is passed to
- reorder: Selects if to reorder the dendrogram by “column” and/or “row” means.
- distfun: The function for distance calculation.
- hclustfun: The function for clustering.
- scalefun: the function to use for scaling.
Next we simply pass on the data and row and column dendrograms to
heatmap.2. We can then do additional scaling and set symmetric breaks if we want.
I hope you enjoyed the read and now feel you know exactly what the heatmap funcions do.
Exemplifying why it matters when you do scaling
If you are not convinced that the above is very important I will explain here why scaling is really really important.
Lets take the same data as before and do no scaling at all.
Figure 7. Preparing data using heat.clust with no scaling and then heatplot.2.
We can clearly see that this is no good. Now if we do the same just with scaling for the heat data we get.
Figure 8. Preparing data using heat.clust with no scaling and then heatplot.2 with column scaling of heat data.
Doesn’t look so bad right? Wrong! Compared to the last plot we did in the previous section you might think the clustering doesn’t look that different.
But lets look at what actually happened to the correlation matrix for the rows when the columns are not scaled.
Figure 9. Correlation plot of rows in the mtcars data.
Dang! Everything is super correlated! And if you plot two rows against each other we see why.
Figure 10. Variable 7 vs. variable 6 of the mtcars data. Data not scaled.
Because the variables are on so different scales fitting a line through seems pretty linear and the correlation is high…
But if we do the scaling we see the truth.
Figure 11. Variable 7 vs. variable 6 of the mtcars data. Data column scaled first.
Not correlated at all it turns out.
Equivalent plots using heat.clust + heatmap.2
In this section I replicate all the plots above so you can see exactly what each command do when using full control of what happens.
Figure 1 equivalent. Using heat.clust and heatmap.2.
Figure 2 equivalent. Using heat.clust and heatmap.2.
Figure 3 equivalent. Using heat.clust and heatmap.2.
Figure 4 equivalent. Using heat.clust and heatmap.2.
Figure 5 equivalent. Using heat.clust and heatmap.2.