Academic Figures: Common Pitfalls and Best Practices

Things that annoy me - An opinionated guide

Jan Stanstrup

Introduction

Why This Matters

  • Figures are often the first (and sometimes only) thing readers look at

  • Poor-quality figures look unprofessional

  • Low-resolution images look bad when scaled up (e.g. posters)

  • Bad visualizations can be misleading

What We’ll Cover

Color Gradients

What is the Rainbow Scale?

Interactive Challenge: Can you order these colors?

The Correct Rainbow Order

Colors in order: Red → Orange → Yellow → Green → Cyan → Blue → Purple/Magenta

Comparing Rainbow Implementations

Why This Order?

This follows the visible light spectrum by wavelength:

  • Red: ~700 nm (longest)
  • Violet: ~400 nm (shortest)

But wavelength order ≠ perceptual order!

A Tale of Two Colormaps

Which one shows the data more accurately?

The data is smooth, yet Colormap A creates false boundaries!

Perceptual Non-Uniformity: Demonstrated

The data is perfectly smooth, yet rainbow creates artificial edges!

Real world consequences

Not just spatial data

Figure from Haseneyer et al. (2011)

Medical Consequences

Lives at Stake

Borkin et al. (2011) - IEEE Visualization

Studied physicians diagnosing heart disease using medical imaging:

  • Physicians using jet colormap: More errors, slower diagnosis
  • Physicians using perceptually uniform colormaps: Fewer errors, faster

Why?

  • Bright yellow appears more “intense” than dark red
  • But dark red represents higher values (more critical condition)
  • Perceptual bias leads to misdiagnosis

Reference: Borkin et al. (2011)

Comparison: Rainbow vs Better Alternatives

Notice how rainbow and heat have sharp transitions while viridis/magma are smooth!

Desaturated

Rainbow loses all information when desaturated! Viridis/magma remain readable.
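The desaturation claim can be checked numerically. The following is a minimal sketch, not the code behind the slides: it approximates the gray level of each color with the common luma weights (0.299, 0.587, 0.114) and uses base R's `rainbow()` and `hcl.colors(n, "viridis")` (an HCL approximation of viridis, not the viridis package itself). The helper names `luma` and `is_monotonic` are made up for this illustration.

```r
# Approximate gray level (0-255) of each color, using Rec. 601 luma weights
luma <- function(cols) {
  rgb <- col2rgb(cols)  # 3 x n matrix with rows red, green, blue
  as.numeric(c(0.299, 0.587, 0.114) %*% rgb)
}

# A scale survives grayscale conversion only if lightness changes monotonically
is_monotonic <- function(x) all(diff(x) > 0) || all(diff(x) < 0)

n <- 7
luma_rainbow <- luma(rainbow(n))
luma_viridis <- luma(hcl.colors(n, "viridis"))

round(luma_rainbow)
round(luma_viridis)
is_monotonic(luma_rainbow)
is_monotonic(luma_viridis)
```

If the gray levels rise and fall along the scale, two very different data values can end up with the same gray, which is what makes the desaturated rainbow unreadable.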

The Viridis Color Scales

All viridis scales are perceptually uniform

The Viridis Color Scales, desaturated

Green-Blind Vision (Deuteranopia)

For ~8% of men, rainbow is nearly useless! Viridis/magma stay distinct.

Color Scale Comparison: Pros and Cons

Scales compared: Rainbow, Jet, Turbo, Heat, ggplot default, Brewer Blues, Viridis, Magma, Cividis

Criteria: Perceptually uniform, Colorblind safe, B&W/grayscale safe, Good on projectors, Print friendly, Engaging colors, Wide color range

Scale | Recommendation
Rainbow | ❌ AVOID
Jet | ❌ AVOID
Turbo | ⚠️ Careful
Heat | ❌ AVOID
ggplot default | OK
Brewer Blues | Good
Viridis | ✅ DEFAULT
Magma | ✅ Great
Cividis | ✅ Great

Viridis Family: Show All Options

All are perceptually uniform and colorblind-friendly!

Recommendation for continuous scales

For most use cases: Use Viridis (or Magma/Plasma variants)

When to use something else:

  • Diverging data (has meaningful center): ColorBrewer diverging (RdBu, RdYlBu)
  • Audience with many colorblind viewers: Cividis
  • Print-only publication: Brewer Blues or Greens

Never use: Rainbow or Jet

Colors for qualitative data

ColorBrewer: All Palettes

Explore all palettes at: colorbrewer2.org

ColorBrewer Website

Yellow Color Warning

The Yellow Problem

ColorBrewer includes yellow in some palettes (e.g., “YlOrRd”, “RdYlBu”, “Set1”), but yellow has serious issues:

  1. Poor printing: Yellow can be nearly invisible on white paper
  2. Projection problems: On projected slides, yellow often washes out
  3. Low contrast: Yellow text on white background is unreadable
  4. Photocopying: Disappears when photocopied in B&W

Recommendation:

  • For presentations: Avoid yellow-heavy palettes
  • For print: Use darker yellows or oranges instead
  • For text: NEVER use yellow text on light backgrounds

Removing yellow from Set1:

library(RColorBrewer)

# Set1 has yellow as the 6th color
set1_colors <- brewer.pal(9, "Set1")
set1_colors
# [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3" "#FF7F00" "#FFFF33" "#A65628" "#F781BF" "#999999"

# Remove yellow (position 6)
set1_no_yellow <- set1_colors[-6]

# Use in ggplot2
scale_color_manual(values = set1_no_yellow)
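The low-contrast point about yellow can be put into numbers. The sketch below is an illustration only and is not part of the original slides: it uses the WCAG 2.x contrast-ratio formula (an assumption brought in here), and the helper names `relative_luminance` and `contrast_ratio` are made up for this example.

```r
# Relative luminance of a color (0 = black, 1 = white), WCAG 2.x definition
relative_luminance <- function(col) {
  srgb <- col2rgb(col)[, 1] / 255
  linear <- ifelse(srgb <= 0.03928, srgb / 12.92, ((srgb + 0.055) / 1.055)^2.4)
  sum(c(0.2126, 0.7152, 0.0722) * linear)
}

# Contrast ratio between two colors: 1 (identical) up to 21 (black on white)
contrast_ratio <- function(col1, col2) {
  l <- sort(c(relative_luminance(col1), relative_luminance(col2)),
            decreasing = TRUE)
  (l[1] + 0.05) / (l[2] + 0.05)
}

contrast_ratio("#FFFF33", "white")  # Set1 yellow on white
contrast_ratio("#377EB8", "white")  # Set1 blue on white
```

A ratio close to 1 means the color is almost indistinguishable from the background, which is the situation for the Set1 yellow on white paper or a washed-out projector.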

Recommendations Summary

Best Practices for Color

  1. Continuous data: Viridis family
    • scale_fill_viridis_c() or scale_color_viridis_c()
    • Options: “viridis”, “magma”, “plasma”, “inferno”, “cividis”
  2. Diverging data (meaningful center): ColorBrewer
    • scale_fill_distiller(palette = "RdBu") for continuous
    • scale_fill_gradient2() for custom diverging
  3. Categorical/qualitative data:
    • Viridis: scale_fill_viridis_d() or scale_color_viridis_d()
    • ColorBrewer: scale_fill_brewer(palette = "Set2") (up to 8-12 categories)
  4. AVOID:
    • ❌ Rainbow/jet colormaps
    • ❌ Red-green combinations (colorblind issue)
    • ❌ Yellow text or yellow-heavy palettes (visibility issue)
  5. ALWAYS TEST:
    • ✅ Grayscale conversion
    • ✅ Colorblind simulation
    • ✅ Print preview

Quick Reference: Code Examples

library(ggplot2)

# --- CONTINUOUS DATA ---

# Viridis continuous (best default)
ggplot(data, aes(x, y, fill = continuous_var)) +
  geom_raster() +
  scale_fill_viridis_c(option = "viridis")  # or "magma", "plasma", "cividis"

# ColorBrewer sequential continuous
ggplot(data, aes(x, y, fill = continuous_var)) +
  geom_raster() +
  scale_fill_distiller(palette = "Blues")  # or "YlOrRd", "Greens", etc.

# --- DIVERGING DATA (meaningful center) ---

# ColorBrewer diverging
ggplot(data, aes(x, y, fill = fold_change)) +
  geom_raster() +
  scale_fill_distiller(palette = "RdBu", direction = 1)

# Custom diverging
ggplot(data, aes(x, y, fill = fold_change)) +
  geom_raster() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0)

# --- CATEGORICAL/QUALITATIVE DATA ---

# Viridis discrete
ggplot(data, aes(x, y, color = category)) +
  geom_point() +
  scale_color_viridis_d(option = "viridis")

# ColorBrewer qualitative
ggplot(data, aes(x, y, color = category)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")  # or "Dark2", "Paired", etc.

Heatmap Scaling

The Outlier Problem

Scenario: Metabolomics data (log2-transformed)

  • Most metabolites: 6 to 10 (log2 scale)
  • 1 outlier metabolite: ~10 log2 units higher (2^10 ≈ 1000x higher!)

What happens with default scaling?

The extreme outlier compresses the color scale for all other values!

Visual Example: The Outlier Effect

Without Outlier Handling

  • Outliers dominate the color scale
  • Most data compressed into narrow range
  • Group differences invisible
  • Patterns lost 😱
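The compression effect can be illustrated with a few lines of base R. This is a sketch with made-up numbers, not the data behind the slides; the helper `color_bin` is hypothetical and mimics a linear mapping of the data range onto 100 colors.

```r
# Map values linearly onto n_colors bins, as a default heatmap color scale does
color_bin <- function(x, n_colors = 100) {
  scaled <- (x - min(x)) / (max(x) - min(x))
  pmin(floor(scaled * n_colors) + 1, n_colors)
}

typical      <- seq(6, 10, by = 0.5)  # made-up log2 intensities
with_outlier <- c(typical, 20)        # plus one made-up extreme value

# Color bins used by the typical values, without and with the outlier present
range(color_bin(typical))
range(color_bin(with_outlier)[seq_along(typical)])
```

Without the outlier the typical values spread over all 100 colors; with it, the same values are squeezed into less than a third of the scale.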

Solution 1: Robust MAD Scaling and Cutoffs

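library(tidyverse)     # %>%, mutate(), across(), column_to_rownames()
library(pheatmap)
library(RColorBrewer)  # brewer.pal()

# scale_mad() is a user-defined helper; its definition is not part of this block
# Step 1: Robust scaling using MAD (Median Absolute Deviation)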
expr_scaled <- expr_data %>%
  mutate(across(-Sample, scale_mad))

# Step 2: Identify which values will be capped
expr_mat_scaled <- expr_scaled %>% column_to_rownames("Sample") %>% as.matrix()
capped_cells <- (expr_mat_scaled < -3) | (expr_mat_scaled > 3)

# Step 3: Cap at ±3 (meaningful after MAD scaling!)
expr_capped <- expr_scaled %>% mutate(across(-Sample, ~ pmin(pmax(.x, -3), 3)))

# Step 4: Create symmetric breaks centered at 0
max_abs <- max(abs(range(expr_capped[,-1])))
breaks  <- seq(-max_abs, max_abs, length.out = 101)
# Create asterisk markers for capped values (in original order)
asterisk_matrix <- matrix("", nrow = nrow(capped_cells), ncol = ncol(capped_cells))
asterisk_matrix[capped_cells] <- "*"

# Create final plot with asterisk markers
# display_numbers uses the original data order, clustering is applied automatically
p <- pheatmap(expr_capped %>% column_to_rownames("Sample"),
              main = "MAD-scaled + capped at ±3 (* = capped)",
              color = colorRampPalette(rev(brewer.pal(11, "RdBu")))(100),
              breaks = breaks, scale = "none",
              display_numbers = asterisk_matrix,
              number_color = "black",
              fontsize_number = 14,
              silent = TRUE)

Robust scaling approach

  • Use MAD instead of SD - not affected by outliers
  • Center by median (robust)
  • Scale by MAD (Median Absolute Deviation)
  • Then cap at ±3 MAD (~99% of normal data)
  • Outliers now exceed threshold!
  • Set scale = "none" (already scaled!)
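The code above relies on a `scale_mad()` helper whose definition is not shown. A minimal version, assuming it simply centers each vector by its median and divides by the MAD, might look like the sketch below. This is a guess at the helper, not the author's implementation. Note that base R's `mad()` multiplies by 1.4826 by default, so for normal data the result is comparable to an SD-based z-score.

```r
# Hypothetical helper: center by the median and scale by the MAD
scale_mad <- function(x) {
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}

x     <- c(1, 2, 3, 4, 100)       # made-up values with one outlier
z_mad <- scale_mad(x)
z_sd  <- as.numeric(scale(x))     # classic mean/SD z-score for comparison

round(z_mad, 2)
round(z_sd, 2)
abs(z_mad) > 3   # flags only the outlier
abs(z_sd) > 3    # the outlier inflates the SD and hides itself
```

The comparison shows why a ±3 cutoff is only meaningful after robust scaling: with SD scaling the outlier inflates the very statistic used to detect it.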

Solution 2: Range scaling and Quantile-Based cut-off

# Step 1: Identify values outside 5-95 percentiles PER COLUMN
expr_mat_raw <- expr_data %>% column_to_rownames("Sample") %>% as.matrix()
capped_cells_q <- apply(expr_mat_raw, 2, function(x) {
  q_lower <- quantile(x, 0.05, na.rm = TRUE)
  q_upper <- quantile(x, 0.95, na.rm = TRUE)
  (x < q_lower) | (x > q_upper)
})

# Step 2: Cap at 5th and 95th percentiles PER COLUMN (metabolite)
# cap_quantiles() is a user-defined helper; its definition is not part of this block
expr_capped <- expr_data %>%
  mutate(across(-Sample, ~ cap_quantiles(.x, lower = 0.05, upper = 0.95)))

# Step 3: Range scaling (min-max normalization to [0,1])
expr_quantile <- expr_capped %>% mutate(across(-Sample, ~ (.x - min(.x)) / (max(.x) - min(.x))))
# Create asterisk markers for capped values (in original order)
asterisk_matrix_q <- matrix("", nrow = nrow(capped_cells_q), ncol = ncol(capped_cells_q))
asterisk_matrix_q[capped_cells_q] <- "*"

# Create final plot with asterisk markers
p <- pheatmap(expr_quantile %>% column_to_rownames("Sample"),
              main = "Capped at 5-95 percentiles + range-scaled (* = capped)",
              color = rev(viridis::magma(100)),
              display_numbers = asterisk_matrix_q,
              number_color = "white",
              fontsize_number = 14,
              silent = TRUE)

Quantile capping + range scaling

  • Cap first to remove outliers per metabolite
  • Then range scale to use full [0,1] color scale
  • More robust to outliers than variance scaling
  • Good for non-normal data
  • Common quantiles: 5-95% or 2-98%
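The `cap_quantiles()` helper used above is not defined in the slides. A minimal sketch, assuming it clamps each vector to its own lower and upper quantiles, could be written as follows. Both `cap_quantiles` as implemented here and `range_scale` are hypothetical, and the numbers are made up.

```r
# Hypothetical helper: clamp values to the [lower, upper] quantile range
cap_quantiles <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Min-max scaling to [0, 1], as in Step 3 above
range_scale <- function(x) (x - min(x)) / (max(x) - min(x))

x      <- c(1:19, 1000)           # made-up values with one extreme outlier
capped <- cap_quantiles(x)

range(x)
range(capped)
range(range_scale(capped))
```

Capping before range scaling matters: if the order were reversed, the single extreme value would define the maximum and push everything else toward 0.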

Solution 3: Log Transformation

For positive values only (e.g., counts, intensities)

# Log transform BEFORE plotting (metabolomics data is positive-only)
expr_log_sol3 <- expr_data %>%
  mutate(across(-Sample, ~ log2(.x + 1)))  # +1 to handle zeros

p <- pheatmap(expr_log_sol3 %>% column_to_rownames("Sample"),
              main = "Log2 transformed metabolite intensities",
              color = rev(viridis::magma(100)),
              silent = TRUE)

For count/intensity data

  • Compresses wide ranges
  • Add +1 to handle zeros
  • Common for RNA-seq, proteomics
  • Use log2, log10, or ln

The Dendrogram Scaling Trap

Hidden Technical Issue

Critical R bug: Functions like heatmap(), heatmap.2(), and heatplot() have a dangerous inconsistency:

  • The scale parameter affects color visualization
  • But NOT dendrogram calculation!

Result: Dendrograms cluster on unscaled data while colors show scaled data!


P.S: pheatmap() seems to apply scaling and cropping before clustering!

Why Scaling Matters for Clustering

The Problem: High-variance features dominate correlations

Without scaling, features with large values dominate sample correlations!

Why Scaling Matters for Clustering (2)

Key point: Without scaling, high-variance metabolites completely dominate the correlation calculation between samples!

Scaling ensures all metabolites contribute equally to sample clustering.
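A tiny base-R example makes the effect concrete. It is a sketch with made-up numbers, and it uses Euclidean distance with complete linkage rather than the correlation-based distance mentioned above, purely as a simplifying assumption.

```r
# Made-up data: 4 samples x 3 metabolites on very different scales
m <- rbind(
  A = c(1000, 1.0, 2.0),
  B = c(1010, 5.0, 8.0),
  C = c(2000, 1.1, 2.1),
  D = c(2010, 5.1, 8.1)
)
colnames(m) <- c("abundant", "minor1", "minor2")

# Cluster samples on raw values vs. column-scaled values (mean 0, SD 1)
hc_raw    <- hclust(dist(m), method = "complete")
hc_scaled <- hclust(dist(scale(m)), method = "complete")

cutree(hc_raw, k = 2)     # groups follow the abundant metabolite only
cutree(hc_scaled, k = 2)  # groups follow the two minor metabolites
```

On raw values, samples A and B cluster together because the abundant metabolite dominates the distances. After scaling, A pairs with C, because two of the three metabolites agree on that grouping. A heatmap that colors scaled data but clusters on raw data would show the first dendrogram next to the second pattern.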

Using massageR::heat.clust

Better Approach: heat.clust

The massageR package provides heat.clust() which handles scaling and dendrogram calculation correctly in one step!

Key advantages:

  • Scales data and calculates dendrograms together
  • Controls exactly where limits are applied (data and/or dendrograms)
  • Returns pre-computed dendrograms
  • Works seamlessly with pheatmap

heat.clust with pheatmap

library(massageR)

# Convert tibble to matrix for heat.clust
expr_matrix <- expr_data %>% column_to_rownames("Sample") %>% as.matrix()

# Use heat.clust with robust MAD scaling
z <- heat.clust(expr_matrix,
                scaledim = "column",           # Scale by column
                zlim = c(-3, 3),               # Cap at ±3 MAD
                zlim_select = c("dend", "outdata"),  # Apply to both
                reorder = c(),                 # Reorder dendrograms off for consistency
                distfun = function(x) dist(x),
                hclustfun = function(x) hclust(x, method = "complete"),
                scalefun = scale_mad)          # Use MAD scaling instead of default

max_abs <- max(abs(range(z$data)))
breaks  <- seq(-max_abs, max_abs, length.out = 101)
# Use with pheatmap
p <- pheatmap(z$data,
              cluster_rows = as.hclust(z$Rowv),
              cluster_cols = as.hclust(z$Colv),
              scale = "none",
              color = colorRampPalette(rev(brewer.pal(11, "RdBu")))(100),
              breaks = breaks,
              main = "heat.clust + pheatmap: Properly scaled!",
              silent = TRUE)

One-step workflow

  • Scales data automatically
  • Calculates dendrograms on scaled data
  • Caps at specified zlim
  • Returns everything needed for pheatmap
  • Ensures consistency throughout

Comparison: Before and After

Best Practices & Recommendations

Workflow

  1. Inspect data distribution before making heatmap
  2. Consider log transformation
  3. Scale data BEFORE passing to heatmap function
    • Use MAD (robust) if data has outliers: (x - median(x)) / mad(x)
    • Use SD if data is clean: scale(x)
  4. Cap extremes
    • Cap at ±3 MAD for robust scaling with outliers
    • Cap using quantiles (5-95%) if using range scaling
  5. Calculate dendrograms on the same scaled and capped data
  6. Consider massageR::heat.clust() for automatic proper scaling workflow
  7. Use appropriate palette:
    • Often centering data highlights contrasts
    • Diverging color scale for data that has been centered (red-white-blue)
    • Sequential color scale for one-directional data (viridis, magma)

When NOT to Cap

  • If outliers are biologically meaningful (rare events)
  • Small datasets where each value matters
  • When you want to highlight extreme values

Which image format for which purpose?

What is a Raster Image?

Raster graphics are grids of colored pixels

  • Stores individual pixel colors: RGB(255, 128, 64)
  • Fixed resolution measured in DPI (Dots Per Inch)
  • More pixels = higher resolution = larger file size
  • Cannot be scaled up without quality loss

What is a Vector Image?

Vector graphics use mathematical descriptions

  • Mathematical formulas define shapes and lines
  • “Draw a line from point (0,0) to (10,10)”
  • “Create a circle with center (5,5) and radius 3”
  • Infinitely scalable without quality loss

Infinite Resolution

Because vectors are mathematical formulas, they can be scaled to any size without losing quality. The curve is defined by equations, not pixels!

Vector vs. Raster Comparison

Vector vs. Raster Comparison (2)

Aspect | Vector (PDF, SVG, EPS) | Raster (PNG, TIFF, JPG)
Definition | Mathematical formulas | Grid of pixels
Scalability | Infinite resolution | Fixed resolution (DPI)
File Size | Small (formulas compact) | Large (all pixels stored)
Best For | Plots, diagrams, text, screenshots of websites | Photos, screenshots
Editability | Easy to edit paths | Pixel-level editing only
Text Quality | Always crisp | Can become blurry

Container Formats: PDF and TIFF

PDF and TIFF are Containers!

Both PDF and TIFF can contain EITHER vector OR raster data:

  • PDF: Can contain vector graphics, raster images, or both
  • TIFF: Usually raster, but can embed vector data

TIFF Compression Options:

Type | Description | Use Case
Uncompressed | No compression (huge files) | Archival
LZW | Lossless compression | Publications
ZIP | Lossless compression | Publications
JPEG | Lossy compression | Web (avoid for science)

JPEG Compression Artifacts

Format Properties Comparison

Format | Type | Compression | Container? | Notes
PDF | Vector + Raster | Lossless | Yes | Can embed both vector and raster data
EPS | Vector | Lossless | Yes | Older format required by some journals. Use device = "eps" in ggsave(). PDF is preferred when accepted.
SVG | Vector | Lossless | No | XML-based, web-native
PNG | Raster | Lossless | No | Supports transparency
TIFF | Raster | Lossless or Lossy | Yes | Multiple pages, various compression options
JPEG | Raster | Lossy | No | Best for photos only
WebP | Raster | Lossless or Lossy | No | Modern web format, smaller than PNG/JPEG

Container Formats

Container formats can hold multiple types of data or multiple images:

  • PDF: Can mix vector graphics, raster images, fonts, and text
  • EPS: Can embed fonts, preview images, and vector data
  • TIFF: Can contain multiple pages/images with different compression

Non-container formats store a single image with one encoding type.

File Format Comparison

Real world horror story

This is the graphical abstract

This was one of the figures

Vector Screenshots from Websites

Capturing Website Content as Vector Graphics

  1. Open the webpage in Google Chrome
  2. Open Chrome DevTools (F12)
  3. Click “⋮”
  4. Click “More Tools”
  5. Click “Rendering”
  6. Set “Emulate CSS media type” to “screen”
  7. “Print” to PDF

Why This Matters

  • Text stays crisp - Perfect for including web-based figures
  • Smaller file size - Vector PDF is compact
  • Editable - Can extract or edit elements in PDF editor
  • Publication quality - No pixelation when zoomed

Examples:

  • Document online data visualization examples
  • Include web-based tools in presentations and papers

Format Selection: Decision Tree

%%{init: {'theme':'dark', 'themeVariables': {'edgeLabelBackground':'#1a1a1a', 'primaryTextColor':'#fff', 'secondaryTextColor':'#fff', 'tertiaryTextColor':'#fff'}}}%%

flowchart LR
    A[What type of image?] --> B{Photo}
    A --> C{Screenshot}
    A --> D{Generated figure/<br/>website snapshot}

    B --> B1{Where will it<br/>be used?}
    B1 -->|Publication| B2[TIFF LZW<br/>300+ DPI]
    B1 -->|Presentation/Web| B3[JPEG/WebP <br/> 150+ DPI]

    C --> C1{Where will it<br/>be used?}
    C1 -->|Publication| C2[TIFF LZW<br/>300+ DPI]
    C1 -->|Presentation/Web| C3[PNG<br/>150 DPI]

    D --> D1{Publication or<br/>presentation?}
    D1 -->|Publication| D2{Vector support?}
    D1 -->|Presentation/Web| D3[SVG or PNG]

    D2 -->|Yes| D4[PDF / SVG / EPS<br/>All equivalent vectors]
    D2 -->|No| D5[TIFF 600+ DPI LZW<br/>PNG not supported]

    style A fill:#5dade2,color:#fff
    style B fill:#5dade2,color:#fff
    style C fill:#5dade2,color:#fff
    style D fill:#5dade2,color:#fff
    style B1 fill:#5dade2,color:#fff
    style C1 fill:#5dade2,color:#fff
    style D1 fill:#5dade2,color:#fff
    style D2 fill:#5dade2,color:#fff
    style D4 fill:#2ecc71,color:#fff
    style D5 fill:#5dade2,color:#fff
    style D3 fill:#9b59b6,color:#fff
    style B2 fill:#3498db,color:#fff
    style B3 fill:#e74c3c,color:#fff
    style C2 fill:#3498db,color:#fff
    style C3 fill:#9b59b6,color:#fff
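
The flowchart above can also be written as a small lookup function. This is only a sketch: `choose_format()` and its argument names are hypothetical, and the returned strings simply restate the leaves of the decision tree.

```r
# Hypothetical helper encoding the decision tree above
choose_format <- function(image_type = c("photo", "screenshot", "generated"),
                          use = c("publication", "presentation"),
                          vector_supported = TRUE) {
  image_type <- match.arg(image_type)
  use <- match.arg(use)

  if (image_type == "photo") {
    if (use == "publication") "TIFF (LZW), 300+ DPI" else "JPEG/WebP, 150+ DPI"
  } else if (image_type == "screenshot") {
    if (use == "publication") "TIFF (LZW), 300+ DPI" else "PNG, 150 DPI"
  } else if (use == "presentation") {
    "SVG or PNG"            # generated figure or website snapshot
  } else if (vector_supported) {
    "PDF / SVG / EPS"       # journal accepts vector formats
  } else {
    "TIFF (LZW), 600+ DPI"  # journal accepts raster formats only
  }
}

choose_format("generated", "publication", vector_supported = FALSE)
choose_format("photo", "presentation")
```

Here "generated" stands for the "Generated figure / website snapshot" branch, and "presentation" covers both presentation and web use.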

Last notes

  • Make sure that any art you include is vector-based!
  • Composing your figures in PowerPoint is OK, but use high-quality source material
  • BioRender material is typically vector-based

Text and Element Sizing

The Key Insight

The Smart Way to Control Text Size

Don’t manually set font sizes for every element!

Instead, use smaller figure dimensions to make text appear larger relative to the plot.

Then use vector formats (SVG/PDF) for infinite resolution.

Visual Demonstration

Small canvas (3.5 × 3 inches)

Text appears large relative to plot

ggsave("sizing_small.svg", p,
       width = 3.5, height = 3)

Large canvas (7 × 6 inches)

Text appears small relative to plot

ggsave("sizing_large.svg", p,
       width = 7, height = 6)

base_size Effect (theme_classic(base_size = x))

base_size = 8

base_size = 11 (default)

base_size = 14

Other elements scale relative to base_size:

  • plot.title: 1.2× base_size
  • axis.title: 1× base_size (inherits the base size)
  • axis.text: 0.8× base_size
  • legend.text: 0.8× base_size

Use base_size as the PRIMARY adjustment - only customize individual elements if needed

The Right Workflow

Three-Step Process

  1. Set dimensions to match final output size
    • Journal single column: ~3.5 inches (Always check specific journal guidelines!)
    • Journal double column: ~7 inches
    • Presentation: ~10 inches
  2. Adjust base_size if needed
    • Only fine-tune if text still too large/small
  3. Adjust individual elements if needed
    • Use theme() to customize specific text sizes
    • Only if steps 1-2 don’t achieve desired result

Why This Works

  • Text/points/lines have fixed sizes
  • If the canvas is small they appear larger
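
The arithmetic behind this is simple enough to sketch. The helper `text_fraction()` is hypothetical, and the example treats one point as roughly 1/72 inch, which is an approximation.

```r
# Height of text as a fraction of the figure width (1 pt is roughly 1/72 inch)
text_fraction <- function(font_pt, fig_width_in) {
  (font_pt / 72) / fig_width_in
}

text_fraction(11, 3.5)  # 11 pt text on a 3.5 inch wide figure
text_fraction(11, 7)    # same text on a 7 inch wide figure
```

Halving the canvas width doubles the share of the figure that the same 11 pt text occupies, which is why choosing the final output size first usually removes the need to touch font sizes.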

When you must adjust some sizes individually

Default (base_size = 11)

p + theme_classic(base_size = 11)

With manual adjustments

p + theme_classic(base_size = 11) +
  theme(
    axis.title   = element_text(size = 14, face = "bold"),
    axis.text    = element_text(size = 12),
    legend.title = element_text(size = 12, face = "bold"),
    legend.text  = element_text(size = 11)
  )

Vector vs Raster: Key Differences

Vector Formats (SVG, PDF)

Dimensions control text/element proportions, not quality!

  • Scale infinitely without quality loss
  • DPI is ignored
  • What matters: aspect ratio & relative proportions

# Text larger in small.svg
ggsave("small.svg", p,
       width = 3.5, height = 3)
ggsave("large.svg", p,
       width = 7, height = 6)

Raster Formats (PNG, TIFF)

DPI controls pixel count and quality!

  • Fixed resolution, can pixelate when scaled
  • DPI matters (300+ for print)
  • pixels = width × DPI

# 1500×1200 pixels
ggsave("plot.png", p,
       width = 5, height = 4, dpi = 300)
# 360×288 pixels
ggsave("plot.png", p,
       width = 5, height = 4, dpi = 72)
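
The pixel arithmetic can be checked with a small helper. This is a sketch; `pixel_dims()` is a hypothetical name and the poster width is a made-up example.

```r
# Pixel dimensions of a raster export: size in inches times DPI
pixel_dims <- function(width_in, height_in, dpi) {
  c(width = width_in * dpi, height = height_in * dpi)
}

pixel_dims(5, 4, dpi = 300)  # 1500 x 1200 pixels
pixel_dims(5, 4, dpi = 72)   # 360 x 288 pixels

# Effective DPI when a 1500-pixel-wide image is printed 20 inches wide
1500 / 20
```

The last line shows why a figure that looks fine in a paper can pixelate on a poster: the pixel count is fixed, so the effective DPI drops as the printed size grows.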

ggplot2 Themes

The Default Problem

# Default ggplot2 theme
p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Default theme_gray()",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders") +
  theme(plot.title = element_text(face = "bold"))

Problems:

  • Gray background (wastes ink, unprofessional)
  • Too much “chart junk”
  • Not publication-ready

Built-in Theme: theme_bw()

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "theme_bw() - White background, black border",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold"))

Good for publications - Clean with reference gridlines

Built-in Theme: theme_classic()

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "theme_classic() - No gridlines, clean axes",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"))

Very minimal - Traditional journal style

Built-in Theme: theme_minimal()

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "theme_minimal() - Subtle gridlines, modern",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Good balance - Clean with subtle reference lines

Built-in Theme: theme_void()

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "theme_void() - Blank canvas") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        legend.position = "right")

For custom designs - Maps, minimalist graphics

Side-by-Side Comparison

Publication Package: ggpubr

# ggpubr - publication-ready themes and statistical annotations
# install.packages("ggpubr")
library(ggplot2)
library(ggpubr)  # provides theme_pubr()

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(aes(fill = factor(cyl)), alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "ggpubr - Publication ready with stats",
       x = "Cylinders",
       y = "Miles Per Gallon") +
  theme_pubr() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold")) +
  scale_fill_brewer(palette = "Set2")

ggpubr Package

Publication-ready themes + statistical annotations

  • theme_pubr() - Clean publication theme
  • theme_pubclean() - Even more minimal
  • stat_regline_equation() - Automatic regression equations
  • stat_cor() - Correlation statistics
  • stat_compare_means() - p-values and significance brackets

ggpubr: Adding Regression Equations

library(ggpubr)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE,
              color = "darkred", formula = y ~ x) +
  stat_regline_equation(
    aes(label = after_stat(eq.label)),
    formula = y ~ x,
    label.x.npc = 0.95,  # 95% to the right (relative)
    label.y.npc = 0.95,  # 95% to the top (relative)
    hjust = 1            # right-align text
  ) +
  stat_cor(
    aes(label = paste(after_stat(rr.label), after_stat(p.label), sep = "~~~~")),
    label.x.npc = 0.95, label.y.npc = 0.88, hjust = 1
  ) +
  labs(title = "Linear Regression with Equation",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_pubr()

Key function: stat_regline_equation()

  • Automatically calculates and displays equation
  • Shows R² value
  • Customizable position and formatting
  • Works with facets

ggpubr: Pairwise Comparisons

# Automatically generate all pairwise comparisons
dose_levels <- levels(factor(ToothGrowth$dose))
my_comparisons <- combn(dose_levels, 2, simplify = FALSE)

p <- ggboxplot(ToothGrowth,
               x = "dose", y = "len", color = "dose", palette = "jco") +
  stat_compare_means(
                      comparisons = my_comparisons,
                      method = "t.test",
                      p.adjust.method = "BH"  # Benjamini-Hochberg (FDR) correction
                    ) +
  stat_compare_means(
                      method = "anova",
                      label.y = 50
                    ) +
  labs(title = "Pairwise Comparisons with Multiple Testing Correction",
       x = "Dose (mg/day)",
       y = "Tooth Length")

  • combn(levels, 2) generates all pairs automatically
  • method = "t.test" for pairwise tests (for Tukey’s HSD, see ggpubr::geom_pwc(method = "tukey_hsd"))
  • p.adjust.method = "BH" for multiple testing correction (“holm”, “bonferroni”, “hochberg”, “BY”, “fdr”)
  • method = "anova" for overall test
  • Automatic significance brackets

Modern Typography: hrbrthemes

library(hrbrthemes)  # provides theme_ipsum() and scale_color_ipsum()

ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  facet_wrap(~gear, labeller = label_both) +
  labs(title = "hrbrthemes::theme_ipsum() - Modern typography",
       subtitle = "Clean, professional, with excellent fonts and facets",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders") +
  theme_ipsum() +
  scale_color_ipsum()

hrbrthemes Package

Modern professional typography

  • Uses high-quality fonts (requires font installation)
  • theme_ipsum() - Modern, clean, professional
  • theme_ipsum_rc() - Roboto Condensed font
  • Excellent for presentations and reports
  • Works beautifully with facets
  • May require: extrafont::font_import()

Specialized Themes: ggthemes

Setting Global Theme

Set once, apply to all plots:

# At top of script
theme_set(theme_bw(base_size = 12))

# Now all plots use theme_bw
# automatically
p1 <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Plot 1")

p2 <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species)) +
  labs(title = "Plot 2")

p3 <- ggplot(faithful, aes(eruptions)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(title = "Plot 3")

Font Considerations

# Check available fonts
library(extrafont)
# font_import()  # Run once, first time only (takes a while)
fonts()          # List available fonts

# Use in theme
theme_classic(base_family = "Arial") +
  theme(
    plot.title = element_text(family = "Arial", face = "bold"),
    axis.title = element_text(family = "Arial")
  )

Font Preferences by Journal

Many journals prefer specific fonts:

  • Arial - Most common, widely accepted
  • Helvetica - Classic choice
  • Times New Roman - Traditional journals
  • Calibri - Modern alternative

Check journal author guidelines!

Theme Elements You Can Customize

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~gear, labeller = label_both) +
  labs(title = "Customized Theme Elements",
       x = "Cylinders",
       y = "Miles Per Gallon",
       fill = "Cylinders") +
  theme_minimal() +
  theme(
    # Axis elements
    axis.title = element_text(size = 12, face = "bold", color = "navy"),
    axis.text = element_text(size = 10, color = "gray30"),
    # Legend
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    legend.background = element_rect(fill = "gray95", color = "gray50"),
    # Panel
    panel.grid.major = element_line(color = "gray80", linewidth = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    # Facet strips
    strip.background = element_rect(fill = "steelblue", color = "navy"),
    strip.text = element_text(color = "white", face = "bold", size = 11),
    # Plot
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_fill_brewer(palette = "Set2")

Customizable elements:

  • axis.title, axis.text - Axis labels
  • legend.position - “top”, “bottom”, “left”, “right”, “none”
  • panel.grid - Gridlines
  • plot.background, panel.background - Backgrounds
  • strip.background, strip.text - Facet labels

Recommendations

Best Practices

  1. Never use default theme_gray() for publications
    • Gray background wastes ink and looks unprofessional
  2. Set global theme at start of script for consistency
    • theme_set(theme_classic()) applies to all subsequent plots
  3. Match journal style - check published figures
    • Look at recent issues of your target journal
    • Note font choices, gridline presence, color schemes
  4. Keep it simple - less chart junk = better
    • Remove unnecessary gridlines
    • Minimize non-data ink
  5. Use publication packages for multi-panel figures
    • patchwork for intuitive combining syntax
    • ggpubr::ggarrange() for automatic labeling

Saving Plots in R

The Graphics Device System

Every plot needs a “device”

Device = where R sends the graphics output (device = “the printer”)

  • Screen (RStudio viewer)
  • PDF file
  • SVG file
  • PNG file
  • JPEG file
  • etc.

Old Way: Manual Device Management

pdf("myplot.pdf", width = 7, height = 5)
  plot(x, y)
dev.off()  # CRITICAL! Must close device

dev.off() Works with ALL R Plots

The dev.off() approach works for:

  • Base R plots (plot(), hist(), barplot(), etc.)
  • ggplot2 plots
  • Any R graphics output

It’s universal - not limited to any specific plotting system!

Problems with manual device management:

  • Easy to forget dev.off()
  • Verbose
  • Not intuitive
  • Must manage device lifecycle manually
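
One common base-R idiom for the forgotten `dev.off()` problem is to wrap the device in a function and register the cleanup with `on.exit()`. The sketch below is not from the slides; `save_base_plot()` and its `draw` argument are hypothetical names.

```r
# Hypothetical helper: open a PDF device and guarantee it is closed again,
# even if the plotting code throws an error
save_base_plot <- function(path, draw, width = 7, height = 5) {
  pdf(path, width = width, height = height)
  on.exit(dev.off(), add = TRUE)
  draw()
  invisible(path)
}

out <- save_base_plot(
  file.path(tempdir(), "myplot.pdf"),
  draw = function() plot(1:10, (1:10)^2)
)
file.exists(out)
```

Because the cleanup is registered right after the device is opened, the file is always finalized, which avoids the corrupt or empty PDFs that a missing `dev.off()` tends to leave behind.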

Modern Way: ggsave()

For ggplot2 objects (recommended!)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Saves last plot by default
ggsave("myplot.pdf")

# Better: explicit plot object
ggsave("myplot.pdf", plot = p, width = 6, height = 4)

No dev.off() needed! ✨

Full Control

ggsave(
  filename = "figure1.pdf",
  plot = my_plot,
  width = 7,
  height = 5,
  units = "in",     # or "cm", "mm"
  dpi = 300,        # for raster formats
  device = "pdf"    # or "png", "svg", "tiff"
)

Batch Saving: Result

# View the resulting tibble
plot_data
# A tibble: 3 × 3
  Species    data               plot
  <fct>      <list>             <list>
1 setosa     <tibble [50 × 4]>  <gg>
2 versicolor <tibble [50 × 4]>  <gg>
3 virginica  <tibble [50 × 4]>  <gg>

Each row contains:

  • Species name
  • Nested data for that species
  • A ggplot object with regression

Example: Setosa species plot

File Organization

library(glue)

# Good practice: separate directory
fig_dir <- "figures"
dir.create(fig_dir, showWarnings = FALSE)

ggsave(glue("{fig_dir}/figure1.pdf"), p1, width = 7, height = 5)
ggsave(glue("{fig_dir}/figure1.png"), p1, width = 7, height = 5, dpi = 300)

# Save both vector and raster versions!

Using magick for Conversion

library(magick)

# Convert PDF to 300 DPI PNG
img <- image_read_pdf("plot.pdf", density = 300)
image_write(img, "plot.png", format = "png", quality = 100)

# With better antialiasing (remove alpha channel)
img <- image_read_pdf("plot.pdf", density = 300)
img <- image_background(img, "white")  # Remove alpha
image_write(img, "plot.png", format = "png", quality = 100)

# Batch convert all PDFs in directory
pdf_files <- list.files(pattern = "\\.pdf$")
for (file in pdf_files) {
  img <- image_read_pdf(file, density = 300)
  img <- image_background(img, "white")
  out_file <- sub("\\.pdf$", ".png", file)
  image_write(img, out_file, format = "png", quality = 100)
}

Batch Saving Multiple Plots

Make a plotting function

make_species_plot <- function(data, species) {
  ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(size = 2, color = "steelblue") +
    geom_smooth(method = "lm", se = TRUE, color = "darkred", formula = y ~ x) +
    labs(title = glue("Iris {species}"), x = "Sepal Length (cm)", y = "Sepal Width (cm)") +
    theme_classic(base_size = 12)
}


Nest data per Species

# Nest data by Species and apply plotting function
plot_data <- iris %>%
  nest(data = -Species) %>%
  mutate(plot = map2(data, Species, make_species_plot))


Write out to separate file per Species

# Save all plots using walk2
walk2(plot_data$plot, plot_data$Species, ~ggsave(
  filename = glue("{fig_dir}/iris_{..2}.pdf"),
  plot = ..1,
  width = 6, height = 5
))

head(plot_data$data[[1]])
# A tibble: 6 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          5.1         3.5          1.4         0.2
2          4.9         3            1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5           3.6          1.4         0.2
6          5.4         3.9          1.7         0.4


plot_data$plot[[1]]


Recommendations

Best Practices

  1. Always use ggsave() for ggplot2 (not manual devices)
  2. Always specify width, height, units and DPI (for raster output)
  3. Set DPI to at least 300 for all raster formats (ggsave() defaults to 300, but base devices such as png() default to 72)
  4. Use cairo_pdf for PDF output
  5. Save both PDF and PNG versions
  6. Organize figures in dedicated directory
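
A minimal sketch that applies several of these recommendations at once. It assumes ggplot2 is installed and that R was built with cairo support (checked with capabilities("cairo")). The file names and the output directory are placeholders.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Dedicated figure directory (placeholder location inside tempdir())
fig_dir <- file.path(tempdir(), "figures")
dir.create(fig_dir, showWarnings = FALSE, recursive = TRUE)

pdf_file <- file.path(fig_dir, "figure1.pdf")
png_file <- file.path(fig_dir, "figure1.png")

# Vector version: use the cairo PDF device when it is available
pdf_device <- if (capabilities("cairo")) cairo_pdf else "pdf"
ggsave(pdf_file, plot = p, device = pdf_device,
       width = 7, height = 5, units = "in")

# Raster version with explicit size and resolution
ggsave(png_file, plot = p, width = 7, height = 5, units = "in", dpi = 300)
```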

Post-Export Editing

Why Edit After Export?

Legitimate uses:

  • Combine multiple plots into multi-panel figures (A, B, C labels)
  • Fine-tune alignment and spacing
  • Add annotations, arrows, or highlights
  • Adjust layout without re-running analysis

Vector Editing Tools

Inkscape

  • Free & open source
  • Native SVG format
  • Excellent PDF import
  • Cross-platform
  • Full-featured vector editor
  • Converts between vector formats
  • Slow…

Other options:

  • Adobe Illustrator: Industry standard (commercial)
  • PowerPoint: Can edit SVG as vector format

⚠️ Avoid Raster Editors!

DO NOT use Photoshop, GIMP, or other raster editors for plots!

  • These convert your plots to pixels (rasterization)
  • You lose scalability and editability
  • Text becomes uneditable
  • Quality degrades when resized

Keep it vector! Use Inkscape, Illustrator, or PowerPoint with SVG input and PDF output.

Inkscape Basics

Opening PDFs/SVGs:

  1. File → Open → Select PDF or SVG
  2. Each plot element is now editable

Useful tools:

  • Selection tool (F1): Move and resize
  • Text tool (F8): Edit or add text
  • Align and Distribute (Ctrl+Shift+A)
  • Guides (drag from rulers): Align elements precisely

Tips:

  • Group related elements (Ctrl+G)
  • Lock layers to prevent accidental edits
  • Use layers for complex figures

Handling Missing Fonts:

Keep the original font names and don’t substitute: this preserves the font information and prevents text reflow.

Cropping Canvas to Remove Whitespace

The Page Tool approach:

  1. Select the objects you want to keep (or Select All (Ctrl+A))
  2. Use Edit → Resize Page to Selection (or Ctrl+Shift+R)
  3. This resizes the canvas to fit your selection

Hidden Objects from R Plots!

R exports contain many invisible/empty objects that prevent proper cropping!

The frustrating whack-a-mole:

  • Press Ctrl+A (Select All) to reveal hidden objects
  • You’ll see many white/empty rectangles
  • Delete these empty objects first before resizing page
  • Otherwise canvas includes invisible whitespace

Why clipping doesn’t work:

  • Object → Clip → Set can destroy plot elements
  • Don’t use clipping - delete empty objects instead

Example: Before and After Editing

Original R output

After editing in Inkscape

Changes made:

  • Cropped whitespace
  • Added annotations
  • Made legend more compact

Combine figures directly in R instead

patchwork: Side by Side

A better alternative to manual composition!

library(patchwork)

p1 | p2 | p3

patchwork: Stacked Layout

p1 / p2 / p3

patchwork: Grid Layout

(p1 | p2) / (p3 | p4)

Adding Panel Labels (A, B, C…)

(p1 | p2) / (p3 | p4) +
  plot_annotation(tag_levels = 'A')

Customizing Panel Labels

(p1 | p2) / (p3 | p4) +
  plot_annotation(
    tag_levels = 'A',
    tag_prefix = '(',
    tag_suffix = ')'
  ) &
  theme(plot.tag = element_text(face = 'bold', size = 14))

Unequal Panel Sizes

# First plot takes 2x width
p1 + p2 + p3 +
  plot_layout(widths = c(2, 1, 1)) +
  plot_annotation(tag_levels = 'A')

Nested Layouts

# Large plot on left, two stacked on right
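# Parentheses around the whole layout are required: `+` binds tighter than `|`,
# so without them the annotation would attach to (p2 / p3) only
(p1 | (p2 / p3)) +
  plot_annotation(tag_levels = 'A')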
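
Operator precedence matters when patchwork operators are mixed: in R, / binds tighter than +, which in turn binds tighter than |. The base-R sketch below only inspects how R parses two expressions. Nothing is plotted, and p1, p2, p3 are just symbols here.

```r
# Parse without evaluating: how does R group this expression?
unwrapped  <- quote(p1 | (p2 / p3) + plot_annotation(tag_levels = "A"))
top_call   <- as.character(unwrapped[[1]])       # outermost operator
right_call <- as.character(unwrapped[[3]][[1]])  # operator on the right-hand side

# With parentheses around the whole layout, `+` becomes the outermost call
wrapped     <- quote((p1 | (p2 / p3)) + plot_annotation(tag_levels = "A"))
wrapped_top <- as.character(wrapped[[1]])

c(top_call, right_call, wrapped_top)
```

In the unparenthesised form the outermost call is `|`, so the annotation is added to the right-hand group alone. Wrapping the layout in parentheses makes `+` apply to the complete figure.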

Shared Legends with plot_layout()

# Create plots with same color mapping
pa <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() + theme_classic(base_size = 10)
pb <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point() + theme_classic(base_size = 10)

pa + pb +
  plot_layout(guides = 'collect') +
  plot_annotation(tag_levels = 'A')

Shared Axes with plot_layout()

# Share y-axes for side-by-side plots
(p1 | p2) + plot_layout(axes = "collect_y")

  • Default behavior: Each plot keeps its own axes
  • collect_x: Remove duplicate x-axes when plots are stacked vertically (same x-scale)
  • collect_y: Remove duplicate y-axes when plots are side-by-side (same y-scale)
  • collect: Remove both x and y axes (same scales in both directions)
  • Apply to groups where it makes sense before combining
# Complex: collect within groups first
((p1 | p2) + plot_layout(axes = "collect_y")) /
 (p3 | p4)

Benefits:

  • Cleaner appearance
  • Less redundant labels
  • Easier to compare across plots
  • Saves space

Saving patchwork Figures

# Create combined figure
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(tag_levels = 'A')

# Save as vector (recommended)
ggsave("figure1.svg", combined, width = 10, height = 8)
ggsave("figure1.pdf", combined, width = 10, height = 8)

# Save as high-res raster if needed
ggsave("figure1.png", combined, width = 10, height = 8, dpi = 300)

patchwork vs Manual Editing: Decision Guide

patchwork (in R)

✅ Fully reproducible

✅ Easy to update

✅ Automatic alignment

✅ Consistent styling

✅ Version controlled

⚠️ Less layout flexibility


Use patchwork when:

  • All panels are ggplot2 objects
  • Need reproducible figures
  • Figures may need updates
  • Sharing code with collaborators

Inkscape (manual editing)

✅ Pixel-perfect control

✅ Mix with non-R content

✅ Complex annotations

❌ Not reproducible

❌ Manual re-editing

❌ Easy to break


Use Inkscape when:

  • Need pixel-perfect alignment
  • Adding photos, diagrams, or non-R content
  • Very complex custom layouts
  • Final publication polish only

PowerPoint Import

The Copy-Paste Problem

What happens when you copy from RStudio:

  • Pastes as low-resolution bitmap
  • Looks OK on screen (72 DPI)
  • Terrible when projected
  • Pixelated and blurry
  • Doesn’t scale well
  • The “Copy Plot to Clipboard” → “Copy as Metafile” option corrupts plot symbols. Save as SVG instead.

The Solution: Save First, Insert Second

Never copy-paste!

Instead:

  1. Save plot as file
  2. Insert file into PowerPoint
  3. Maintain quality

Vector Formats for PowerPoint

Options (three vector formats, plus a raster fallback):

  1. SVG: Good support in modern PowerPoint → Use this!

    • Editable after import
    • Preserves vector format
    • Exports as SVG/PDF maintain vector quality
  2. EMF: Windows only, obsolete and of no benefit in newer PowerPoint. Requires devEMF

  3. PDF: Very poorly supported! Low resolution import (rasterized!) and no editing

  4. PNG: If you must use raster, then 300+ DPI
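
Saving an SVG for a slide could look like the sketch below. It assumes the svglite package is installed (ggsave() relies on it for .svg files). The 16:9 size and the larger base font are choices made only for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_classic(base_size = 18)  # larger text for projection

# Write an SVG file that can then be inserted into PowerPoint as a picture
svg_file <- file.path(tempdir(), "slide_figure.svg")
ggsave(svg_file, plot = p, width = 10, height = 5.625, units = "in")
```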

SVG Editing in PowerPoint

SVG is Editable in PowerPoint

Modern PowerPoint supports SVG editing:

  1. Insert SVG file into PowerPoint
  2. Right-click → Ungroup (or Convert to Shape)
  3. Individual elements become editable
  4. Modify colors, text, positions
  5. Compose multi-panel figures
  6. Export as SVG/PDF to preserve vector format

PowerPoint can be your figure composition tool!

Ungrouping May Break Complex SVGs

Be careful when ungrouping:

  • Complex SVGs may lose gradients, patterns, or effects
  • Some plot elements might break apart unexpectedly
  • Text rendering may change
  • Clipping paths may be lost

Recommendation:

  • Keep a backup copy before ungrouping
  • Test with your specific plots first
  • For complex figures, consider Inkscape instead if editing is needed.

Factor Ordering

The Problem: Alphabetical Ordering

# Create data with logical categories
category_df <- data.frame(
  category = c("Low", "Medium", "High", "Very High", "Low", "High"),
  value = c(10, 20, 15, 30, 12, 25)
)

# R defaults to alphabetical!
p1 <- ggplot(category_df, aes(x = category, y = value)) +
  geom_col(fill = "coral") +
  theme_classic(base_size = 12) +
  labs(title = "Alphabetical (Wrong!)")

# Specify levels explicitly
category_df$category <- factor(category_df$category,
                       levels = c("Low", "Medium", "High", "Very High"))

# Now plots use logical order!
p2 <- ggplot(category_df, aes(x = category, y = value)) +
  geom_col(fill = "steelblue") +
  theme_classic(base_size = 12) +
  labs(title = "Logical Order (Correct!)")

Alphabetical order looks random and makes no sense!

Much better - order makes sense!

Solution: forcats Package

Part of tidyverse, designed for factor manipulation

library(forcats)

# Reorder car classes by median highway mpg
p <- mpg %>%
  ggplot(aes(x = fct_reorder(class, hwy, median),
             y = hwy)) +
  geom_boxplot(fill = "lightblue") +
  coord_flip() +
  labs(x = "Vehicle Class",
       y = "Highway MPG",
       title = "Ordered by median MPG")

Boxplots now ordered by median value!

fct_reorder(): Order by Another Variable

# Order diamond cuts by mean price
p <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(x = fct_reorder(cut, mean_price),
             y = mean_price,
             fill = cut)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Diamond Cut",
       y = "Mean Price ($)",
       title = "Cuts ordered by average price")

Bars ordered by length!

fct_infreq(): Order by Frequency

# Order by how common each vehicle class is
p <- ggplot(mpg,
            aes(x = fct_infreq(class))) +
  geom_bar(fill = "steelblue") +
  coord_flip() +
  labs(x = "Vehicle Class",
       y = "Count",
       title = "Most common classes first")

Most common category first!

Great for survey data

fct_inorder(): Order by Appearance

# Create data with specific order
treatment_data <- data.frame(
  treatment = c("Control", "Low Dose",
                "Medium Dose", "High Dose",
                "Control", "Low Dose",
                "Medium Dose", "High Dose"),
  response = c(10, 12, 15, 18,
               11, 13, 16, 19)
)

# Keep order as they appear in data
p <- ggplot(treatment_data,
       aes(x = fct_inorder(treatment),
           y = response)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Treatment", y = "Response")

Preserves the order from your data!

Natural Sorting: var1, var2, …, var10

The problem with alphabetical sorting:

# Create data with numbered variables
var_data <- data.frame(
  variable = rep(c("var1", "var2", "var10", "var20"), each = 5),
  value = rnorm(20, mean = rep(c(10, 15, 20, 25), each = 5), sd = 2)
)

# Alphabetical order: var1, var10, var2, var20 (wrong!)
p <- ggplot(var_data, aes(x = variable, y = value)) +
  geom_boxplot(fill = "coral") +
  labs(title = "Alphabetical: var1, var10, var2, var20")

Natural Sorting: The Solution

# Use gtools::mixedsort() for natural/alphanumeric sorting
library(gtools)

var_data$variable <- factor(var_data$variable,
                            levels = mixedsort(unique(var_data$variable)))

# Natural order: var1, var2, var10, var20 (correct!)
p <- ggplot(var_data, aes(x = variable, y = value)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Natural sort: var1, var2, var10, var20")

Note: forcats::fct_inseq() only works if factor levels are purely numeric strings (e.g., “1”, “2”, “10”), not mixed alphanumeric like “var1”, “var10”
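
If gtools is not available, a base-R sketch can give the same order for simple labels. It assumes every label contains exactly one run of digits (as in var1 ... var20). For anything more complex, mixedsort() is the safer choice.

```r
labels <- c("var1", "var10", "var2", "var20")

# Extract the numeric part of each label and order by it
num_part <- as.numeric(gsub("\\D", "", labels))
natural_levels <- labels[order(num_part)]
natural_levels

# Use the result as factor levels
f <- factor(c("var10", "var2", "var1", "var20"), levels = natural_levels)
levels(f)
```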

fct_rev(): Reverse Order

# Reverse the frequency order
p <- ggplot(mpg, aes(x = fct_rev(fct_infreq(class)))) +
  geom_bar(fill = "coral") +
  coord_flip() +
  labs(x = "Vehicle Class", y = "Count",
       title = "Least common classes first (reversed)")

Now least common first instead of most common!

fct_relevel(): Move Specific Levels

# Move "Control" to front for treatment groups
treatment_data <- data.frame(
  treatment = c("Low Dose", "High Dose", "Control",
                "Medium Dose", "Low Dose", "Control"),
  response = c(12, 18, 10, 15, 13, 11)
)

p <- treatment_data %>%
  mutate(treatment = fct_relevel(treatment, "Control")) %>%
  ggplot(aes(treatment, response)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Treatment", y = "Response")

Control always shown first

Common in experimental data!

Facet Ordering

# Order facets by median highway mpg for each vehicle class
p <- mpg %>%
  filter(class %in% c("pickup", "minivan", "compact")) %>%
  mutate(class = fct_reorder(class, hwy, median)) %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  facet_wrap(~class) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")

Facet panels in meaningful order!

⚠️ The Danger of Numeric Factors

Converting between factor and numeric can destroy your data!


# Original numeric data
doses <- c(10, 20, 50, 100, 200)
print(doses)
[1]  10  20  50 100 200
# Convert to factor (common in data import!)
dose_factor <- factor(doses)
print(dose_factor)
[1] 10  20  50  100 200
Levels: 10 20 50 100 200


# Looks fine, right? But look at the internal structure:
str(dose_factor)
 Factor w/ 5 levels "10","20","50",..: 1 2 3 4 5
levels(dose_factor)
[1] "10"  "20"  "50"  "100" "200"


# Try to convert back to numeric - WRONG!
as.numeric(dose_factor)
[1] 1 2 3 4 5
# Correct way: convert via character
as.numeric(as.character(dose_factor))
[1]  10  20  50 100 200


The danger: Many functions silently convert factors to integers!
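
Besides as.numeric(as.character(f)), an equivalent idiom indexes the numeric levels by the factor's integer codes. A small check on the doses from above, with the wrong conversion shown for comparison:

```r
doses <- c(10, 20, 50, 100, 200)
dose_factor <- factor(doses)

wrong  <- as.numeric(dose_factor)                        # integer codes
via_ch <- as.numeric(as.character(dose_factor))          # original values
via_lv <- as.numeric(levels(dose_factor))[dose_factor]   # same, via the levels

wrong
identical(via_ch, doses)
identical(via_lv, doses)
```

Both correct variants recover the original values. Only the direct as.numeric() call returns the codes 1 to 5.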

⚠️ Missing Levels: The Silent Data Corruption

Even worse: missing levels get renumbered!

  subject response
1       1       10
2       2       15
3       4       25
4       5       30
5       1       11
6       2       16
7       4       24
8       5       29
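# The subject column shown above (note: there is no subject 3)
subject <- c(1, 2, 4, 5, 1, 2, 4, 5)
# Convert to factor (happens during import!)
subject_factor <- factor(subject)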
str(subject_factor)
 Factor w/ 4 levels "1","2","4","5": 1 2 3 4 1 2 3 4

Levels: “1” “2” “4” “5” - looks OK…


# Try to convert back - DATA CORRUPTION!
as.numeric(subject_factor)
[1] 1 2 3 4 1 2 3 4

Returns: 1 2 3 4 1 2 3 4

Your subject 4 became 3!

Your subject 5 became 4!


This is catastrophic for analysis!

Your statistical models and plots will use the wrong subject numbers. Always use as.numeric(as.character(factor)), not as.numeric(factor).
Or better yet: NEVER use bare numbers for categorical data!

⚠️ Numeric Factors After Reordering

dose_data <- tibble(
  dose = factor(c(0, 10, 50, 100, 200)),
  response = c(5, 25, 80, 70, 30)
) %>%
  mutate(dose_ordered =
         fct_reorder(dose, response))

dose_data
# A tibble: 5 × 3
  dose  response dose_ordered
  <fct>    <dbl> <fct>       
1 0            5 0           
2 10          25 10          
3 50          80 50          
4 100         70 100         
5 200         30 200         
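# Try to use numerically - DISASTER!
# fct_reorder() changed the level order to 0, 10, 200, 100, 50 (sorted by response),
# so the integer codes no longer follow the dose size
dose_data <- dose_data %>%
  mutate(
    wrong = as.numeric(dose_ordered),
    correct = as.numeric(as.character(dose_ordered))
  )

dose_data
# A tibble: 5 × 5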
  dose  response dose_ordered wrong correct
  <fct>    <dbl> <fct>        <dbl>   <dbl>
1 0            5 0                1       0
2 10          25 10               2      10
3 50          80 50               5      50
4 100         70 100              4     100
5 200         30 200              3     200

Recommendations: Avoid Numeric Factors

Best Practices

  1. Never store numeric values as factors unless they represent categories
  2. Prefix categorical numbers: Use “Group_1”, “Group_2” instead of “1”, “2”
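  3. Check imported data: before R 4.0, read.csv() converted text columns to factors by default, so a numeric column with a single stray text entry was imported as a factor
  4. Use readr::read_csv() instead of read.csv() - better type detection, and it never converts columns to factors implicitly.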
  5. Always convert via character: as.numeric(as.character(factor)) not as.numeric(factor)
  6. Check with str() before analysis to verify data types
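
How a numeric column ends up as a factor can be reproduced with base R. The sketch below uses an inline CSV with one stray text entry and forces the pre-4.0 behaviour with stringsAsFactors = TRUE. The column names and the "n/a" marker are made up for illustration.

```r
csv_text <- "dose,response
10,5
20,n/a
50,80"

# Old default behaviour: the stray "n/a" turns the whole column into a factor
old_style <- read.csv(text = csv_text, stringsAsFactors = TRUE)
sapply(old_style, class)

# Declaring the missing-value marker keeps the column numeric
fixed <- read.csv(text = csv_text, na.strings = "n/a")
sapply(fixed, class)
```

Checking the column classes right after import catches this before any conversion back to numbers is needed.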

Bad: Numeric categories

# DON'T do this - ambiguous!
groups <- factor(c(1, 2, 4, 5))
str(groups)
 Factor w/ 4 levels "1","2","4","5": 1 2 3 4
# Are these numbers or categories?

Good: Prefixed categories

# DO this - clearly categorical!
groups <- factor(c("Group_1", "Group_2",
                   "Group_4", "Group_5"))
str(groups)
 Factor w/ 4 levels "Group_1","Group_2",..: 1 2 3 4
# Obviously categories, safe!

forcats Cheat Sheet

library(forcats)

fct_reorder(f, x, fun)   # Order by another variable
fct_infreq(f)            # Order by frequency
fct_inorder(f)           # Order by appearance in data
fct_inseq(f)             # Order by numeric value (if purely numeric)
fct_rev(f)               # Reverse current order
fct_relevel(f, "A", "B") # Move specific levels to front
fct_recode(f, new = "old") # Rename levels
fct_lump_n(f, n = 5)     # Keep top n, lump others as "Other"
fct_na_value_to_level(f) # Make NA a visible level (replaces deprecated fct_explicit_na())

# For natural sort (var1, var2, var10):
factor(x, levels = gtools::mixedsort(unique(x)))

Debugging Factor Issues

# Example: vehicle classes in mpg dataset
vehicle_class <- factor(mpg$class)

# Check level order
vehicle_class %>% levels()
[1] "2seater"    "compact"    "midsize"    "minivan"    "pickup"    
[6] "subcompact" "suv"       
# See factor structure (shows integer encoding)
str(vehicle_class)
 Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
# Check how many observations per level
table(vehicle_class)
vehicle_class
   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
# Convert to character if needed for text operations
class_char <- as.character(vehicle_class)

Key Takeaways & Best Practices

Remember

The Problem:

  • R defaults to alphabetical factor ordering - usually wrong!
  • Numeric factors are dangerous - converting destroys your data
  • Missing levels get renumbered - group 4 becomes 3!

The Solutions:

  • forcats package provides powerful ordering tools:
    • fct_reorder() orders by another variable (most useful!)
    • fct_infreq() orders by frequency
    • fct_relevel() to put control/baseline first
    • fct_rev() to reverse order
  • Manual levels for logical ordering (Low/Med/High, months, etc.)
  • Prefix categorical numbers: “Group_1” not “1”

Always:

  • Check factor order before plotting with str() or levels()
  • Convert via character: as.numeric(as.character(f)) not as.numeric(f)
  • Think about your reader - what order makes sense?

Interactive Plots

The Appeal of Interactivity

Why use interactive plots?

  • 🖱️ Hover to see exact values
  • 🔍 Zoom and pan
  • 👁️ Toggle traces on/off
  • 📊 Great for exploration
  • 🎯 Excellent for presentations

Basic ggplotly Example

library(ggplot2)
library(plotly)

# Create clean data for tooltips
mtcars_clean <- mtcars
mtcars_clean$Cylinders <- factor(mtcars$cyl)

# Create ggplot
p <- ggplot(mtcars_clean, aes(wt, mpg, color = Cylinders)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Fuel Efficiency by Weight",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon",
       color = "Cylinders")

Try hovering over points!

  • See exact values
  • Toggle series on/off
  • Zoom and pan
# Make interactive
ggplotly(p)

Customizing Tooltips

# Control what appears on hover
mtcars_clean <- mtcars
mtcars_clean$Cylinders <- factor(mtcars$cyl)

p <- ggplot(mtcars_clean, aes(wt, mpg, color = Cylinders,
                        text = paste("Car:", rownames(mtcars),
                                   "<br>HP:", hp))) +
  geom_point(size = 3) +
  theme_minimal()

Now hover shows car name and horsepower!

Use the text aesthetic and tooltip parameter to customize what appears on hover.

ggplotly(p, tooltip = c("text", "x", "y"))

The Saving Problem

Challenge

Interactive plots are HTML widgets, not static images!

Can’t just save as PDF or PNG traditionally.

Solution: Save as HTML (Keep Interactivity)

library(htmlwidgets)

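# Make an existing ggplot object (p) interactive
p_interactive <- ggplotly(p)

# Make sure the output directory exists
dir.create("plots/10_interactive", recursive = TRUE, showWarnings = FALSE)

# Save as self-contained HTML
saveWidget(p_interactive,
           "plots/10_interactive/interactive_plot.html",
           selfcontained = TRUE)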

About selfcontained parameter:

  • selfcontained = TRUE: Bundles all JavaScript/CSS into one file
    • Perfect for emailing or sharing
    • No external dependencies needed
    • Larger file size
  • selfcontained = FALSE: Creates separate library files
    • Smaller main HTML file
    • Requires folder structure to be maintained

Uses:

  • Email the HTML file directly
  • Upload to website
  • Opens in any web browser!

Quarto/Markdown Integration

In Quarto, ggplotly works seamlessly:

```{r}
#| label: quarto-example
#| eval: false

library(plotly)
library(ggplot2)

# Create plot
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Make it interactive
ggplotly(p)
```

Perfect for modern scientific reports!

3D Plots

# 3D scatter plot
mtcars_3d <- mtcars
mtcars_3d$Cylinders <- factor(mtcars$cyl)

plot_ly(mtcars_3d,
        x = ~wt, y = ~hp, z = ~mpg,
        color = ~Cylinders,
        type = "scatter3d",
        mode = "markers")

Rotate by clicking and dragging!

Great for exploring multivariate data in 3D space.

Linked Plots with Crosstalk

library(plotly)
library(crosstalk)

# Create shared data, keyed by car name
mtcars$car_name <- rownames(mtcars)
shared_data <- SharedData$new(mtcars, ~car_name)

# Create linked plots
p1 <- plot_ly(shared_data, x = ~wt, y = ~mpg, type = 'scatter', mode = 'markers')
p2 <- plot_ly(shared_data, x = ~hp, y = ~mpg, type = 'scatter', mode = 'markers')

# Display with filter on top
bscols(widths = 12, filter_checkbox("cyl", "Cylinders:", shared_data, ~cyl, inline = TRUE))
subplot(p1, p2, nrows = 1, shareY = TRUE)

Hybrid Approach: Both Versions

# Create output directory if needed
dir.create("plots/10_interactive", recursive = TRUE, showWarnings = FALSE)

# Create base plot
p_static <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  theme_classic(base_size = 12)

# Static version for publication
ggsave("plots/10_interactive/figure1.pdf", p_static, width = 7, height = 5)

# Interactive version for supplement/website
p_interactive <- ggplotly(p_static)
saveWidget(p_interactive, "plots/10_interactive/figure1_interactive.html")

# Best of both worlds!

Sharing Interactive Plots

Options:

  1. Email HTML file (if selfcontained = TRUE)
  2. Upload to web server
  3. GitHub Pages (free hosting)
  4. Shiny app (for more complex interactions)

Limitations of ggplotly

Not all ggplot2 features convert:

  • Some geoms don’t translate well
  • Complex annotations may be lost
  • Custom themes partially supported
  • Facets work but can be slow

Test your conversion!

Alternative: ggiraph

library(ggiraph)

# Create plot with interactive elements
mtcars$car_name <- rownames(mtcars)

# Create rich tooltips with HTML formatting
mtcars$tooltip_text <- paste0(
  "<b>", mtcars$car_name, "</b><br>",
  "Weight: ", round(mtcars$wt, 2), " (1000 lbs)<br>",
  "MPG: ", mtcars$mpg, "<br>",
  "HP: ", mtcars$hp, "<br>",
  "Cylinders: ", mtcars$cyl
)

p <- ggplot(mtcars, aes(wt, mpg,
                        tooltip = tooltip_text,
                        data_id = car_name)) +
  geom_point_interactive(aes(color = factor(cyl)), size = 3) +
  theme_minimal()
girafe(ggobj = p)

ggiraph offers better ggplot2 compatibility:

  • Built specifically for ggplot2 (not a conversion layer)
  • Preserves more complex themes and annotations
  • Better control over tooltips and interactions
  • Uses special _interactive geoms

Hover to see tooltips, click to select!

Key Takeaways

Best Practices & Summary

Core concepts:

  • Interactive plots are HTML widgets, not images
  • Use ggplotly() to convert ggplot2 plots instantly
  • Use saveWidget() with selfcontained = TRUE to save

When to use what:

  • Interactive: exploration, presentations, HTML reports
  • Static: publications, print, PDF reports

Tips:

  1. Create both versions when possible
  2. Sample large datasets to keep HTML file manageable
  3. Test thoroughly - not all ggplot2 features convert
  4. Consider ggiraph if ggplotly doesn’t preserve your styling
  5. Use selfcontained = TRUE for easy sharing
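
Tip 2 could be sketched as follows: draw a reproducible random subset before building the interactive plot. The data here are simulated, and the 5,000-row cap is an arbitrary choice for illustration.

```r
set.seed(42)  # reproducible sample

# Simulated large dataset
big <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

# Keep at most max_rows rows for the interactive version
max_rows <- 5000
keep  <- sample(nrow(big), min(max_rows, nrow(big)))
small <- big[keep, ]

nrow(small)
```

The static figure can still be drawn from the full data. Only the HTML widget uses the subset.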

References

Key Papers and Books

Key Papers

Essential research on color use, perception, and data visualization:

  • Crameri, Shephard, and Heron (2020) - The misuse of color in science communication
  • Borkin et al. (2011) - Evaluation of visualization effectiveness
  • Gołębiowska and Çöltekin (2022) - Problems with rainbow color schemes
  • Heron, Crameri, and Shephard (2021) - Rainbow colormaps and their issues

Books

Color Resources

Tools

  • ColorBrewer - Interactive tool for selecting color palettes for maps and data visualization
  • Coblis - Color Blindness Simulator - Upload images to see how they appear with different types of color vision deficiency

R Packages

  • viridis - Perceptually uniform color scales - https://cran.r-project.org/package=viridis
  • RColorBrewer - ColorBrewer palettes for R
  • colorspace - Color manipulation and assessment
  • dichromat - Simulate color blindness
  • paletteer - Collection of 2000+ palettes

Getting help

Documentation

Community

Software

Essential Tools

  • R - The R programming language
  • RStudio - Integrated development environment
  • Quarto - Scientific and technical publishing system

Helpful Extensions

Inspiration

Outstanding Examples

  • This presentation: https://stanstrup.github.io/figure_presentation/
  • Heatmaps guide: https://stanstrup.github.io/heatmaps.html

Contributing

Found a useful resource not listed here? Contributions are welcome!

  • GitHub: https://github.com/stanstrup/figure_presentation
  • Issues: Report problems or suggest additions
  • Pull Requests: Contribute improvements

Citation

If you find this guide useful in your work, please cite:

Stanstrup, J. (2025). Academic Figures: Common Pitfalls and Best Practices.
https://stanstrup.github.io/figure_presentation/