More on data visualization

More details on data visualization. Know-how and -why of aesthetics. Basic themes and extensions. Guides: axes, labels, scales, breaks, and legends. Statistical transformations. Text and annotations.

Published

February 14th, 2025

1 Prerequisites

  • We examine some aspects of ggplot2 in more detail.
library(ggplot2)
  • We will use the eu_ict and ict data frames in our visualizations.
eu_ict <- readRDS("data/eu_ict.rds")
ict <- readRDS("data/ict.rds")

2 Programming digression: argument passing

  • This digression is not strictly necessary for what follows.

2.1 Argument order and named arguments

  • There are three ways to pass arguments to functions in R.
  1. By exact matching.
  2. By partial matching.
  3. By position.
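
  • The three matching modes can be compared side by side on a small user-defined function. The sketch below uses a hypothetical power() helper (an assumption for illustration, not part of the original material); all three calls should evaluate to the same value.

```r
# Hypothetical helper used only for illustration.
power <- function(base, exponent = 2) {
  base^exponent
}

# 1. Exact matching: full argument names.
by_exact <- power(base = 3, exponent = 2)

# 2. Partial matching: unique leading characters of the names.
by_partial <- power(b = 3, e = 2)

# 3. Positional matching: values only, in signature order.
by_position <- power(3, 2)

c(by_exact, by_partial, by_position)
```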

2.2 Exact matching

  1. By exact matching.
  • We specify the argument name and the value.
  • The names of the arguments are documented in the help files of the functions.
  • Documentation is available with ?function_name.

Exact matching

  1. By exact matching.
  • From ?ggplot, we see that the signature of ggplot() is:
Usage:

     ggplot(data = NULL, mapping = aes(), ..., 
            environment = parent.frame())
ggplot(
  data = eu_ict,
  mapping = aes(
    x = output, 
    y = ict_percentage
  )
) +
  geom_point()

2.3 Partial matching

  2. By partial matching.
  • We specify partially the argument name and the value.
  • We can provide the minimum number of initial characters (or more) that uniquely identify the argument.

Partial matching

  2. By partial matching.
ggplot(
  d = eu_ict,
  m = aes(
    x = output, 
    y = ict_percentage
  )
) +
  geom_point()

ggplot(
  d = eu_ict,
  map = aes(
    x = output, 
    y = ict_percentage
  )
) +
  geom_point()

2.4 Positional arguments

  3. By position.
  • We specify only values.
  • Arguments are matched by their position in the function signature.

Positional arguments

  3. By position.
Usage:

     ggplot(data = NULL, mapping = aes(), ..., 
            environment = parent.frame())
ggplot(
  eu_ict,
  aes(x = output, y = ict_percentage)
) +
  geom_point()

Positional arguments

  • The same applies to the aes() function.
Usage:

     aes(x, y, ...)
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point()

2.5 Why so many ways?

  • Each approach has pros and cons.

Why so many ways?

  • Exact matching is quite verbose.
  • But it is self-documenting and less error-prone.
  • It is a good practice to use it when readability is important. For example:
    • In scripts that are shared with others who are not familiar with the functions used.
    • When calling functions that you do not use frequently, and you might need to revisit the code after a long time.

Why so many ways?

  • Positional matching is concise.
  • It is a good practice to use it for commonly used functions where the risk of confusion is low. E.g.,

    ifelse(x > 10, "large", "small")

    instead of

    ifelse(test = x > 10, yes = "large", no = "small")
  • Using it in R’s command line for experimentation can be easier.

  • But it can make reading code less self-contained.
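
  • Partial matching has a caveat worth knowing: arguments that come after ... in a signature are only matched by their full name. A minimal sketch with a hypothetical describe() function (an assumption for illustration, not from the slides):

```r
# Hypothetical function: `verbose` is declared after `...`.
describe <- function(x, ..., verbose = FALSE) {
  if (verbose) "verbose output" else "short output"
}

# The abbreviated name is absorbed by `...`, so `verbose` keeps its default.
abbreviated <- describe(1, verb = TRUE)

# The full name is matched exactly and takes effect.
full_name <- describe(1, verbose = TRUE)

c(abbreviated, full_name)
```

  • The same holds for the environment argument of ggplot(), which follows ... in the signature quoted earlier.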

3 Aesthetics mappings

3.1 Defining aesthetics

  • Aesthetics can be defined at various levels when creating a plot.
  • Globally in the ggplot() call.
ggplot(
  eu_ict,
  aes(
    output, ict_percentage, 
    color = income
  )
) +
  geom_point()

Defining aesthetics

  • Aesthetics can be defined at various levels when creating a plot.
  • Locally at the level of each layer.
ggplot(eu_ict) +
  geom_point(
    aes(
      output, ict_percentage, 
      color = income
    )
  )

Defining aesthetics

  • Aesthetics can be defined at various levels when creating a plot.
  • Mixed at the ggplot() and layer level.
  1. How does this work?
  2. Why does it work in this way?
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income))

3.2 Defining aesthetics: how

  1. How does this work?
  • Global assignments in ggplot() affect all layers in the plot.
  • Local assignments in geom_*() only apply to that layer.
  • Local assignments take precedence over (override) global assignments.
ggplot(
  eu_ict,
  aes(
    output, ict_percentage, 
    color=income
  )
) +
  geom_point(aes(color = EU))

3.3 Defining aesthetics: why

  1. Why does it work in this way?
  • We can specify the aesthetics of certain layers with attributes that we do not want to apply globally.

3.4 Defining aesthetics: why

  1. Why does it work in this way?
ggplot(
  eu_ict, 
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(
  eu_ict, 
  aes(
    output, ict_percentage, 
    color = income
  )
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Defining aesthetics

  1. Why does it work in this way?
  • We also have the flexibility to extend the aesthetics of distinct geometric objects differently.
ggplot(
  eu_ict, aes(output, ict_percentage)
) +
  geom_point(
    aes(color = income, shape = EU)
  ) +
  geom_smooth(
    method="lm", aes(color = EU)
  )
`geom_smooth()` using formula = 'y ~ x'

4 Theming

  • Aesthetics modify the appearance of the plot’s data elements.
  • How can we modify the appearance of the plot’s non-data elements (e.g., axes, background, grid)?
  • The ggplot2 package provides a set of eight basic themes.

4.1 Basic themes

Basic themes

  • We can apply themes to the plot by adding them with the + operator.
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE
  ) +
  theme_classic()

ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE
  ) +
  theme_bw()

Basic themes

  • Themes only affect the non-data elements of the plot.
  • To modify the appearance of the data elements, we still need to use aesthetics.
ggplot(
  eu_ict,
  aes(output, ict_percentage, color = income, shape = EU)
) +
  geom_point() +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE,
    color = "black"
  ) +
  theme_bw() +
  scale_color_grey()

Basic themes

  • Themes only affect the non-data elements of the plot.
  • Here, we have combined the theme_bw() with scale_color_grey() to modify the appearance of the data elements.
  • In addition, we have explicitly specified the color of the geom_smooth() object to be black.

4.2 Additional themes

  • If the basic themes do not meet the stylistic requirements that you want or are being asked to follow, taking a look at the ggthemes package is a good idea.

Additional themes

  • The ggthemes package provides additional themes that might match the desired style.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(aes(linetype = EU, group = EU), se = FALSE) +
  ggthemes::theme_excel_new()

5 Guides

  • Guides are reference lines, grids, or markers assisting in interpreting the geometric object of the visualization.
  • Axes and legends are the two guides that are most commonly modified in a visualization to facilitate communication.

5.1 Axes

  • Axes are the typically horizontal and vertical lines that specify the coordinate system of the plot area.
  • Axes have breaks (ticks) and labels.
    • Breaks are the marked points of an axis.
    • Labels are the text accompanying the breaks and provide interpretation context for the axis.

5.2 Labels

Labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")
  • How can we modify the breaks of an axis?
  • How can we modify the labels of these breaks?

Labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")
  • Not very intuitively, the breaks and labels of an axis are not modified through labs().
  • This is because ggplot2 does some heavy lifting for us when drawing the axes of a plot.
  • Recall that we have used the same calling interface for creating plots with continuous and discrete axes variables (e.g., geom_point and geom_bar).

5.3 Scales

  • In the background, ggplot2 automatically adjusts the axes based on the type of the variable we provide.
  • It does so by using the scale_*() family of functions.
  • Scales are instructions controlling how certain aesthetic mappings are translated into visual properties.
  • For example, a continuous scale maps the values of an aesthetic to a continuous axis range.

Scales

  • In ggplot2, continuous variables in geom_point() objects are automatically assigned to a continuous scale scale_x_continuous().
  • Discrete variables in geom_bar() objects are automatically assigned to a discrete scale scale_x_discrete().
  • We can modify the default behavior and the appearance of axes by explicitly calling the scale_*() functions.
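
  • As a sketch of calling a scale explicitly, the example below (using the mpg data shipped with ggplot2 as an assumed stand-in for the course data) adds scale_x_discrete() to a bar chart and uses its labels argument to rename the categories.

```r
library(ggplot2)

# `drv` is a character column, so the x-axis gets a discrete scale by default.
p_default <- ggplot(mpg, aes(drv)) +
  geom_bar()

# Adding the scale explicitly lets us change how its values are labelled.
p_custom <- ggplot(mpg, aes(drv)) +
  geom_bar() +
  scale_x_discrete(
    labels = c("4" = "Four-wheel", "f" = "Front-wheel", "r" = "Rear-wheel")
  )
```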

5.4 Breaks

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %") +
  scale_x_continuous(
    breaks = c(2004, 2014, 2023)
  )

  • We can pass directly the breaks we want to have on a continuous axis using the breaks argument.
  • For instance, if we want to have all the years as breaks, we can pass the year column of the ict data frame.

Breaks

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line()  +
  labs(x = "Year", y = "ICT employment %") +
  scale_x_continuous(breaks = ict$year)
  • In addition, if we want to modify the labels of the breaks, we can use the labels argument of scale_x_continuous().
  • Suppose, for example, that instead of having the years on the x-axis, we want to have labels formatted as Year YYYY, where YYYY is the year.

5.5 Breaks and their labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  scale_x_continuous(
    breaks = seq(2004, 2023, 2),
    labels = paste("Year", seq(2004, 2023, 2))
  )

5.6 Programming digression: creating sequences

  • We have used the seq() function to create the breaks and labels of the x-axis.
  • The seq() function creates sequences of numbers.
  • There are a few ways to create sequences in R.

Programming digression: creating sequences

  • The legacy way of creating sequences is to use the : operator.
  • The : operator is used with infix notation.
  • It takes two arguments, from and to, and creates a sequence of integers from from to to.
10:20
 [1] 10 11 12 13 14 15 16 17 18 19 20

Programming digression: creating sequences

  • The : operator has a few disadvantages.
  • First, it only works with a step of 1, or -1 if from is larger than to.
  • Second, it can be error-prone when combined with arithmetic operations.
1:3 * 2
[1] 2 4 6
1:(3 * 2)
[1] 1 2 3 4 5 6
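
  • A frequent instance of this pitfall is subtracting from the upper bound of a range. The sketch below uses a hypothetical variable n; because : binds more tightly than binary arithmetic operators, 1:n - 1 shifts the whole sequence instead of shortening it.

```r
n <- 5  # hypothetical upper bound

shifted <- 1:n - 1      # parsed as (1:n) - 1
shortened <- 1:(n - 1)  # parentheses give the intended range

shifted
shortened

# The step always has absolute value 1; its direction depends on the bounds.
descending <- 5:1
descending
```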

Programming digression: creating sequences

  • A safer and more flexible way to create sequences is to use the seq() function.
  • The seq() function can create sequences with an arbitrary step size.
seq(1, 3, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0

Programming digression: creating sequences

  • A safer and more flexible way to create sequences is to use the seq() function.
  • Or it can create sequences between two numbers with a specific length.
seq(1, 3, length.out = 10)
 [1] 1.000000 1.222222 1.444444 1.666667 1.888889 2.111111 2.333333 2.555556
 [9] 2.777778 3.000000

Programming digression: creating sequences

  • A safer and more flexible way to create sequences is to use the seq() function.
  • Further, compared to the : operator, there is less risk of confusion when combining seq() with arithmetic operations.
seq(1, 3) * 2
[1] 2 4 6
seq(1, 3 * 2)
[1] 1 2 3 4 5 6

Programming digression: creating sequences

  • There are two very useful siblings of seq(), named seq_along() and seq_len().

Programming digression: creating sequences

  • The seq_along() function creates a sequence of integers from 1 to the length of the input vector.
v1 <- c("a", "b", "c")
seq_along(v1)
[1] 1 2 3
  • This is useful when we want to enumerate the elements of a vector.
  • Compared to:
v1 <- c("a", "b", "c")
seq(length(v1))
[1] 1 2 3

Programming digression: creating sequences

  • The seq_len() function creates a sequence of integers from 1 to the input number.
seq_len(5)
[1] 1 2 3 4 5
  • It gives the same result as seq(1, 5).
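
  • The difference between these helpers shows up with empty inputs. In the sketch below (the empty vector is a made-up example), seq(length(empty)) evaluates seq(0) and counts down from 1 to 0, whereas seq_along() and seq_len() return empty sequences. This matters when the sequence drives a loop.

```r
empty <- character(0)  # made-up example of an empty vector

from_seq <- seq(length(empty))      # seq(0): counts from 1 down to 0
from_along <- seq_along(empty)      # empty integer vector
from_len <- seq_len(length(empty))  # empty integer vector

# Count how many times a loop body would run with each sequence.
iterations_seq <- 0
for (i in from_seq) iterations_seq <- iterations_seq + 1

iterations_along <- 0
for (i in from_along) iterations_along <- iterations_along + 1

c(iterations_seq, iterations_along)
```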

5.7 Rotating breaks’ labels

  • When plotting high-frequency time-series data, the labels of the breaks on the horizontal axis can get crowded.
  • One common way to address this issue is to rotate the labels.
  • Rotating the labels does not affect the breaks or the labels themselves, only their orientation.
  • We can rotate the breaks’ labels using the theme() function.

Rotating breaks’ labels

  • The theme() function has (a lot of) options for modifying the plot’s theming.
  • We can modify the appearance of the axes using the axis.text.x and axis.text.y arguments.
  • The element_text() function is used to modify the appearance of the labels’ text.
  • Rotating the labels is done by setting the angle argument to the desired angle (in degrees).
  • The vjust and hjust arguments control the vertical and horizontal justification of the text.

Rotating breaks’ labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  scale_x_continuous(breaks = ict$year, labels = ict$year) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))

5.8 Legends

  • Another useful option exposed by theme() is the legend.position argument.
ict |>
  dplyr::filter(geo %in% sample(unique(ict$geo), size = 5)) |>
  ggplot(aes(year, ict_percentage, color = geo)) +
  geom_line() +
  theme(legend.position = "top")

Legends

  • The legend.position argument can take the following values:
    • "none": no legend is displayed.
    • "left", "right", "top", "bottom": the legend is displayed on the left, right, top, or bottom of the plot area.
    • "inside": the legend is displayed inside the plot area.

Legends

  • Besides customization via theme(), legends can be modified using the guides() function.
  • The guides() function offers more fine-grained control over the appearance of the legend.
  • For example, we can modify the number of rows or columns of the legend.
  • Or we can override the size and shape of the legend markers.

Legends

ict |>
  dplyr::filter(
    geo %in% sample(unique(ict$geo), size = 8)
  ) |>
  ggplot(
    aes(year, ict_percentage, color = geo)
  ) +
  geom_line() + 
  theme(legend.position = "top") +
  guides(
    color = guide_legend(
      title = "Country",
      nrow = 2,
      override.aes = list(linewidth = 4)
    )
  )

6 Statistical transformations

  • How are data mapped to geometric objects?
  • When creating a bar chart, we pass one column to geom_bar(), and the function automatically calculates the height of the bars.
  • When creating a density plot, we pass one column to geom_density(), and the function automatically calculates the density of the data.

6.1 Behind every geometric object

  • How are data mapped to geometric objects?
  • This pattern is common in ggplot2.
  • The passed data is transformed into a new form that is used to create the plot.
  • We examine some more details of the statistical transformations taking place behind the scenes when creating geometric objects.

6.2 The statistic behind geom_bar()

eu_ict |>
  ggplot(aes(income)) +
  geom_bar()

  • Where is the count variable of the vertical axis coming from?

The statistic behind geom_bar()

  • Where is the count variable of the vertical axis coming from?
eu_ict |>
  ggplot(aes(income)) +
  geom_bar()
  • We have never defined count as an aesthetic.
  • Even stranger, count is not among the columns of the eu_ict dataset.
names(eu_ict)
[1] "geo"            "EU"             "ict_percentage" "income"        
[5] "output"        

The statistic behind geom_bar()

  • Examining the documentation of geom_bar(), we observe that there is a stat argument that defaults to "count".

Usage:

 geom_bar(
   mapping = NULL,
   data = NULL,
   stat = "count",
   position = "stack",
   ...,
   just = 0.5,
   width = NULL,
   na.rm = FALSE,
   orientation = NA,
   show.legend = NA,
   inherit.aes = TRUE
 )

The statistic behind geom_bar()

  • Behind the scenes, geom_bar() calculates the number of times each value of income is found in the data.
eu_ict |>
  dplyr::count(income)
# A tibble: 3 × 2
  income     n
  <fct>  <int>
1 low        3
2 middle    17
3 high      12
  • And then uses the new variable to set the heights.

The statistic behind geom_bar()

  • We can manually replicate the calculation and instruct geom_bar() not to perform any further transformation.
  • Instructing a geom_* function not to apply any statistical transformation to the input data is done by passing stat = "identity" to the function.
eu_ict |>
  dplyr::count(income) |>
  dplyr::rename(count = n) |>
  ggplot(aes(income, count)) +
  geom_bar(stat = "identity")
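
  • For pre-computed heights, ggplot2 also provides geom_col(), which uses the identity statistic by default. A minimal sketch, using the mpg data shipped with ggplot2 as an assumed stand-in for eu_ict and base R's table() for the counts:

```r
library(ggplot2)

# Count the observations per class manually.
class_counts <- as.data.frame(table(mpg$class))
names(class_counts) <- c("class", "count")

# geom_col() draws the bars at the supplied heights without transforming them.
p_col <- ggplot(class_counts, aes(class, count)) +
  geom_col()

# The same plot via geom_bar() with the statistical transformation switched off.
p_bar <- ggplot(class_counts, aes(class, count)) +
  geom_bar(stat = "identity")
```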

6.3 The statistics behind geom_smooth()

  • Other geom_* functions calculate different statistics by default.
  • For instance, geom_smooth() calculates fitted values, standard errors, and confidence intervals.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  stat_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

6.4 Programming digression: formulas

  • How can we replicate the statistics of geom_smooth()?
  • The "lm" part of the method argument stands for linear model.
  • Linear models are statistical models having linear relationships between the dependent and independent variables.
  • For example, classic linear regressions are linear models.

Programming digression: formulas

  • In R, we have a neat way to define statistical models using formulas.
  • Using formulas with ggplot2 and statistical functions allows us to focus on relationships in the data and leave the details of the statistical calculations to the functions.

Programming digression: formulas

  • A basic formula in R has two main parts:

    • The left-hand side (LHS) of the formula is the dependent variable.
    • The right-hand side (RHS) of the formula is the independent variable(s).
    • The two parts are separated by a tilde ~ symbol.
  • For example:
ict_percentage ~ output
ict_percentage ~ output

Programming digression: formulas

  • Had we had more than one independent variable, we would have written:
ict_percentage ~ output + ind_var2 + ind_var3
ict_percentage ~ output + ind_var2 + ind_var3
  • Note that we used the variables ind_var2 and ind_var3 in the formula, which were neither defined nor exist in any of our datasets.
  • And R does not complain about it.

  • This is because the formula does not actually calculate anything.
  • It is an unevaluated expression that explains the logic of the model.
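
  • This can be verified directly in base R. In the sketch below, none of the variables exist in the session, yet the formula object is created and can be inspected.

```r
# None of these variables are defined; the formula is only a description.
model_formula <- ict_percentage ~ output + ind_var2 + ind_var3

class(model_formula)     # the type of the object
all.vars(model_formula)  # names of the variables the formula refers to

# The two sides can be extracted as unevaluated expressions.
lhs <- model_formula[[2]]
rhs <- model_formula[[3]]
```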

Linear regressions

  • We can fit a linear model using formulas with the lm() function.
fit <- lm(ict_percentage ~ output, eu_ict)
  • The first argument of lm() is the formula we want to estimate.
  • The second argument is the dataset.
  • The lm() function automatically searches for the formula variables in the dataset and fits the model.

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • Symbolically, we estimated the model,

    \[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \varepsilon_{i}, \]

    • \(y_{i}\) is the ict_percentage,
    • \(x_{i}\) is the output,
    • \(\beta_{0}\) is the intercept,
    • \(\beta_{1}\) is the slope, and
    • \(\varepsilon_{i}\) is the error term.
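
  • The estimates of \(\beta_{0}\) and \(\beta_{1}\) can be read from the fitted object with coef(). The sketch below uses the built-in mtcars data as an assumed stand-in for eu_ict, regressing mpg on wt.

```r
# Built-in data used as a stand-in for the course data.
fit_cars <- lm(mpg ~ wt, data = mtcars)

estimates <- coef(fit_cars)
intercept <- estimates[["(Intercept)"]]  # estimate of beta_0
slope <- estimates[["wt"]]               # estimate of beta_1

# Fitted values are the intercept plus the slope times the regressor.
manual_fit <- intercept + slope * mtcars$wt
all.equal(unname(fitted(fit_cars)), manual_fit)
```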

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the predicted values from the model?

    \[ \hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i} \]

pred_y <- predict(fit)

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the confidence intervals?

\[ \mathrm{CI}(\hat{y}_{i}) = \left[ \hat{y}_{i} - t_{\alpha/2} \, \sigma(\hat{y}_{i}), \; \hat{y}_{i} + t_{\alpha/2} \, \sigma(\hat{y}_{i}) \right] \]

where

\[ \sigma(\hat{y}_{i}) = \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_{i} - \bar{x})^{2}}{\sum_{j=1}^{n} (x_{j} - \bar{x})^{2}}} \]
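
  • The interval formula can be checked numerically. The sketch below assumes the built-in mtcars data as a stand-in for eu_ict, a 95% level, and a t quantile with n - 2 degrees of freedom (an assumption about the meaning of \(t_{\alpha/2}\) that matches the defaults of predict() for a simple regression).

```r
# Built-in data used as a stand-in for the course data.
fit_cars <- lm(mpg ~ wt, data = mtcars)

x <- mtcars$wt
n <- length(x)
y_hat <- fitted(fit_cars)

# Residual standard error (sigma hat) and standard error of each fitted value.
sigma_hat <- summary(fit_cars)$sigma
se_fit <- sigma_hat * sqrt(1 / n + (x - mean(x))^2 / sum((x - mean(x))^2))

# Critical value for a two-sided 95% interval.
t_crit <- qt(0.975, df = n - 2)

manual <- cbind(
  fit = y_hat,
  lwr = y_hat - t_crit * se_fit,
  upr = y_hat + t_crit * se_fit
)

# Reference intervals computed by predict().
builtin <- predict(fit_cars, interval = "confidence")

all.equal(unname(manual), unname(builtin))
```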

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the confidence intervals?
pred_y <- predict(fit, interval = "confidence")
  • When passing interval = "confidence", the predict() function returns a matrix with three columns:
    • fit: the predicted values,
    • lwr: the lower bound of the confidence interval, and
    • upr: the upper bound of the confidence interval.
  • We can examine the first few rows with head().

head(pred_y)
       fit      lwr      upr
1 5.277672 4.814736 5.740609
2 5.251859 4.792837 5.710880
3 3.871531 3.201614 4.541448
4 6.498425 5.648143 7.348707
5 4.865592 4.427362 5.303822
6 4.368092 3.852660 4.883525
  • We can now replicate the statistics of geom_smooth().

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
pred_y <- predict(fit, interval = "confidence")
eu_ict |>
  dplyr::mutate(
    pred = pred_y[, "fit"],
    ymin = pred_y[, "lwr"],
    ymax = pred_y[, "upr"]
  ) |>
  ggplot(aes(output)) +
  geom_line(aes(y = pred), color = "blue", linewidth = 1) +
  geom_ribbon(
    aes(ymin = ymin, ymax = ymax),
    fill = "darkgray",
    alpha = 0.5
  )

eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  stat_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

7 Text and annotations

  • A common pattern in data science visualizations is the use of annotations and text.
  • Text and annotations are commonly used to:
    • Provide context
    • Highlight specific data points
    • Explain the data
    • Add captions

7.1 Captions

  • Captions can be effortlessly added to a figure with labs().

Captions

ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = EU)) +
  geom_smooth(method = "lm") +
  labs(caption = "Data Source: Eurostat")
`geom_smooth()` using formula = 'y ~ x'

7.2 Labels with Formulas

  • On some occasions, data scientists want to include the underlying mathematical details of their work in a visualization.
  • For example, suppose we are working with the function \[ f(x) = \sin(2x) + e^{-x/10} \cdot \cos(x) \]
  • We can create a plot of \(f\) with geom_line().

Labels with Formulas

\[ f(x) = \sin(2x) + e^{-x/10} \cdot \cos(x) \]

data <- data.frame(x = seq(0, 50, .01))
ggplot(data, aes(x = x)) +
  geom_line(aes(y = sin(2*x) + exp(-x/10) * cos(x)), color = "red")

Labels with Formulas

  • Notice that ggplot2 writes the formula expression as a string in the vertical axis label.
  • However, the formatting used is rather unusual for human readers.
  • For instance, multiplication is denoted with *, while in mathematical typography it is usually omitted.
  • We can use quote() in labs() to instruct ggplot2 to render the expression in a more human-customary way.

Labels with Formulas

data <- data.frame(x = seq(0, 50, .01))
ggplot(data, aes(x = x)) +
  geom_line(aes(y = sin(2*x) + exp(-x/10) * cos(x)), color = "red") +
  labs(x = quote(x), y = quote(sin(2 * x) + exp(-x/10) * cos(x)))

7.3 Annotations

  • Besides captions and labels, we can add text and markers directly into the main body of the plot.
  • These types of additions to a plot are called annotations.
  • An annotation is an additional piece of information that is added to a plot and facilitates the interpretation of its data elements.
  • In ggplot2, annotations and text can be added with annotate() and geom_text().

7.4 Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point()

  • We start with the usual geom_point() plot.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(aes(label = geo))

  • We pass the label = geo aesthetic to geom_text() to create a text object using country names.
  • However, this creates a text object for all data points.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo)
  )

  • We override the data argument of geom_text() to filter only the non-EU countries.
  • This looks more like what we want to achieve.
  • Still, the text and the points are overlapping.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo),
    hjust = "left",
    vjust = "top"
  )

  • We use hjust = "left" and vjust = "top" to anchor the top-left corner of the text at the data point, so the text extends to the right of and below the point.
  • Better, but some extra spacing could improve the aesthetics.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo),
    hjust = "left",
    vjust = "top",
    nudge_y = -0.1
  )

  • We use nudge_y = -0.1 to move the text slightly below its data point.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo),
    hjust = "left",
    vjust = "top",
    nudge_y = -0.1,
    size = 4
  )

  • Finally, we can adjust the text size with the size argument.

7.5 Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  )
`geom_smooth()` using formula = 'y ~ x'

  • Suppose we want to add the group labels (the values of EU) at the end of each regression line.
  • We can pick the maximum output value per group and use it as the x aesthetic in geom_text().

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  ) +
  geom_text(
    data = eu_ict |>
      dplyr::group_by(EU) |>
      dplyr::slice_max(output, n = 1)
  )
  • We use slice_max() to pick the maximum output value per group.
  • And override the data argument of geom_text() to use the sliced data.

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  ) +
  geom_text(
    data = eu_ict |>
      dplyr::group_by(EU) |>
      dplyr::slice_max(output, n = 1),
    aes(output, ict_percentage, label = EU)
  )
`geom_smooth()` using formula = 'y ~ x'

  • We pass the aesthetics we want to use in the mapping argument of geom_text().

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  ) +
  geom_text(
    data = eu_ict |>
      dplyr::group_by(EU) |>
      dplyr::slice_max(output, n = 1),
    aes(output, ict_percentage, label = EU),
    hjust = "left",
    vjust = "bottom",
    nudge_x = 1000
  )
`geom_smooth()` using formula = 'y ~ x'

  • And fine-tune the appearance of the text with hjust, vjust, and nudge_x.

7.6 Annotating

  • Another way to add text to a plot is with annotate().
  • In contrast to geom_text(), which is a geometric object, annotate() does not act on data points.
  • This means that annotate() does not require a data argument.
  • And it is more useful for adding small, data-independent elements to a plot.

7.7 Using annotate() for text

  • We want to add a label next to the richest EU country.
  • We start once more with a geom_point() scatter plot of the eu_ict’s income data.

Using annotate() for text

  • We want to add a label next to the richest EU country.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU))

Using annotate() for text

  • We want to add a label next to the richest EU country.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(geom = "label")
  • And add an annotate() of geometric type label.
  • On its own, this gives an error if executed.
  • We need to provide the label text for the annotation.
  • And specify the position for the label using the x and y arguments.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(geom = "label")
  • We can do some preliminary data transformations to find the richest EU country.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output,
    y = richest$ict_percentage
  )

  • And use the calculated richest data to pass more information to annotate().

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output,
    y = richest$ict_percentage,
    hjust = "right"
  )

  • We can adjust the horizontal alignment of the label text with hjust.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output,
    y = richest$ict_percentage,
    hjust = "right"
  )

  • The annotate() function does not have nudge_* arguments (why?).
  • We can directly adjust the x and y positions to move the label around.
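
Since annotate() receives plain numbers, moving the label is ordinary arithmetic on x and y. The sketch below uses made-up toy values. Its offset of 5% of the x range is a hypothetical choice, shown as one alternative to a hard-coded amount.

```r
library(ggplot2)

# Made-up values standing in for eu_ict (hypothetical data).
toy <- data.frame(
  geo = c("A", "B", "D"),
  output = c(30000, 52000, 41000),
  ict_percentage = c(3.1, 4.8, 4.0)
)
top <- toy[which.max(toy$output), ]

# Hypothetical choice: shift the label left by 5% of the x range,
# expressed in the units of the x axis.
x_offset <- 0.05 * diff(range(toy$output))

p <- ggplot(toy, aes(output, ict_percentage)) +
  geom_point() +
  annotate(
    geom = "label",
    label = top$geo,
    x = top$output - x_offset,
    y = top$ict_percentage,
    hjust = "right"
  )
```

An offset tied to the axis range keeps the label at a comparable visual distance if the data change, whereas a fixed number has to be re-tuned by hand.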

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 1000,
    y = richest$ict_percentage,
    hjust = "right"
  )

7.8 Using annotate() for segments

  • annotate() is also useful for adding segments and arrows to a plot.
  • The calling interface is mostly similar to annotate() for text.
  • Instead of geom = "label", we use geom = "segment" to create annotations with segments and arrows.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  )

  • We nudge the label of the richest country a bit more to the left.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(geom = "segment")
  • And add a new annotate() layer with geom = "segment" to create a segment.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(geom = "segment")
  • Specifying a segment on a plane is equivalent to specifying two points.
  • The annotate() function expects two points to draw the segment.
  • The points are specified by the x, y, xend, and yend arguments.
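
A minimal sketch of the two-point specification, assuming the built-in mtcars data and arbitrary endpoint coordinates in place of the eu_ict values:

```r
library(ggplot2)

# The segment runs from (x, y) to (xend, yend).
# The four coordinates are arbitrary illustration values.
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  annotate(geom = "segment", x = 4, y = 30, xend = 5, yend = 15)

# Inspect the single row of data that the annotation layer carries.
seg <- layer_data(p, 2)
```

Here seg holds one row, whose x, y, xend, and yend columns are the two endpoints given to annotate().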

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(
    geom = "segment",
    x = richest$output - 8000,
    xend = richest$output - 500,
    y = richest$ict_percentage,
    yend = richest$ict_percentage
  )

  • This draws a segment connecting the two points, but without an arrowhead.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(
    geom = "segment",
    x = richest$output - 8000,
    xend = richest$output - 500,
    y = richest$ict_percentage,
    yend = richest$ict_percentage,
    arrow = arrow(type = "closed", length = unit(0.3, "cm"))
  )

  • We can add an arrowhead to the segment using the arrow argument.
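
The value passed to the arrow argument is built with arrow(), a function from the grid package that ggplot2 re-exports together with unit(). The sketch below shows a few of its options on the built-in mtcars data. The coordinates and the object names open_head and closed_both are illustrative assumptions.

```r
library(ggplot2)

# Defaults: an open head at the last point (xend, yend), angle of 30 degrees.
open_head <- arrow()

# A narrower, filled head drawn at both ends of the segment.
closed_both <- arrow(
  type = "closed",
  ends = "both",
  angle = 20,
  length = unit(0.3, "cm")
)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  annotate(
    geom = "segment",
    x = 4, y = 30, xend = 5, yend = 15,
    arrow = closed_both
  )
```

With the default ends = "last", the head sits at (xend, yend), so swapping the two points reverses the direction of the arrow.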

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(
    geom = "segment",
    x = richest$output - 8000,
    xend = richest$output - 500,
    y = richest$ict_percentage,
    yend = richest$ict_percentage,
    arrow = arrow(type = "closed", length = unit(0.3, "cm")),
    color = "red"
  )

  • Finally, the color of the annotation can be adjusted with the color argument.