Coding & Data Visualization with Generative AI

Last Updated: May 29th, 2025

Everybody is a programmer

Should one learn how to program nowadays?

  • For those appreciating the craft… Yes
  • For those who want to be more productive… Until recently, undoubtedly yes.
  • What about today?

Everybody is a programmer

Figure 1: Everybody is a programmer (Jensen Huang; World Governments Summit, 2024).

Impact of Generative AI on Productivity

Cui et al. (2024) studied the impact of generative AI on productivity.

Who benefits the most?

Cui et al. (2024) also examined which groups of developers benefit the most from generative AI.

How should we interpret the results?

  • Are pull requests, commits, and builds accurate proxies for productivity?
  • Why do senior developers have less pronounced effects?
  • What does this imply for people learning to code?

Preamble

Base R

  1. Download and install Base R.

Visual Studio Code

  1. Download and install Visual Studio Code.

Install R extension


Install languageserver


Install Copilot extension


Create course folder codv

Create source file overview.R

Install tidyverse

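The preamble steps above can be collapsed into a single command; a sketch to run once from an R console (package names as in the slides, internet access assumed):

```r
# Installs the language server used by the VS Code R extension
# and the tidyverse packages (run once; requires internet access).
install.packages(c("languageserver", "tidyverse"))
```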

An introduction to data visualization

A Jump Start

  • We use a prepared dataset so that we can skip directly to the visualization part.
eu_ict <- readRDS("data/eu_ict.rds")

A Jump Start

  • In R, datasets are stored as data frames. A data frame is a rectangular arrangement of data. The columns correspond to data variables, and the rows correspond to observations.
  • In this example, we use an enhanced type of data frame called a tibble. Tibbles are a modern re-implementation of data frames that comes with the tidyverse ecosystem.
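As a minimal sketch (with made-up values rather than the real eu_ict data), a tibble can be built directly with the tibble() function:

```r
library(tibble)

# A tiny hypothetical data frame: two observations (rows)
# of three variables (columns). Values are illustrative only.
tbl <- tibble(
  geo    = c("Austria", "Norway"),
  EU     = c("EU", "non-EU"),
  output = c(37860, 74160)
)
print(tbl)  # a tibble prints its dimensions and column types
```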

A Jump Start

  • The eu_ict data frame contains GDP values and occupation percentages in the ICT sector in 32 EU and non-EU countries for the year 2023.

A Jump Start

  • A variable (column in the data frame) is a characteristic of an object or entity that can be measured.
  • An observation (row in the data frame) is a set of measurements for an object or entity collected under similar conditions.
  • A value (cell in the data frame) is an instance of a variable for a particular observation.

A first scatter plot

  • We use ggplot2 to visualize the data.
  • The ggplot2 package is a flexible plotting system for R based on the grammar of graphics.
library(ggplot2)

An empty canvas

  • We start by creating an empty canvas for the visualization.
ggplot(data = eu_ict)

Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)
  • We specify the axes via the mapping argument.
  • Defining the mapping uses the aes() function.
  • The name aes stands for aesthetics.

Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)
  • The name aes stands for aesthetics.
  • Aesthetics are visual properties (lines, curves, shapes, colors, etc.) of the visualization.

Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)
  • The name aes stands for aesthetics.
  • The x and y arguments specify the variables to be plotted on the horizontal and vertical axes, respectively.

Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)

Adding a layer

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  • Doing so requires adding a layer to the canvas.
  • Adding a layer is done via the + operator.

Adding a layer

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  • This code snippet, however, is incomplete.

Adding a layer

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  • Executing it in an R terminal changes the prompt from > to +.
> ggplot(
+   data = eu_ict,
+   mapping = aes(x = output, y = ict_percentage)
+ ) +

  • This indicates that the R interpreter is waiting for more input.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • There are many functions in ggplot2, starting with geom_, that add different types of layers.
  • For example, geom_line(), geom_bar(), geom_boxplot(), etc.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • The geom prefix stands for geometric object.
  • A geometric object is a visual representation of (a subset of) the data.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • The geom prefix stands for geometric object.
  • The point suffix specifies we want to represent the data as points.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • The pattern is similar for other functions in the geom_ family.
  • For example, geom_line gives geometric representations of the data as lines.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()

Coloring

  • We aim to colorize the points based on the EU membership.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point()
  • The EU variable of the eu_ict data frame is a categorical variable.

Programming digression: factor variables

  • The EU variable of the eu_ict data frame is a categorical variable.
  • Categorical variables in R are stored as factors.
print(eu_ict, n = 3)
# A tibble: 32 × 5
  geo      EU    ict_percentage income output
  <fct>    <fct>          <dbl> <fct>   <dbl>
1 Austria  EU               5.3 high    37860
2 Belgium  EU               5.4 high    37310
3 Bulgaria EU               4.3 low      7900
# ℹ 29 more rows

Programming digression: factor variables

  • The EU variable of the eu_ict data frame is a categorical variable.
  • Categorical variables in R are stored as factors.
  • Factor variables have levels that represent the different categories.
  • The EU factor variable has two levels: EU and non-EU.
  • We can colorize the points based on these levels.
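A minimal sketch of factors, using hypothetical values rather than the full dataset:

```r
# A character vector converted to a factor with two explicit levels.
membership <- factor(c("EU", "EU", "non-EU"), levels = c("EU", "non-EU"))

levels(membership)  # the categories: "EU" and "non-EU"
table(membership)   # counts per level
```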

Coloring the points

  • We aim to colorize the points based on the EU membership.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point()

Adding curves

  • We aim to add linearly fitted lines to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point() +
  geom_smooth(method = "lm")
  • We add another layer using the + operator.
  • We use geom_smooth to add fitted lines.

Adding curves

  • We aim to add linearly fitted lines to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point() +
  geom_smooth(method = "lm")
  • We use geom_smooth to add fitted lines.
  • geom_smooth() can add several types of fitted curves.
  • The method argument specifies the type of fit.
  • In this case, method = "lm" requests a linear-model fit.

Adding curves

  • We aim to add linearly fitted lines to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point() +
  geom_smooth(method = "lm")

Adding a single curve

  • We aim to add a single linearly fitted line to the scatter plot.
  • The last code chunk respected the coloring aesthetics we defined earlier.
  • What if we want to add a single fitted line to the scatter plot?

Adding a single curve

  • We aim to add a single linearly fitted line to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU)) +
  geom_smooth(method = "lm")
  • Aesthetics need not be defined globally for all layers.
  • We can define them locally for each layer.
  • We can use aes() to define aesthetics for each geometric object.

Adding a single curve

  • We aim to add a single linearly fitted line to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU)) +
  geom_smooth(method = "lm")

Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
  • On some occasions, differentiating points based on colors might not be enough (e.g., printing in black and white).

Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
  • Instead, we can use different shapes to represent different categories.
  • Or, we can combine both color and shape aesthetics.

Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm")

Adding titles and labels

  • We aim to display some text information on the plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  )
  • We use the labs() function to add titles and labels.
  • We can additionally modify the legend via labs().

Adding titles and labels

  • We aim to display some text information on the plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  )


Color scaling

  • We aim to recolor the figure in grayscale.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  ) +
  scale_color_grey()
  • We can use the scale_color_grey() function.
  • There are other color scale functions available, such as scale_color_brewer() and scale_color_continuous(); scale_color_colorblind() comes from the ggthemes package.

Color scaling

  • We aim to recolor the figure in grayscale.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  ) +
  scale_color_grey()

  • The scale_color_grey() function did not recolor the fitted line (why?).
  • We can manually set the color of the fitted line.

Color scaling

  • We aim to recolor the figure in grayscale.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm", color = "darkgray") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  ) +
  scale_color_grey()

Visualizing empirical distributions

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.

Visualizing empirical distributions discretely

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.
  • Histograms and bar plots visualize empirical distributions as a series of bars.
  • Each bar represents a bin of data points.
  • The height of the bar represents the frequency of the data points in the corresponding bin.

Visualizing empirical distributions continuously

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.
  • Density plots visualize empirical distributions as a continuous curve.
  • The curve is an estimate of the population’s probability density function from which the sample was drawn.

Bars or densities?

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.
  • Bars can be used with both continuous (histogram) and categorical (bar plot) variables.
  • Density plots are more suitable for continuous variables.

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()
  • We inform ggplot2 that we want to use the income variable via aes().
  • Since we only use one variable in the bar plot, we do not need to specify both the x and y aesthetics.

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()
  • We inform ggplot2 that we wish to create a bar plot using the geom_bar() function.

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()

  • The income variable is categorical and has three levels: low, middle, and high.
  • The levels are depicted on the axis used in the aes() call.
  • The other axis depicts the number of observations in each level.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram()
  • We use the function geom_histogram() to create a histogram of a continuous variable.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram()
  • In contrast to categorical variables, for which the number of bins is determined by the number of levels, histograms of continuous variables can be created with different numbers of bins.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram()
  • By default, the geom_histogram() function uses 30 bins.
  • This might not be what we want.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram(bins = 5)
  • We can use the bins argument of the geom_histogram() function to change the default behavior.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram(bins = 5)

Density plots

  • An alternative way to visualize the distribution of a continuous variable is to use a density plot.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_density()
  • We use the function geom_density() to create a density plot.

Density plots

  • An alternative way to visualize the distribution of a continuous variable is to use a density plot.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_density()

Density plots

  • Unlike the histogram case, the vertical axis of a density plot does not measure frequencies of observations.
  • Very roughly, one can think of a density plot as a smoothed-out histogram with very narrow bins, rescaled so that the total area under the curve equals 1.
  • The vertical axis shows densities, not probabilities, so its values can be greater than 1.
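This can be checked numerically with base R's density() function; in this sketch with simulated data, the Riemann sum of the estimated curve is approximately 1:

```r
set.seed(1)
d <- density(rnorm(100))           # kernel density estimate on an equally spaced grid
area <- sum(d$y) * diff(d$x[1:2])  # approximate area under the curve
area                               # close to 1
```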

Ordering bar plot levels

  • The bar plot of the income variable we created earlier displays the levels in the order they appear in the data.
  • In some cases, we may want the bars ordered by their frequency.

Ordering bar plot levels

  • We can change the order in which the levels are displayed using the function fct_infreq.
  • The fct_infreq function reorders the levels of a factor based on their frequency.
  • Other options are fct_inorder, which orders levels in the order they appear in the data, and fct_inseq, which orders levels by their numeric values.
  • The fct_ family of functions is part of the forcats package (part of the tidyverse).

Ordering bar plot levels

  • We need to load the forcats package to use these functions.
library(forcats)
ggplot(data = eu_ict, mapping = aes(x = fct_infreq(income))) +
  geom_bar()

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
  • How can we achieve this?
  • Creating a bar plot with the EU categorical variable does not provide income information.

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU)) +
  geom_bar()

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU)) +
  geom_bar()
  • We can ask ggplot2 to color the bars by the income variable.

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()
  • Setting the fill aesthetic to income splits each bar into colored segments according to the number of countries in each income level.

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()

Coloring bar plots

  • Let us plot the distribution of income-level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()
  • One issue with the last plot is that the split of countries by EU membership is very unbalanced.

Coloring bar plots

  • Let us plot the distribution of income-level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()
  • There are many more EU countries than non-EU countries in the dataset, making comparisons challenging.

Coloring bar plots

  • Let us plot the distribution of income-level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()
  • We can ask ggplot2 to normalize the bar heights and color by the income shares within each EU membership category.

Coloring bar plots

  • Let us plot the distribution of income-level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar(position = "fill")
  • We can achieve this by setting the position argument of the geom_bar() function to fill.

Coloring bar plots

  • We would like to plot the distribution of income-level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar(position = "fill")

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
  • We can plot the density of ict_percentage using the geom_density() function.
  • One idea is to instruct ggplot2 to colorize based on income using the color aesthetic.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income
  )
) +
  geom_density()

  • We can further highlight the plot’s densities by using a fill aesthetic instead of or alongside color.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income,
    fill = income
  )
) +
  geom_density()

  • We can make the plot easier to read by using a fill aesthetic instead of or alongside color.
  • However, with overlapping densities, it is difficult to distinguish the density shape of each group.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income,
    fill = income
  )
) +
  geom_density()

  • We can pass an alpha (transparency) value to the geom_density() function to make the plot more readable.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income,
    fill = income
  )
) +
  geom_density(alpha = 0.5)

  • We can pass an alpha (transparency) value to the geom_density() function to make the plot more readable.
  • The alpha value ranges from 0 (completely transparent) to 1 (completely opaque).

A first box plot

  • We would like to concisely visualize the basic statistics of ict_percentage per income group.
  • The median, first, and third quartiles are usual statistics of interest.

Box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
  • The box plot is a visualization method that demonstrates the location, spread, skewness, and outliers of a variable.
  • Compared to density plots, box plots explicitly display the median, quartiles, and outliers of a variable.

Box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
  • The first quartile (or the 25th percentile) is the value below which 25% of the data falls.
  • The second quartile (or the median) is the value below which 50% of the data falls.
  • The third quartile (or the 75th percentile) is the value below which 75% of the data falls.
  • The difference between the first and third quartiles is called the interquartile range (IQR).
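These statistics can be computed directly in R; a sketch with a small made-up sample:

```r
x <- c(2, 4, 4, 5, 7, 9, 30)  # hypothetical sample with one large value

quantile(x, probs = c(0.25, 0.5, 0.75))  # quartiles: 4, 5, 8 (default method)
IQR(x)                                   # interquartile range: 8 - 4 = 4
```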

Box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
  • An outlier is a value that is very different from the remaining values of a variable.
  • There are different thresholds for defining outliers, e.g., values outside the 1st and 99th percentiles, but no single convention is universally applied.
  • By default, box plots in ggplot2 define outliers as values that are more than 1.5 times the IQR below the first quartile or above the third quartile.
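The default rule can be reproduced by hand. Continuing with a hypothetical sample, values outside the whisker fences are the ones geom_boxplot() would draw as individual points:

```r
x <- c(2, 4, 4, 5, 7, 9, 30)  # hypothetical sample

q     <- quantile(x, probs = c(0.25, 0.75))
lower <- q[[1]] - 1.5 * IQR(x)  # 4 - 1.5 * 4 = -2
upper <- q[[2]] + 1.5 * IQR(x)  # 8 + 1.5 * 4 = 14
x[x < lower | x > upper]        # 30 falls outside the fences
```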

Visualizing box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
ggplot(data = eu_ict, mapping = aes(x = income, y = ict_percentage)) +
  geom_boxplot()
  • Creating a box plot in ggplot2 uses the geom_boxplot function.
  • We need to specify the x and y aesthetics to define the variables to be plotted.

Visualizing box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
ggplot(data = eu_ict, mapping = aes(x = income, y = ict_percentage)) +
  geom_boxplot()

Visualizing box plots

  • The median is displayed as a solid horizontal line inside each box.
  • The first and third quartiles are displayed as the lower and upper edges of each box.
  • Whiskers indicate the range of non-outlier values.
  • Outliers are displayed as individual points.
  • Skewness is indicated by the position of the median relative to the quartiles.

Ekphrasis

Ekphrasis is a vivid verbal description of, or meditation upon, a non-verbal work of art, real or imagined, usually a painting or sculpture (Baldick 2008).

The limits of ekphrasis

  • In some sense, data science activities are a form of art.
  • A good statistic and its visualization can convey a message and reveal a pattern more effectively than an endless stream of numbers and text.
  • In most ways, however, data science is very different from (at least surreal forms of) art.
  • It has to be grounded and faithful to the data it represents.
  • Unfortunately, governments, academics, and companies sometimes take more poetic license than is advisable.

The Art of Living (René Magritte, 1967)

OxyContin

  • OxyContin is an opioid-based pain medicine introduced by Purdue Pharma in 1996.
  • Purdue Pharma aggressively marketed OxyContin and the use of opioids in the treatment of pain.
  • For example, Purdue Pharma spent $200 million in promotion activities for OxyContin in 2001.

Promotion of OxyContin

  • Van Zee (2009) studied the promotion of OxyContin and its public health implications.
  • Among other activities, Purdue Pharma organized more than 40 conferences for pain management in the US from 1996 to 2001.
  • More than 5000 physicians participated in the conferences.
  • Their expenses were covered by Purdue Pharma.

Promotion of OxyContin

  • Van Zee (2009) studied the promotion of OxyContin and its public health implications.
  • Other marketing activities used data science techniques to influence physicians’ prescribing behavior.
  • Physicians were profiled, and high- and low-prescribers were identified.
  • A bonus system designed to incentivize sales representatives to focus on high-prescribers was introduced.

Commercial Triumph

  • Purdue’s promotion was very successful.
  • OxyContin prescriptions for non-cancer-related pain rose from 670 thousand in 1997 to 6.2 million in 2002 (almost 10x).

Public Health Tragedy

  • However, opioids are known to have a series of serious side effects

    • such as “respiratory depression, sedation, constipation, and nausea; inconsistent improvement in functioning; opioid-induced hyperalgesia; adverse hormonal and immune effects of long-term opioid treatment; a high incidence of prescription opioid abuse behaviors;” (Van Zee 2009)

Misrepresenting the addiction risk

  • In addition, opioids are highly addictive.
  • And a common theme of Purdue’s promotion activities was to soft-pedal the addiction risk for the treatment of chronic non-cancer-related pain.


The consequences

  • The number of drug-overdose deaths from prescription and illicit opioids rose dramatically in the US from 1999 to 2020.
  • A series of lawsuits against Purdue Pharma followed.
  • In 2007, Purdue paid a $634.5 million fine for an OxyContin-related case.
  • It filed for bankruptcy in 2019.
  • In 2023, Purdue Pharma announced its rebranding as Knoa Pharma.

An introduction to data transformation

Behind the jump start

  • In this section, we will construct a part of the eu_ict dataset from scratch.
  • During this process, we will introduce the basic concepts of data transformation in R.

Transforming data

  • Data transformation is the process of converting data from one format or structure into another.
  • We will use the dplyr and tidyr packages to perform data transformations.
library(dplyr)
library(tidyr)

Transformation types

  • The dplyr package provides three main types of data transformations:
    1. Row-wise transformations apply a function to each row of one or more columns.
    2. Column-wise transformations apply a function to one or more columns of a data frame.
    3. Group-wise transformations apply a function to groups (subsets) of rows.
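A sketch of the three transformation types on a small hypothetical tibble (not the course data):

```r
library(dplyr)

df <- tibble(
  geo    = c("A", "A", "B"),
  year   = c(2022, 2023, 2022),
  output = c(10, 12, 30)
)

filter(df, year == 2022)              # row-wise: keep rows satisfying a condition
mutate(df, output_k = output / 1000)  # column-wise: derive a new column
df |>
  group_by(geo) |>
  summarise(mean_output = mean(output))  # group-wise: one summary row per group
```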

Transformation types

  • The tidyr package provides two main types of data transformations:
    1. Widening transforms data from long (many rows, few columns) to wide (few rows, many columns) format.
    2. Lengthening transforms data from wide to long format.
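A sketch of widening and lengthening with tidyr, again on a hypothetical tibble:

```r
library(tidyr)

long <- dplyr::tibble(
  geo    = c("A", "A", "B", "B"),
  year   = c(2022, 2023, 2022, 2023),
  output = c(10, 12, 30, 33)
)

# Widening: one row per country, one column per year.
wide <- pivot_wider(long, names_from = year, values_from = output)

# Lengthening: back to one row per country-year pair.
pivot_longer(wide, cols = -geo, names_to = "year", values_to = "output")
```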

Original data sources

  • We will use the readr package to read the data from the original sources.
  • We leave the details of data import operations for a later stage.
library(readr)
sdg <- read_csv("data/estat_sdg_08_10_en.csv")
isoc <- read_csv("data/estat_isoc_sks_itspt_en.csv")
lfsa <- read_csv("data/estat_lfsa_egan_en.csv")

A first look at the GDP data

  • We inspect the sdg data frame.
sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • The output informs us that sdg is stored as a tibble.

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • It has 1,845 rows and 9 columns.

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • The first eight columns are DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, TIME_PERIOD, OBS_VALUE.

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • Their corresponding types are character, character, character, character, character, character, numeric, numeric.

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • The output displays the first 10 rows of the data frame.

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • The rows are enumerated starting from 1.

Reading tibbles from standard output

sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
  • Finally, we are informed that the data frame has 1 more variable, OBS_FLAG, with type character.

The GDP dataset variables

  • We can learn more about the dataset’s variables by visiting the Eurostat website, using the DOI of (Eurostat 2025c).
  • The variables unit, geo, TIME_PERIOD, OBS_VALUE, and OBS_FLAG contain information of interest.
  • The variables DATAFLOW, LAST UPDATE, freq, and na_item contain metadata and have no variation within the dataset.

Finding distinct values

  • The variables DATAFLOW, LAST UPDATE, freq, and na_item contain metadata and have no variation within the dataset.
  • We can verify this using the dplyr function distinct(), and, at the same time, perform our first data transformation operation.
distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

Finding distinct values

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • The function distinct() returns the rows of the data frame with unique combinations of the specified columns.
  • The first argument of distinct() is the data frame.
  • The remaining arguments are the columns for which we want to find distinct values.

Finding distinct values

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • We can verify that there is only one unique combination of DATAFLOW, LAST UPDATE, freq, and na_item values in the sdg dataset.

A closer look at distinct()’s result

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • The result of distinct() contains only the specified columns.
  • We can instruct distinct() to keep all columns by using the .keep_all argument.
  • Then, distinct() returns the first row of each unique combination of the specified columns.

A closer look at distinct()’s result

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item, .keep_all = TRUE)
# A tibble: 1 × 9
  DATAFLOW         `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
  <chr>            <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
1 ESTAT:SDG_08_10… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
# ℹ 1 more variable: OBS_FLAG <chr>
  • We can instruct distinct() to keep all columns by using the .keep_all argument.
  • Then, distinct() returns the first row of each unique combination of the specified columns.

A closer look at distinct()’s arguments

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • In the distinct() call, the LAST UPDATE column is enclosed in backticks.
  • This is because the LAST UPDATE column name contains a space.
  • Ordinary (syntactic) variable names in R cannot contain spaces.
  • While such names are commonly found in datasets in the wild, they are non-syntactic names in R.

A closer look at distinct()’s arguments

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • In the distinct() call, the LAST UPDATE column is enclosed in backticks.
  • We enclose non-syntactic names in backticks to use them in dplyr calls or other R operations.

Programming digression: the pipe operator

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • We have called distinct() using the usual parenthesized notation.
  • I.e., for any function \(f\) and arguments \(x_{1}\), \(x_{2}\), \(\ldots\), \(x_{n}\), we write \(f(x_{1}, x_{2}, \ldots, x_{n})\).

Programming digression: the pipe operator

sdg |> distinct(DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • However, the most common way to call dplyr functions is by using the pipe operator |>.
  • I.e., for any function \(f\) and arguments \(x_{1}\), \(x_{2}\), \(\ldots\), \(x_{n}\), we write \(x_{1} |> f(x_{2}, \ldots, x_{n})\).

Programming digression: the pipe operator

sdg |> distinct(DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • The pipe operator takes the output of the left-hand side operation and passes it as the first argument of the right-hand side call.

Programming digression: the pipe operator

sdg |>
  distinct(DATAFLOW, `LAST UPDATE`, freq, na_item)
# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…
  • The pipe operator takes the output of the left-hand side operation and passes it as the first argument of the right-hand side call.
  • It is customary to insert a newline after the pipe operator to improve readability.

Programming digression: the pipe operator

  • The pipe operator shines when we chain multiple operations.
function1(
  function2(
    function3(
      function4(
        data,
        arg4_1,
        arg4_2
      ),
      arg3_1
    ),
    arg2_1,
    arg2_2,
    arg2_3
  ),
  arg1_1
)
data |>
  function4(arg4_1, arg4_2) |>
  function3(arg3_1) |>
  function2(arg2_1, arg2_2, arg2_3) |>
  function1(arg1_1)
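A concrete base R illustration of the same pattern, chaining three built-in functions:

```r
# Nested form: read inside-out
round(sum(sqrt(c(4, 9, 16))), 1)
# [1] 9

# Piped form: read left-to-right
c(4, 9, 16) |>
  sqrt() |>
  sum() |>
  round(1)
# [1] 9
```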

Selecting variables of interest

  • We have established that variables DATAFLOW, LAST UPDATE, freq, and na_item are not very relevant for our analysis.
  • We want to focus on the remaining variables.
  • We can use the select() function to keep only the variables we are interested in.

Selecting variables of interest

  • We can use the select() function to keep only the variables we are interested in.
  • There are multiple ways to achieve the desired result.

Explicit selection

  • First, we can explicitly list the variables we want to keep.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG)
# A tibble: 1,845 × 5
   unit                                     geo   TIME_PERIOD OBS_VALUE OBS_FLAG
   <chr>                                    <chr>       <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per c… Alba…        2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per c… Alba…        2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per c… Alba…        2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per c… Alba…        2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per c… Alba…        2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per c… Alba…        2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per c… Alba…        2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per c… Alba…        2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per c… Alba…        2008      2850 d       
10 Chain linked volumes (2010), euro per c… Alba…        2009      2960 <NA>    
# ℹ 1,835 more rows

Vector selection

  • Second, we can concatenate the variables we wish to keep in a vector and pass it to the select() function.
  • In R, we concatenate objects using the c() function.
  • For example:
c(1, 2, 3, 4)
[1] 1 2 3 4
c(1.2, 10 / 3, pi)
[1] 1.200000 3.333333 3.141593
c("a", "b", "c")
[1] "a" "b" "c"

Vector selection

  • Second, we can concatenate the variables we wish to keep in a vector and pass it to the select() function.
sdg |>
  select(c(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG))
# A tibble: 1,845 × 5
   unit                                     geo   TIME_PERIOD OBS_VALUE OBS_FLAG
   <chr>                                    <chr>       <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per c… Alba…        2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per c… Alba…        2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per c… Alba…        2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per c… Alba…        2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per c… Alba…        2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per c… Alba…        2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per c… Alba…        2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per c… Alba…        2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per c… Alba…        2008      2850 d       
10 Chain linked volumes (2010), euro per c… Alba…        2009      2960 <NA>    
# ℹ 1,835 more rows

Vector selection

  • Second, we can concatenate the variables we wish to keep in a vector and pass it to the select() function.
sdg |>
  select(c(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG))
  • When we are merely listing variables to keep, this approach adds little, since wrapping them in c() is superfluous.
  • However, it is useful when we intend to exclude variables.

Selection via exclusion

  • Third, we can exclude the variables we would like to drop.
sdg |>
  select(!c(DATAFLOW, `LAST UPDATE`, freq, na_item))
# A tibble: 1,845 × 5
   unit                                     geo   TIME_PERIOD OBS_VALUE OBS_FLAG
   <chr>                                    <chr>       <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per c… Alba…        2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per c… Alba…        2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per c… Alba…        2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per c… Alba…        2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per c… Alba…        2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per c… Alba…        2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per c… Alba…        2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per c… Alba…        2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per c… Alba…        2008      2850 d       
10 Chain linked volumes (2010), euro per c… Alba…        2009      2960 <NA>    
# ℹ 1,835 more rows

A closer look at select()’s results

  • The select() function operates on the columns of a data frame.
  • It returns a new data frame with only the columns specified in the function call.
  • It does not change the rows of the data frame.
  • We can verify this using the nrow() function, which returns the number of rows in a data frame.
nrow(sdg)
[1] 1845
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  nrow()
[1] 1845

Renaming variables

  • The variable TIME_PERIOD is among the variables we selected.
  • The values of this variable indicate the year in which the GDP values of an observation were recorded.
  • How can we rename this variable to year?

Using rename()

  • The rename() function of package dplyr can be used to rename variables.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD)
  • We use a keyword argument to specify the operation we want to perform.
  • The new name of the variable is on the left-hand side of =.
  • The old name of the variable is on the right-hand side of =.

Using rename()

  • The rename() function of package dplyr can be used to rename variables.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD)
# A tibble: 1,845 × 5
   unit                                         geo      year OBS_VALUE OBS_FLAG
   <chr>                                        <chr>   <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per capita Albania  2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per capita Albania  2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per capita Albania  2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per capita Albania  2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per capita Albania  2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per capita Albania  2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per capita Albania  2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per capita Albania  2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per capita Albania  2008      2850 d       
10 Chain linked volumes (2010), euro per capita Albania  2009      2960 <NA>    
# ℹ 1,835 more rows

Using rename()

  • We can use rename() to rename multiple variables at once.
  • For example, we can simultaneously rename TIME_PERIOD and OBS_FLAG to year and flag, respectively.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG)

Using rename()

  • We can use rename() to rename multiple variables at once.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG)
# A tibble: 1,845 × 5
   unit                                         geo      year OBS_VALUE flag 
   <chr>                                        <chr>   <dbl>     <dbl> <chr>
 1 Chain linked volumes (2010), euro per capita Albania  2000      1700 <NA> 
 2 Chain linked volumes (2010), euro per capita Albania  2001      1850 <NA> 
 3 Chain linked volumes (2010), euro per capita Albania  2002      1940 <NA> 
 4 Chain linked volumes (2010), euro per capita Albania  2003      2060 <NA> 
 5 Chain linked volumes (2010), euro per capita Albania  2004      2180 <NA> 
 6 Chain linked volumes (2010), euro per capita Albania  2005      2310 <NA> 
 7 Chain linked volumes (2010), euro per capita Albania  2006      2460 <NA> 
 8 Chain linked volumes (2010), euro per capita Albania  2007      2630 <NA> 
 9 Chain linked volumes (2010), euro per capita Albania  2008      2850 d    
10 Chain linked volumes (2010), euro per capita Albania  2009      2960 <NA> 
# ℹ 1,835 more rows

Mutating variables

  • If we examine the unit variable in the resulting data frame, we observe it takes two distinct values.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  distinct(unit)
# A tibble: 2 × 1
  unit                                                                  
  <chr>                                                                 
1 Chain linked volumes (2010), euro per capita                          
2 Chain linked volumes, percentage change on previous period, per capita
  • These values are very informative, but they might be a bit verbose for our analysis.

Using mutate()

  • The dplyr’s mutate() function allows us to modify existing variables or create new variables based on the existing ones.

  • We can use mutate() by passing one or more keyword arguments that specify the desired transformations.

Specifying a mutate() transformation

  • In our case, we want to use mutate() to map the values of the unit variable such that:
    • “Chain linked volumes, percentage change on previous period, per capita” is mapped to “growth”, and
    • “Chain linked volumes (2010), euro per capita” is mapped to “output”.

Specifying a mutate() transformation

  • Equivalently, we want to use mutate() to map the values of the unit variable such that for each value \(x\):
    • \(f(x)\) = “growth” if x = “Chain linked volumes, percentage change on previous period, per capita”, and
    • \(f(x)\) = “output” if x = “Chain linked volumes (2010), euro per capita”.

Specifying a mutate() transformation

  • Equivalently, we want to use mutate() to map the values of the unit variable such that for each value \(x\):
    • \(f(x)\) = “growth” if x contains the word “percentage”, and
    • \(f(x)\) = “output” otherwise.

Specifying a mutate() transformation

  • In pseudocode, we can procedurally describe our transformation as follows:
if x contains "percentage":
  return "growth"
else:
  return "output"
  • Conveniently, with some preparation, we can write this transformation as a one-liner in R!

Programming digression: using grepl()

  • We can use the (obscurely named?) base R function grepl() to check if a value contains the word “percentage”.
  • This is a powerful function that can be used in many ways, but for now:
    • The function receives an argument pattern that specifies the word we are looking for, and
    • an argument x that specifies the value we are examining.
    • It returns TRUE if the pattern is found, and FALSE otherwise.

Programming digression: using grepl()

  • For example:
grepl(pattern = "percentage", x = "does not have it")
[1] FALSE
grepl(pattern = "percentage", x = "has percentage")
[1] TRUE
  • If one remembers the correct order of the arguments, their keywords can be omitted, i.e.,
grepl("percentage", "does not have it")
[1] FALSE
grepl("percentage", "has percentage")
[1] TRUE

Programming digression: using grepl()

  • With grepl(), we can rewrite our pseudocode in syntactically valid R code:
if (grepl("percentage", x)) {
  "growth"
} else {
  "output"
}
  • However, the last code snippet contains an if-else control structure, which is not vectorized and therefore cannot be used directly inside mutate().

Programming digression: using ifelse()

  • We can use the base R function ifelse(), which provides a vectorized, functional form of the if-else control structure.
  • The function receives three arguments:
    • The first argument is a logical value.
    • The second argument is the value to return if the logical value is TRUE.
    • The third argument is the value to return if the logical value is FALSE.

Programming digression: using ifelse()

  • For example:
ifelse(
  grepl("percentage", "does not have it"),
  "Found it!",
  "Nope"
)
[1] "Nope"
ifelse(
  grepl("percentage", "has percentage"),
  "Found it!",
  "Nope"
)
[1] "Found it!"

Programming digression: using ifelse()

  • Combining grepl() and ifelse(), we can rewrite our pseudocode as a single line of syntactically valid R code:
ifelse(grepl("percentage", x), "growth", "output")
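A detail worth noting before we move on: both grepl() and ifelse() are vectorized, i.e., they operate on whole vectors at once, which is exactly what mutate() needs when transforming a column. A small sketch using the two unit values we saw earlier:

```r
units <- c(
  "Chain linked volumes (2010), euro per capita",
  "Chain linked volumes, percentage change on previous period, per capita"
)

# Both functions process the whole vector in one call
grepl("percentage", units)
# [1] FALSE  TRUE
ifelse(grepl("percentage", units), "growth", "output")
# [1] "output" "growth"
```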

A first mutate() transformation

  • We are now ready to use mutate() to simplify the values of the unit variable.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output"))
  • The mutate() function receives a keyword argument that specifies the transformation we want to perform.
  • Similarly to rename(), the name of the new or modified variable is on the left-hand side of =.
  • The transformation is specified on the right-hand side of = and can use one or more existing variables.

A first mutate() transformation

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output"))
# A tibble: 1,845 × 5
   unit   geo      year OBS_VALUE flag 
   <chr>  <chr>   <dbl>     <dbl> <chr>
 1 output Albania  2000      1700 <NA> 
 2 output Albania  2001      1850 <NA> 
 3 output Albania  2002      1940 <NA> 
 4 output Albania  2003      2060 <NA> 
 5 output Albania  2004      2180 <NA> 
 6 output Albania  2005      2310 <NA> 
 7 output Albania  2006      2460 <NA> 
 8 output Albania  2007      2630 <NA> 
 9 output Albania  2008      2850 d    
10 output Albania  2009      2960 <NA> 
# ℹ 1,835 more rows

Filtering data

  • A filtering operation is a transformation that removes rows from a dataset based on a logical condition.

Filtering data

  • The so-far transformed data frame contains observations for multiple years.
  • We can use the base R function range() to find the range of years in the dataset.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  distinct(year) |>
  range()
[1] 2000 2023
  • If we want to focus only on the latest year for analysis, how can we remove the observations of the other years?

Using filter()

  • The filter() function of the dplyr package can be used to filter rows.
  • The function takes a logical condition as an argument.
  • The condition is evaluated for each row in the dataset.
    • If the condition is TRUE, the row is kept.
    • Otherwise, the row is filtered out.

Programming digression: numerical logical conditions in R

  • Basic numerical logical conditions in R can be specified using the following operators:
    • == (equal to)
    • != (not equal to)
    • > (greater than)
    • >= (greater than or equal to)
    • < (less than)
    • <= (less than or equal to)
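Each of these operators returns a logical value and, like most base R operators, works element-wise on vectors:

```r
2023 == 2023
# [1] TRUE
2022 != 2023
# [1] TRUE

# Element-wise comparison over a vector of years
c(2021, 2022, 2023) >= 2022
# [1] FALSE  TRUE  TRUE
```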

Specifying a filtering condition

  • In our case, we want to keep only the rows where the year variable is equal to the maximum year in the dataset.
  • Namely, we aim to keep only the observations with year equal to 2023.
  • Or syntactically, we aim to keep rows that satisfy the condition year == 2023.
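A base R analogue makes the mechanics concrete: the condition yields a logical vector, and only the positions where it is TRUE survive (toy year values for illustration):

```r
year <- c(2021, 2022, 2023, 2023)

# The condition is evaluated for every element
keep <- year == 2023
keep
# [1] FALSE FALSE  TRUE  TRUE

# Logical subsetting keeps only the TRUE positions
year[keep]
# [1] 2023 2023
```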

A first filtering operation

  • Thus, we can filter out observations with years apart from 2023 as follows:
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023)
# A tibble: 72 × 5
   unit   geo                                    year OBS_VALUE flag 
   <chr>  <chr>                                 <dbl>     <dbl> <chr>
 1 output Austria                                2023     37860 <NA> 
 2 output Belgium                                2023     37310 p    
 3 output Bulgaria                               2023      7900 <NA> 
 4 output Switzerland                            2023     63870 p    
 5 output Cyprus                                 2023     29080 p    
 6 output Czechia                                2023     18480 <NA> 
 7 output Germany                                2023     36290 p    
 8 output Denmark                                2023     52510 <NA> 
 9 output Euro area - 19 countries  (2015-2022)  2023     32340 <NA> 
10 output Euro area – 20 countries (from 2023)   2023     32150 <NA> 
# ℹ 62 more rows

Arranging data

  • Arranging transformations are row-wise operations that change the order of rows in a dataset based on the values of one or more columns.

An arranging example

  • Suppose we want to arrange the transformed dataset so that observations appear in ascending (lexicographic) order with respect to the country names.
  • We can use the arrange() function from the dplyr package.

Using arrange()

  • The arrange() function expects one or more column names as arguments.
  • First, it sorts the dataset by the first provided column.
  • If there are ties, it sorts by the second column, and so on.
  • By default, observations are sorted in ascending order.
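If descending order is needed instead, dplyr provides the desc() helper. A minimal sketch with a toy tibble (not part of the course data):

```r
library(dplyr)

# Sort a character column in descending (reverse lexicographic) order
df <- tibble(geo = c("Austria", "Belgium", "Croatia"))
df |> arrange(desc(geo))  # Croatia, Belgium, Austria
```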

Using arrange()

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo)
# A tibble: 72 × 5
   unit   geo       year OBS_VALUE flag 
   <chr>  <chr>    <dbl>     <dbl> <chr>
 1 output Austria   2023   37860   <NA> 
 2 growth Austria   2023      -1.8 <NA> 
 3 output Belgium   2023   37310   p    
 4 growth Belgium   2023       0.4 p    
 5 output Bulgaria  2023    7900   <NA> 
 6 growth Bulgaria  2023       2.2 <NA> 
 7 output Croatia   2023   15020   p    
 8 growth Croatia   2023       2.7 p    
 9 output Cyprus    2023   29080   p    
10 growth Cyprus    2023       1   p    
# ℹ 62 more rows

Using arrange()

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit)
# A tibble: 72 × 5
   unit   geo       year OBS_VALUE flag 
   <chr>  <chr>    <dbl>     <dbl> <chr>
 1 growth Austria   2023      -1.8 <NA> 
 2 output Austria   2023   37860   <NA> 
 3 growth Belgium   2023       0.4 p    
 4 output Belgium   2023   37310   p    
 5 growth Bulgaria  2023       2.2 <NA> 
 6 output Bulgaria  2023    7900   <NA> 
 7 growth Croatia   2023       2.7 p    
 8 output Croatia   2023   15020   p    
 9 growth Cyprus    2023       1   p    
10 output Cyprus    2023   29080   p    
# ℹ 62 more rows

Data pivoting

  • Information is often not uniquely represented by a single data frame structure.
  • Two or more data structures can represent the same information.
  • While this poses no problem conceptually, since the information is the same, the choice of representation can matter a great deal in practice.

Data frame structures

  • What is the most suitable data frame structure?
  • To answer this question, it is helpful to contemplate the following two questions:
    1. Which data frame structure is most convenient for the analysis?
    2. Which data frame structure is most appropriate for the computational environment?

Data frame structures

  1. Which data frame structure is most convenient for accessing information relevant to the analysis?
  • The answer to the first question is case-specific.
  • As a general rule of thumb, the most convenient data frame structure is the one that allows for the most straightforward access to the information relevant to the analysis.

Data frame structures

  2. Which data frame structure is most appropriate for the computational environment?
  • The answer to the second question depends on the tools we use.
  • Because R is vectorized (a vector-oriented language), it is more efficient to use a column-oriented data frame structure.
  • I.e., it is usually a good idea to have the information we use in computations frequently as columns (variables).
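As a small illustration of R's vector orientation, arithmetic on a column-like vector applies element-wise, with no explicit loop (toy values, not the course data):

```r
# Element-wise computation over a numeric vector (a column in a data frame)
growth <- c(-1.8, 0.4, 2.2)
growth / 100  # -0.018  0.004  0.022
```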

Wide and long formats

  • Examining the transformed data frame once more, we observe that each country appears in two rows.
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)
  • Here, we use print(n = 5) to display only the first five (instead of ten) rows and save some space.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)
# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)
# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows
  • One row holds growth data in OBS_VALUE.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)
# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows
  • The other holds output data in OBS_VALUE.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)
# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows
  • The data is in the so-called long format.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)
# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows
  • We want to transform the data into a wide format so that each variable appears in a separate column.

Pivoting, widening, and lengthening

  • The process of transforming data from long to wide format or vice versa is called data pivoting.
  • Transforming data from long to wide format is called widening.
  • Transforming data from wide to long format is called lengthening.

Pivoting, widening, and lengthening

  • In R, we can easily pivot data using the tidyr package.
    • We can use the pivot_wider() function to pivot data from long to wide format.
    • We can use the pivot_longer() function to pivot data from wide to long format.

A widening example

  • In our case, we want to widen the data, so we will use pivot_wider().
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  pivot_wider(names_from = unit, values_from = OBS_VALUE)
  • Using pivot_wider() requires that we specify two arguments:
    1. Existing variable(s) that hold column names (names_from),
    2. Existing variable(s) that hold values (values_from).

A widening example

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  pivot_wider(names_from = unit, values_from = OBS_VALUE)
# A tibble: 36 × 5
   geo                                    year flag  output growth
   <chr>                                 <dbl> <chr>  <dbl>  <dbl>
 1 Austria                                2023 <NA>   37860   -1.8
 2 Belgium                                2023 p      37310    0.4
 3 Bulgaria                               2023 <NA>    7900    2.2
 4 Switzerland                            2023 p      63870   -0.8
 5 Cyprus                                 2023 p      29080    1  
 6 Czechia                                2023 <NA>   18480   -1.2
 7 Germany                                2023 p      36290   -1.1
 8 Denmark                                2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2
# ℹ 26 more rows

Numbers do not lie

Numbers do not lie

  • Today’s science is empirical.
  • Even for highly complicated underlying systems, data-driven analyses can help us understand the relationships at play.
  • And provide insights with a clarity that is otherwise unattainable.

Numbers do not lie

  • Today’s science is empirical.
  • In practice, however, data analysis is only as impartial as the person conducting it.
  • Similar to other technological advancements, data analysis can be used for deception as much as it can be used for enlightenment.

Many shades of wrong

  • Aragão and Linsi (2022) studied cases where governments used statistics to misrepresent or opportunistically interpret data.
  • And introduced a data manipulation typology:
    • outright manipulation (type 1),
    • politically motivated guesstimating (type 2),
    • the opportunistic use of methodology space (type 3),
    • and indicators-management through indirect means (type 4).

Many shades of wrong

  • Aragão and Linsi (2022) studied three high-profile cases:
    • Argentinian government’s inflation statistics between 2007 and 2015,
    • Brazilian debt figures between 2012 and 2015, and
    • Greece’s public finance statistics in the 2000s.

Argentinian inflation statistics

  • After the 2001 default, the government introduced the “Coeficiente de Estabilización de Referencia” (CER) mechanism.
  • Creditors had the option to exchange foreign currency debt for CER-denominated bonds.
  • CER was based on a daily inflation index calculated by the National Institute of Statistics and Censuses (INDEC).
  • By 2007, 39% of the public debt was CER-denominated.

Argentinian inflation statistics

  • Problem: High inflation rates led to greater interest payments.

Argentinian inflation statistics

  • In 2005, Guillermo Moreno took the office of Secretary of Domestic Trade, appointed by President Néstor Kirchner.
  • In May 2006, Moreno met INDEC technicians, inquiring how the inflation indicator was being measured.
  • He asked for particular details of shops and products used in the index (why?).
  • Later, he argued that the current methodology of calculating the index was unpatriotic! (why?)

Argentinian inflation statistics

  • In January 2007, prices of lettuce and pre-paid mobile cards exhibited unexpected variation in the data.
  • Moreno argued that this unexpected variation had implications for the index calculation.
  • The government replaced the INDEC director, and the data and methodology used to calculate the index were revised.
  • The changes dropped inflation measures… but not enough, and not for long.

Argentinian inflation statistics

  • Gradually, all products with price variations greater than 15% were removed from the index!

Argentinian inflation statistics

Source: Aragão and Linsi (2022)

Even in the best of families

  • Cases of data manipulation (of any type) are not limited to governments and politicians.
  • High-profile cases have also been discovered in academia (see Rebel talent case).
  • And the private sector (see Ekphrasis case).

An introduction to importing data

Back at the sources

  • We used the readr package to load data from CSV files.
  • In this topic, we will take a closer look at the readr package and its functions.

A look at the CSV files

  • A very common way of sharing statistical data is the Comma-Separated Values (CSV) file format.
  • CSV data has a rectangular shape.
  • More often than not, CSV files come with a header row specifying the column names.
  • Values are separated by a comma.
  • Alphanumeric values are usually, but not always, enclosed in double quotes.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • We observe that the estat_sdg_08_10_en.csv file has a header row, from which the initial column names of the data frame used in the transformation overview topic were inferred.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • Values in each row are separated by commas.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • The values of some columns are not enclosed in double quotes.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • While values in other columns are enclosed in double quotes.
  • Observe that the values of these columns contain commas in their text.
  • We enclose such values in double quotes to avoid confusion with the column separator of the CSV file.

Importing CSV files

  • We can easily import CSV-formatted data into R using the read_csv() function from the readr package.
library(readr)

Importing data with read_csv()

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(file = "data/estat_sdg_08_10_en.csv")
  • In this case, we indicate to read_csv() that the file is stored locally on our computer, in the data directory.
  • We can specify the file location as an absolute path, as a relative path, or as a web address.
  • In this case, we used a relative path.

Relative paths

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(file = "data/estat_sdg_08_10_en.csv")
  • Relative paths are resolved based on the current working directory.
  • For instance, if the current working directory is C:/Users/user, then the relative path we supplied points to the file:

C:/Users/user/data/estat_sdg_08_10_en.csv
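The working directory that relative paths resolve against can be inspected and changed from within R (the path below is hypothetical):

```r
getwd()                   # print the current working directory
# setwd("C:/Users/user")  # change it (hypothetical path)
```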

Absolute paths

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(file = "/home/user/data/estat_sdg_08_10_en.csv")
  • Absolute paths are resolved based on the root directory of the file system.
  • If the file argument we supply starts with a /, then read_csv() interprets it as an absolute path.
  • The file argument points to the exact location of the file in the file system.

Web addresses

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(
  file = "https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/sdg_08_10/1.0"
)
  • Web addresses refer to files stored on remote servers.
  • If the file argument we supply starts with a protocol (e.g., http:// or https://), then read_csv() interprets it as a web address.
  • The file is downloaded from the web and imported into R.

Reading CSV data

  • Calling read_csv() returns a tibble (data frame) with the contents of the CSV file.
  • We can assign the result to a variable to store the data frame in memory.
  • Assigning objects to variable names in R is done with the assignment operator: <-.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")

Reading CSV data

  • Calling read_csv() returns a tibble (data frame) with the contents of the CSV file.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
  • We can read the last expression from right to left as “the result of reading the CSV file is assigned to the variable sdg.”

Reading CSV data

  • By assigning the data to the sdg variable, we do not need to read the CSV file again to access the data.
sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
read_csv(file = "data/estat_sdg_08_10_en.csv")
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • We are informed that 1845 rows and 9 columns were imported.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • The delimiter used to separate values is a comma.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Seven columns were parsed as character vectors (chr).
  • Two columns were parsed as double precision numbers (dbl). We can roughly consider them as real numbers.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • We can get more information about the data types using spec().
  • We can suppress this message with a couple of options.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv", show_col_types = FALSE)
  • One way is to pass show_col_types = FALSE to read_csv().
  • Another way is to manually specify (some of the) column types.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • Column types are specified as a list in the col_types argument.
  • Each item in the list specifies a column name and a corresponding type.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • The unit column is parsed as a factor (categorical variable).

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • The TIME_PERIOD column is parsed as an integer variable.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • The LAST UPDATE column is parsed as a date-time variable.

Importing files with other formats

  • The readr package provides a family of functions to import data from various file formats.
  • Their calling conventions are similar to read_csv().

Comma-separated data formats

Function Data Format Description
read_csv() CSV Reads comma-separated values
read_csv2() CSV Reads semicolon-separated (;) values
read_tsv() TSV Reads tab-separated values
read_delim() Delimiter-Separated Values Reads values separated by a custom delimiter
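These variants are called analogously to read_csv(); the file names below are hypothetical placeholders:

```r
library(readr)

# Semicolon-separated values (common where the comma is the decimal mark)
# read_csv2(file = "data/example.csv")

# Tab-separated values
# read_tsv(file = "data/example.tsv")

# Custom delimiter, e.g., the pipe character
# read_delim(file = "data/example.txt", delim = "|")
```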

Other data formats

Function Data Format Description
read_fwf() Fixed Width Format Reads fixed-width formatted values
read_table() Table Special case of fixed-width format
read_log() Log files Reads log files

An introduction to modern development tools

Integrated Development Environment

  • An Integrated Development Environment (IDE) is a software application providing tools to facilitate programming activities.

Overview

  • There are many IDEs with varying feature sets.

Overview

  • IDEs can specialize in a particular programming language or support multiple languages.
  • Although the provided features vary across IDEs and languages, at a minimum, modern IDEs provide tools for editing, executing, and debugging code.

More than a text editor

A long time ago in a galaxy far, far away…

More than a text editor

  • Code editing was done in a text editor, generating source code files.
  • Source code was compiled with a software application called the compiler, generating machine language files.
  • Machine language files were linked with one another and with pre-existing libraries via a software application called the linker, generating an executable program.
  • If the program did not execute as expected, it was passed to a software application called the debugger to identify potential issues.

More than a text editor

  • The (development) process of editing, translating, executing, and debugging was repeated until the program was working as expected.
  • During these iterations, one had to switch between different environments and tools many times.
  • The idea of IDEs is to integrate all needed tools in a single environment to increase productivity.

Modern IDEs

  • Modern IDEs have extended their feature sets beyond the basic development cycle:
  • Code navigation.
  • Code completion.
  • Code refactoring and renaming.
  • Code formatting.
  • And many more…

Modern IDEs

  • And many more…
  • So many that programming with an IDE is an entirely different experience from programming in the same language without one, or with a different one.
  • The efficiency gains of working with an IDE led to a proliferation of IDEs.
  • IDE providers strived to include as many modern features as possible, so as not to fall behind the competition.

Modern IDEs

  • Further, with more and more programming languages and frameworks being developed, the number of times that the same feature had to be implemented grew quadratically.

Modern IDEs

  • For \(L\) languages and \(I\) IDEs, every feature had to be implemented \(L \times I\) times.

Modern IDEs

  • Besides work duplication, not all feature implementations were identical across languages and IDEs.
  • So what if one learns to program with an IDE and then works in a company that uses a different one?

Language Server Protocol

A new hope

What is the LSP?

  • The Language Server Protocol (LSP) is a specification of communication rules between IDEs and language servers.
  • Originally developed by Microsoft.
  • In 2016, Microsoft partnered with Red Hat and Codenvy to develop an open standard for the LSP.
  • Today, the LSP is widely adopted across IDEs and programming languages.

Why was it so successful?

  • By specifying the communication rules, the LSP reduces the \(L \times I\) to an \(L + I\) implementation problem.
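As a back-of-the-envelope illustration (the counts below are made up for the example, not taken from any survey):

```r
# Hypothetical counts: 20 languages, 30 IDEs.
L <- 20
I <- 30
L * I # without a shared protocol: one implementation per (language, IDE) pair
L + I # with the LSP: one server per language plus one client per IDE
```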

Why was it so successful?

  • Programming languages implement language servers that understand the LSP.
  • IDEs implement clients that understand the LSP.
  • No need for feature implementation duplication.

Why should I care?

  • Added benefit: More consistent feature implementations across IDEs.
  • Learning to program with an IDE that supports the LSP makes it easier to switch to another IDE that also supports the LSP.

LSP in R?

  • The languageserver package provides an LSP implementation for the R programming language.
  • It supports:
    • Code completion.
    • Code navigation.
    • Code formatting (via the styler package).
    • Code refactoring.
    • Code linting (via the lintr package).

AI coding assistants

The force awakens

Large Language Models

  • Large language models are machine learning models for natural language processing.
  • They receive as input a sequence of tokens (usually words or parts of words) and output a sequence of tokens.
  • Their output is one of the most likely sequences of tokens given the input.
  • Since programs are sequences of keywords and symbols, one of the most successful applications of large language models is code generation.

Large Language Models

  • There is enormous commercial interest in AI that can generate safe and efficient code at a lower cost than human developers.
  • However, whether this is indeed feasible is not straightforward to answer.
  • Unlike humans, generative AI models do not generate code from requirements but from statistical patterns of what is most likely to follow.
  • Asking a generative AI model to generate code for the same task multiple times results in different outputs.

Reproducibility

Reproducibility

  • Every time we query the model, it generates a solution from scratch.
  • Humans generating code for the same task a second time are more likely to work on enhancing the existing code instead of starting from scratch.

Reproducibility

  • This is, perhaps, not a big issue for small programming tasks.
  • But what if your task involves thousands of lines of code distributed across multiple files?
  • Is it feasible to examine and deal with the complexity of the generated code every time from scratch?
  • This raises doubts about the long-term maintainability of AI-generated code.

Hallucinations

  • Another issue with generative AI models is hallucinations.
  • Hallucinations in code generation manifest in a few ways.

Hallucinations

  • Hallucinations in code generation manifest in a few ways.
  • When working with self-developed or niche libraries, the model may not have seen enough examples to generate correct code.
  • It is likely to generate code that is syntactically correct but calls functions or imports modules that do not exist.

Hallucinations

  • Hallucinations in code generation manifest in a few ways.
  • When you have a logical error in your code, the model may generate code that is syntactically correct but reinforces or replicates the logical error.
  • This is because the model does not generate code based on requirements but based on what is more likely to follow what you have already written.

Hallucinations

var1 <- 1
var2 <- 2
var3 <- 3

first_task(var1)
second_task(var1) # logical error: var2 was supposed to be used here
third_task(var1) # potential completion suggestion

Context awareness

  • Another issue commonly encountered with coding assistants is the lack of context awareness.
  • Many implementations can correctly solve a programming task.
  • Several of them may be equally efficient, safe, and maintainable.
  • However, not all implementations are equally appropriate for all contexts.

Context awareness

  • Another issue commonly encountered with coding assistants is the lack of context awareness.
  • Some implementations may fit better when paired with other parts of the code and the overarching goals of the project.
  • Nonetheless, coding assistants do not always have information about the project’s goals or the rest of the codebase.
  • Eventually, the evaluation of the appropriateness of a generated solution remains a human task.

Working with AI coding assistants

  • Consequently, AI coding assistants fundamentally change the way we program.
  • Researching a solution:
    • Without: More tedious and time-consuming. Reading documentation, searching for existing solutions, implementations, and libraries.
    • With: Automatically generated.

Working with AI coding assistants

  • Consequently, AI coding assistants fundamentally change the way we program.
  • Editing code:
    • Without: More manual and slow. Omissions, logical errors, and typos can creep in.
    • With: More automated. Omissions, logical errors, and typos can still creep in.

Working with AI coding assistants

  • Consequently, AI coding assistants fundamentally change the way we program.
  • Reviewing and debugging code:
    • Without: Easier to review self-written code because the logic is known.
    • With: Harder to review. Need to understand the logic. Need to understand how used functions and modules work (documentation).

No free lunch

  • Consequently, AI coding assistants fundamentally change the way we program.
  • Overall, working with AI coding assistants removes responsibilities from the research stage but creates new ones in the reviewing stage.

More on data transformation

Prerequisites

  • We examine some aspects of dplyr in more detail.
library(dplyr)

Prerequisites

sdg <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

Programming digression: R namespaces

  • The snippet for importing the sdg data slightly differs from the code we used in the data import overview topic.

Programming digression: R namespaces

sdg <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • vs:
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

Programming digression: R namespaces

  • We have modified all calls to the readr package’s functions!
    • read_csv \(\rightarrow\) readr::read_csv
    • col_factor \(\rightarrow\) readr::col_factor
    • col_integer \(\rightarrow\) readr::col_integer
    • col_datetime \(\rightarrow\) readr::col_datetime
  • However, we have never loaded the readr package!
  • And the two snippets execute exactly the same underlying operations!

Programming digression: R namespaces

  • Many programming languages, but not all, have a namespace feature.
  • A namespace is a collection of identifiers, such as function names, variable names, and other symbols, within a single scope (or context).

Programming digression: R namespaces

  • Organizing and managing function and object names in namespaces is a way to avoid conflicts between different modules.

Programming digression: R namespaces

  • Each package in R has its own namespace.
  • When importing a package with library(), all the functions and objects in the package’s namespace become available globally.
  • This is why, after library(dplyr), we can use select(), filter(), etc., without specifying the package.

Programming digression: R namespaces

  • Why would one want to use readr::read_csv instead of read_csv?
    • After all, read_csv is shorter and easier to type.
  1. Readability: It is easier to understand the code.
  2. Avoiding conflicts: If two packages have functions with the same name, you can specify which one you want to use.

Programming digression: R namespaces

  1. Readability: It is easier to understand the code.
  • First, readr::read_csv directly tells the reader where the read_csv function comes from.
  • This is relevant both for your coworkers when you share your code with them and for your future self reading the code.
  • There is no need to scroll up to see which packages are loaded.

Programming digression: R namespaces

  2. Avoiding conflicts: If two packages have functions with the same name, you can specify which one you want to use.
  • Let’s revisit what happens when we load the dplyr package for the first time in an R session.

Programming digression: R namespaces

  • Let’s revisit what happens when we load the dplyr package for the first time in an R session.
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Programming digression: R namespaces

  • If instead of loading dplyr, we use its functions with the dplyr:: prefix, we avoid masking functions from other namespaces.
  • This is less error-prone because we avoid scenarios where we load a package and forget that it masks a function from another package we also use.
  • However, it makes dependency management a bit more cumbersome because the loaded packages are not listed at a single point in the code.
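A concrete case of masking, using only base R: stats::filter is the time-series filter that dplyr::filter masks. The double-colon form names the namespace explicitly, so it resolves unambiguously whether or not any package is attached.

```r
# stats::filter exists independently of dplyr; inspecting its enclosing
# environment shows which namespace the bare name would resolve to.
environmentName(environment(stats::filter))
# After library(dplyr), a bare `filter` would resolve to dplyr's version,
# while stats::filter would remain reachable through its namespace.
```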

Creating boolean variables

  • A brief reminder of where we left off:
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)
  • We have used mutate() to modify the values of the unit column.
  • Now, we will use mutate() to create new columns.

Storing an intermediate result

  • A brief reminder of where we left off:
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)
# A tibble: 36 × 5
   geo                                    year flag  output growth
   <chr>                                 <int> <chr>  <dbl>  <dbl>
 1 Austria                                2023 <NA>   37860   -1.8
 2 Belgium                                2023 p      37310    0.4
 3 Bulgaria                               2023 <NA>    7900    2.2
 4 Switzerland                            2023 p      63870   -0.8
 5 Cyprus                                 2023 p      29080    1  
 6 Czechia                                2023 <NA>   18480   -1.2
 7 Germany                                2023 p      36290   -1.1
 8 Denmark                                2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2
# ℹ 26 more rows

Storing an intermediate result

  • To facilitate the presentation, we assign the intermediate result to a new variable.
sdg_temp <- sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

Adding the state boolean variable

  • We observe that the geo column contains both states and coalitions of states.
  • We can use the function pull() from dplyr to examine the values of geo.
  • There are other ways to access the values of a data frame column, but pull() works well with the pipe operator and, hence, stacks well together with other dplyr transformations, such as distinct() and arrange().

Adding the state boolean variable

sdg_temp |>
  distinct(geo) |>
  arrange(geo) |>
  pull()
 [1] "Austria"                                  
 [2] "Belgium"                                  
 [3] "Bulgaria"                                 
 [4] "Croatia"                                  
 [5] "Cyprus"                                   
 [6] "Czechia"                                  
 [7] "Denmark"                                  
 [8] "Estonia"                                  
 [9] "Euro area - 19 countries  (2015-2022)"    
[10] "Euro area – 20 countries (from 2023)"     
[11] "European Union - 27 countries (from 2020)"
[12] "Finland"                                  
[13] "France"                                   
[14] "Germany"                                  
[15] "Greece"                                   
[16] "Hungary"                                  
[17] "Iceland"                                  
[18] "Ireland"                                  
[19] "Italy"                                    
[20] "Latvia"                                   
[21] "Lithuania"                                
[22] "Luxembourg"                               
[23] "Malta"                                    
[24] "Montenegro"                               
[25] "Netherlands"                              
[26] "Norway"                                   
[27] "Poland"                                   
[28] "Portugal"                                 
[29] "Romania"                                  
[30] "Serbia"                                   
[31] "Slovakia"                                 
[32] "Slovenia"                                 
[33] "Spain"                                    
[34] "Sweden"                                   
[35] "Switzerland"                              
[36] "Türkiye"                                  

Adding the state boolean variable

  • We want to create a state binary variable with the value TRUE if the observation corresponds to a state, and the value FALSE otherwise.
  • Since all coalition observations include the string "Euro", we can easily create the new boolean variable using grepl.

Adding the state boolean variable

  • Since all coalition observations include the string "Euro", we can easily create the new boolean variable using grepl.
sdg_temp |>
  mutate(state = !grepl("Euro", geo))
# A tibble: 36 × 6
   geo                                    year flag  output growth state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>
 1 Austria                                2023 <NA>   37860   -1.8 TRUE 
 2 Belgium                                2023 p      37310    0.4 TRUE 
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE 
 4 Switzerland                            2023 p      63870   -0.8 TRUE 
 5 Cyprus                                 2023 p      29080    1   TRUE 
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE 
 7 Germany                                2023 p      36290   -1.1 TRUE 
 8 Denmark                                2023 <NA>   52510    1.8 TRUE 
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE
# ℹ 26 more rows

Adding the state boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo))
# A tibble: 36 × 6
   geo                                    year flag  output growth state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>
 1 Austria                                2023 <NA>   37860   -1.8 TRUE 
 2 Belgium                                2023 p      37310    0.4 TRUE 
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE 
 4 Switzerland                            2023 p      63870   -0.8 TRUE 
 5 Cyprus                                 2023 p      29080    1   TRUE 
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE 
 7 Germany                                2023 p      36290   -1.1 TRUE 
 8 Denmark                                2023 <NA>   52510    1.8 TRUE 
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE
# ℹ 26 more rows
  • The new variable is placed by default as the last column of the data frame.

Adding the state boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo), .after = "geo")
# A tibble: 36 × 6
   geo                                   state  year flag  output growth
   <chr>                                 <lgl> <int> <chr>  <dbl>  <dbl>
 1 Austria                               TRUE   2023 <NA>   37860   -1.8
 2 Belgium                               TRUE   2023 p      37310    0.4
 3 Bulgaria                              TRUE   2023 <NA>    7900    2.2
 4 Switzerland                           TRUE   2023 p      63870   -0.8
 5 Cyprus                                TRUE   2023 p      29080    1  
 6 Czechia                               TRUE   2023 <NA>   18480   -1.2
 7 Germany                               TRUE   2023 p      36290   -1.1
 8 Denmark                               TRUE   2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022) FALSE  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)  FALSE  2023 <NA>   32150   -0.2
# ℹ 26 more rows
  • We can modify this default behavior by passing the argument .after to mutate().

Adding the state boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo), .after = "geo")
# A tibble: 36 × 6
   geo                                   state  year flag  output growth
   <chr>                                 <lgl> <int> <chr>  <dbl>  <dbl>
 1 Austria                               TRUE   2023 <NA>   37860   -1.8
 2 Belgium                               TRUE   2023 p      37310    0.4
 3 Bulgaria                              TRUE   2023 <NA>    7900    2.2
 4 Switzerland                           TRUE   2023 p      63870   -0.8
 5 Cyprus                                TRUE   2023 p      29080    1  
 6 Czechia                               TRUE   2023 <NA>   18480   -1.2
 7 Germany                               TRUE   2023 p      36290   -1.1
 8 Denmark                               TRUE   2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022) FALSE  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)  FALSE  2023 <NA>   32150   -0.2
# ℹ 26 more rows
  • Observe that we have used the exclamation mark operator ! before grepl().

Programming digression: Combining logical conditions

  • Observe that we have used the exclamation mark operator ! before grepl().
  • Logical conditions in R (as in most, if not all, programming languages) can be combined or negated.

Programming digression: Combining logical conditions

  • The ! operator negates the logical value of the condition.
!TRUE
[1] FALSE
!FALSE
[1] TRUE
!(2 > 1)
[1] FALSE
!grepl("Sanna", "Sanna is here")
[1] FALSE

Programming digression: Combining logical conditions

  • Observe that we have used the exclamation mark operator ! before grepl().
  • The call grepl("Euro", geo) returns TRUE if the string "Euro" is found in the geo column and FALSE otherwise.
  • We instead want a condition that is TRUE when geo is a state, i.e., when "Euro" is not found in the geo column.
  • Thus, we have to negate the result of grepl("Euro", geo).

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • For example, suppose we want to create a new variable that is TRUE when geo is a state and output is above 50,000.
  • We already know how to check if geo is a state with !grepl("Euro", geo).
  • We can check if output is above 50,000 with output > 50000.

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • For example, suppose we want to create a new variable that is TRUE when geo is a state and output is above 50,000.
  • Combining these with a logical AND operator gives us the desired condition.
  • In R, the logical AND operator is &.
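A minimal sketch of how & combines conditions elementwise (toy values here, not the sdg data):

```r
# & is TRUE only where both conditions are TRUE, element by element.
state <- c(TRUE, TRUE, FALSE)
output <- c(37860, 63870, 32340)
state & output > 50000
# FALSE TRUE FALSE
```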

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
sdg_temp |>
  mutate(rich_state = !grepl("Euro", geo) & output > 50000)
# A tibble: 36 × 6
   geo                                    year flag  output growth rich_state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>     
 1 Austria                                2023 <NA>   37860   -1.8 FALSE     
 2 Belgium                                2023 p      37310    0.4 FALSE     
 3 Bulgaria                               2023 <NA>    7900    2.2 FALSE     
 4 Switzerland                            2023 p      63870   -0.8 TRUE      
 5 Cyprus                                 2023 p      29080    1   FALSE     
 6 Czechia                                2023 <NA>   18480   -1.2 FALSE     
 7 Germany                                2023 p      36290   -1.1 FALSE     
 8 Denmark                                2023 <NA>   52510    1.8 TRUE      
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE     
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE     
# ℹ 26 more rows

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • We can also combine conditions using logical OR operators.
  • In R, logical OR is denoted with |.

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • For example, we can create a logical condition that is TRUE when geo is either Germany or Cyprus.
  • We check if geo is Germany with geo == "Germany".
  • We check if geo is Cyprus with geo == "Cyprus".
  • Combined, we write geo == "Germany" | geo == "Cyprus".
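An equivalent, often tidier way to express this membership test is the %in% operator (shown on a small toy vector, not the sdg data):

```r
geo <- c("Austria", "Cyprus", "Germany")
geo == "Germany" | geo == "Cyprus" # FALSE TRUE TRUE
geo %in% c("Germany", "Cyprus")    # same result; scales to longer value sets
```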

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
sdg_temp |>
  mutate(ger_or_cy = geo == "Germany" | geo == "Cyprus")
# A tibble: 36 × 6
   geo                                    year flag  output growth ger_or_cy
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>    
 1 Austria                                2023 <NA>   37860   -1.8 FALSE    
 2 Belgium                                2023 p      37310    0.4 FALSE    
 3 Bulgaria                               2023 <NA>    7900    2.2 FALSE    
 4 Switzerland                            2023 p      63870   -0.8 FALSE    
 5 Cyprus                                 2023 p      29080    1   TRUE     
 6 Czechia                                2023 <NA>   18480   -1.2 FALSE    
 7 Germany                                2023 p      36290   -1.1 TRUE     
 8 Denmark                                2023 <NA>   52510    1.8 FALSE    
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE    
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE    
# ℹ 26 more rows

Discretizing continuous variables

  • We have extensively used the income variable in the visualization overview topic to color and shape various geometric objects.
  • The income variable was a categorical variable with three levels: low, middle, and high.
  • How can we create it?

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We want to create a factor variable income with three levels: low, middle, and high.
  • One idea is to pick two output thresholds.
    • A low threshold, below which we consider the output as low.
    • A high threshold, above which we consider the output as high.
    • And use nested ifelse() statements to create the income variable.

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We want to create a factor variable income with three levels: low, middle, and high.
  • Let us pick 15,200 and 37,400 as the low and high output thresholds, respectively.
if (output > 37400) {
  "high"
} else if (output > 15200) {
  "middle"
} else {
  "low"
}

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We want to create a factor variable income with three levels: low, middle, and high.
  • Let us pick 15,200 and 37,400 as the low and high output thresholds, respectively.
  • Or, with inline code:
ifelse(output > 37400, "high", ifelse(output > 15200, "middle", "low"))
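Because ifelse() is vectorized, the nested expression classifies a whole column at once. A minimal sketch with illustrative output values:

```r
output <- c(7900, 29080, 63870) # toy values, not the full sdg data
ifelse(output > 37400, "high", ifelse(output > 15200, "middle", "low"))
# "low" "middle" "high"
```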

Adding the income categorical variable: Attempt 1 (non-reusable)

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = ifelse(
      output > 37400,
      "high",
      ifelse(output > 15200, "middle", "low")
    )
  )
# A tibble: 36 × 7
   geo                                    year flag  output growth state income
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl> <chr> 
 1 Austria                                2023 <NA>   37860   -1.8 TRUE  high  
 2 Belgium                                2023 p      37310    0.4 TRUE  middle
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE  low   
 4 Switzerland                            2023 p      63870   -0.8 TRUE  high  
 5 Cyprus                                 2023 p      29080    1   TRUE  middle
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE  middle
 7 Germany                                2023 p      36290   -1.1 TRUE  middle
 8 Denmark                                2023 <NA>   52510    1.8 TRUE  high  
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE middle
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE middle
# ℹ 26 more rows

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We have successfully created the income variable.
  • Nonetheless, this approach is not reusable.
  • First, we have hard-coded the thresholds.
  • Second, we picked the thresholds arbitrarily.
  • Third, what if we want to create a variable with more levels?
  • Fourth, what if we want to use the solution with another dataset?

Adding the income categorical variable: Attempt 2 (reusable)

  • A better approach is to use the cut() and quantile() functions.
  • A country is considered low income if its output is below the 25th percentile.
  • A country is considered high income if its output is above the 75th percentile.
  • And a country is considered middle income otherwise.

Using quantile()

  • We can take a look at how to use the quantile() function by accessing its documentation.
?quantile

Using quantile()

?quantile

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • The x argument in our case is the output variable.

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • We need to use the probs argument to specify the percentiles we want to use.

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • The default value of the probs argument, seq(0, 1, 0.25), sets the quartile probabilities \([0, 0.25, 0.5, 0.75, 1]\) (why?).

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • We want only the 25th and 75th percentiles as thresholds. Thus, we set probs = c(0, 0.25, 0.75, 1).

Using quantile()

  • This gives the two output thresholds we are after.
quantile(sdg_temp$output, c(0, 0.25, 0.75, 1))
     0%     25%     75%    100% 
 6500.0 15192.5 37447.5 83320.0 
  • We use these thresholds to classify observations according to which of the three induced bins their output falls into:
  • \([0, 15192.5]\): low income
  • \((15192.5, 37447.5]\): middle income
  • \((37447.5, 83320]\): high income

Using cut()

  • We can construct this classification using the cut() function.
  • The cut() function expects a numeric vector and a set of breaks.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • The result is less than stellar.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • We have gotten three levels… but they are not named low, middle, and high.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1))
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • Moreover, we have gotten an NA value.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1))
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • Relatedly, one of the levels (the lowest) is open on the left, which is not exactly what we wanted.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] 
[25] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04]  (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • We can instruct cut() to include the lower bound in the first bin by setting include.lowest = TRUE.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] 
[25] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04]  (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • This also fixes the NA issue (why?).
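A minimal toy reproduction (unrelated to the course data) shows where the NA came from: by default the first interval is open on the left, so the minimum value falls outside every bin; include.lowest = TRUE closes that interval.

```r
x <- c(1, 2, 3)

# Default: intervals (1,2] and (2,3]; the value 1 matches neither, so it becomes NA
cut(x, breaks = c(1, 2, 3))

# include.lowest = TRUE turns the first interval into [1,2], so 1 is binned
cut(x, breaks = c(1, 2, 3), include.lowest = TRUE)
```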

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE,
  labels = c("low", "middle", "high")
)
 [1] high   middle low    high   middle middle middle high   middle middle
[11] middle middle middle middle middle middle low    low    high   high  
[21] middle middle high   low    low    middle high   high   low    middle
[31] low    low    high   middle middle low   
Levels: low middle high
  • Finally, we can instruct cut() to label the bins as low, middle, and high by setting the labels argument.

Adding the income categorical variable: Attempt 2 (reusable)

  • Combining the cut() function with the quantile() function, we can create the income variable.
  • In this approach, we do not arbitrarily choose and hard-code the thresholds.
  • And the solution we programmed can be reused if the scale of the output variable changes.
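One way to make the reuse explicit is to wrap the combination in a small helper function. The name discretize_quartiles below is hypothetical (it is not from the course material), and the sketch assumes the three quartile-based breaks are distinct:

```r
# Bin a numeric vector into low/middle/high using its empirical quartiles
discretize_quartiles <- function(x) {
  cut(
    x = x,
    breaks = quantile(x, probs = c(0, 0.25, 0.75, 1), na.rm = TRUE),
    include.lowest = TRUE,
    labels = c("low", "middle", "high")
  )
}

# The same code works unchanged when the scale of the input changes
discretize_quartiles(c(1, 2, 3, 4, 100))
discretize_quartiles(1000 * c(1, 2, 3, 4, 100))
```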

Adding the income categorical variable: Attempt 2 (reusable)

  • Combining the cut() function with the quantile() function, we can create the income variable.
  • More generally, the combination of cut() and quantile() can be used to discretize continuous variables into categorical ones based on empirical percentiles.

Adding the income categorical variable: Attempt 2 (reusable)

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
# A tibble: 36 × 7
   geo                                    year flag  output growth state income
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl> <fct> 
 1 Austria                                2023 <NA>   37860   -1.8 TRUE  high  
 2 Belgium                                2023 p      37310    0.4 TRUE  middle
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE  low   
 4 Switzerland                            2023 p      63870   -0.8 TRUE  high  
 5 Cyprus                                 2023 p      29080    1   TRUE  middle
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE  middle
 7 Germany                                2023 p      36290   -1.1 TRUE  middle
 8 Denmark                                2023 <NA>   52510    1.8 TRUE  high  
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE middle
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE middle
# ℹ 26 more rows

A summary of the GDP data frame transformations

  • We summarize here the transformations we applied to the estat_sdg_08_10_en data to create the gdp data frame.
  • We want to have the code constructing the gdp data frame, from import to final structure, in a single place.
  • Furthermore, using our current understanding, we apply some improvements to the code to make it more concise, self-contained, and beautiful!

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
  • First, we observe that LAST UPDATE is not used in the final structure or in any intermediate one.
  • Keeping its type definition in the read_csv() call only adds noise when reading the code.

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
  • Second, the unit column type is defined in read_csv(), but, because unit is the pivoting variable in pivot_wider(), it does not appear in the final structure.
  • In contrast, the categorical variable flag (originally OBS_FLAG), which does appear in the final structure, is never assigned a type.

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
  • Third, we can use the dplyr:: prefix to call the dplyr functions directly.
  • This informs the reader where each function comes from and avoids attaching the entire dplyr namespace, which could mask other functions.

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  dplyr::select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  dplyr::rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  dplyr::mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  dplyr::filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  dplyr::mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

The ict data frame

  • Starting from estat_isoc_sks_itspt_en.csv, we can apply similar transformations to create the ict data frame (How?).

A view at the ict data

The ict time series

  • When working on creating the gdp data frame, we have filtered the data to keep only the year 2023.
  • In the ict data frame, we have kept the time dimension of the data.
  • For each country, we have multiple observations.
  • How many observations do we have for each country?

Counting observations

  • How many observations do we have for each country?
  • We can use the count() function to answer the question.
  • Similar to other dplyr functions, count() takes the data frame as its first argument.
  • We can then specify the variables in the data frame that dictate our counting logic.
  • The result of count() gives the number of observations for each unique combination of values in the specified variables.

Counting observations

  • How many observations do we have for each country?
ict |>
  dplyr::count(geo)
# A tibble: 37 × 2
   geo                        n
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows
  • count() selects the geo variable.
  • Filters out the duplicate rows of the selected variables.
  • Creates a new column n of integer type.
  • And fills it with the number of observations for each value of geo.

Counting observations

  • If we call count() without specifying any variables, we get the total number of observations in the data frame.
ict |>
  dplyr::count()
# A tibble: 1 × 1
      n
  <int>
1   654

Counting observations

  • We can also use count() with more than one variable.
ict |>
  dplyr::mutate(after_2020 = year > 2020) |>
  dplyr::count(geo, after_2020)
# A tibble: 70 × 3
   geo                    after_2020     n
   <fct>                  <lgl>      <int>
 1 Austria                FALSE         17
 2 Austria                TRUE           3
 3 Bosnia and Herzegovina TRUE           3
 4 Belgium                FALSE         17
 5 Belgium                TRUE           3
 6 Bulgaria               FALSE         17
 7 Bulgaria               TRUE           3
 8 Switzerland            FALSE         10
 9 Switzerland            TRUE           3
10 Cyprus                 FALSE         17
# ℹ 60 more rows

Counting observations

  • The transformation of count() is a group-wise operation.
  • Behind the scenes, count() groups the data frame by the specified variables.
  • Counting is a group-wise operation that appears often in data science applications, but it is far from the only one.
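For intuition, base R's table() performs the same tally as count() would; here on a toy data frame (toy is a made-up example, not the ict data):

```r
toy <- data.frame(geo = c("A", "A", "B", "A", "B"))

# table() counts observations per unique value of geo,
# like dplyr::count(toy, geo) would
table(toy$geo)
```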

Group-Wise operations

  • We take a closer look at group-wise transformations.
  • We start with manually replicating the behavior of count() to get a first idea of how grouping works.
  • Subsequently, we will examine more group-wise transformations.

Manually replicating count()

  • We want to group the data by country and count the number of observations for each country.
  • First, we need to instruct dplyr that we want to group the data by country.
  • We can use the group_by() function.

Using group_by()

  • We can use the group_by() function.
  • Similar to other dplyr functions, group_by() takes the data frame as its first argument.
  • We can then specify the variables in the data frame based on which we want to create groups.

Using group_by()

ict |>
  dplyr::group_by(geo)
# A tibble: 654 × 4
# Groups:   geo [37]
   geo      year ict_percentage ict_thousand_persons
   <fct>   <int>          <dbl>                <dbl>
 1 Austria  2004            2.9                 106.
 2 Austria  2005            3.1                 116 
 3 Austria  2006            3.3                 126 
 4 Austria  2007            3.3                 128.
 5 Austria  2008            3.3                 133.
 6 Austria  2009            3.5                 139.
 7 Austria  2010            3.5                 141.
 8 Austria  2011            3.6                 145 
 9 Austria  2012            3.6                 147 
10 Austria  2013            3.7                 153.
# ℹ 644 more rows
  • We observe a new line in the data frame standard output, informing us that the data frame is grouped.
  • The grouping variable is geo.
  • And there are in total 37 groups in the data.
  • Other than grouping, the data frame did not change.

Transformations per group

  • As long as a data frame is grouped, we can execute transformation operations within each group instead of on the complete data frame at once.
  • If we instruct dplyr to count observations on a grouped data frame, it counts the number of observations in each group.

Transformations per group

  • The most usual grouped operation is to create a summary.
  1. A summary calculates a summary statistic per group.
  2. It reduces the data frame such that we get a single row for each group.
  3. And it assigns the calculated statistic to each group.
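These steps can be sketched in base R with tapply(), which applies a summary function within each group; the toy data frame below is made up for illustration:

```r
toy <- data.frame(geo = c("A", "A", "B"), v = c(1, 3, 10))

# One summary statistic per group: the mean of v within each value of geo
tapply(toy$v, toy$geo, mean)
```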

Using summarize()

  • The most usual grouped operation is to create a summary.
  • In dplyr, summarizing operations are performed via summarize().
  • The basic calling interface of summarize() is similar to that of mutate().
  • The first argument is a data frame.
  • Then, we specify a series of summarizing transformations that create new columns.

Using summarize()

  • The most usual grouped operation is to create a summary.
  • In dplyr, summarizing operations are performed via summarize().
  • In contrast to mutate(), summarize() returns a new data frame containing
    • the grouping variables,
    • the newly created summarizing variables, and
    • only one row per group.

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())
# A tibble: 37 × 2
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows
  • We use dplyr’s function n() as the summarizing statistic in this example.
  • The function n() counts the number of observations in each group.
  • We assign the result of n() to a new variable nobs.
  • Observe that, after calling summarize(), the resulting data frame is not grouped.

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())
  • In most cases, the default behavior of summarize() is to drop the last (most nested) grouping variable.
  • Since, in this example, the only grouping variable is geo, removing grouping based on geo results in an ungrouped data frame.
  • We can override the default behavior by passing the argument .groups = "keep" to summarize().

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")
# A tibble: 37 × 2
# Groups:   geo [37]
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")
# A tibble: 37 × 2
# Groups:   geo [37]
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows
  • In addition, summarize() returns only the grouping and newly created variables.

Using summarize()

  • In addition, summarize() returns only the grouping and newly created variables.
  • What if we want to retain all variables and assign the corresponding number of observations within each group to a new column?

Using mutate() with groups

  • We want to retain all variables and assign the corresponding number of observations within each group in a new column.
  • In such a case, we can combine group_by(), mutate(), and n() to achieve the desired result.

Using mutate() with groups

  • We want to retain all variables and assign the corresponding number of observations within each group in a new column.
  • In a grouped data frame, mutate() applies its transformations within each group.
  • It creates the corresponding new variables.
  • Assigns the transformation results for each group to the new variables.
  • And retains all other variables in the data frame.

Using mutate() with groups

ict |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::arrange(nobs)
# A tibble: 654 × 5
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons  nobs
   <fct>                  <int>          <dbl>                <dbl> <int>
 1 Bosnia and Herzegovina  2021            1.5                 17.6     3
 2 Bosnia and Herzegovina  2022            1.7                 19.3     3
 3 Bosnia and Herzegovina  2023            2                   24.3     3
 4 United Kingdom          2011            4.8               1392.      9
 5 United Kingdom          2012            5                 1461.      9
 6 United Kingdom          2013            4.9               1477.      9
 7 United Kingdom          2014            5                 1517.      9
 8 United Kingdom          2015            5.2               1624.      9
 9 United Kingdom          2016            5.3               1674       9
10 United Kingdom          2017            5.2               1657.      9
# ℹ 644 more rows
  • We observe that mutate() keeps all the rows of the original data frame.
  • It assigns the same value (the number of observations) to all group rows.
  • And, unlike summarize(), maintains the grouping.

Using mutate() with groups

ict |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::ungroup()
# A tibble: 654 × 5
   geo      year ict_percentage ict_thousand_persons  nobs
   <fct>   <int>          <dbl>                <dbl> <int>
 1 Austria  2004            2.9                 106.    20
 2 Austria  2005            3.1                 116     20
 3 Austria  2006            3.3                 126     20
 4 Austria  2007            3.3                 128.    20
 5 Austria  2008            3.3                 133.    20
 6 Austria  2009            3.5                 139.    20
 7 Austria  2010            3.5                 141.    20
 8 Austria  2011            3.6                 145     20
 9 Austria  2012            3.6                 147     20
10 Austria  2013            3.7                 153.    20
# ℹ 644 more rows
  • To drop grouping, we need to manually call ungroup().

Grouped mutate() vs. summarize()

Action               | mutate() with grouping | summarize()
---------------------|------------------------|--------------------------
Transformation       | Per group              | Per group
Assignment           | One value per group    | One value per group
Result rows          | All rows               | One row per group
Result columns       | All columns            | Grouping and new columns
Automated ungrouping | No                     | Yes
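A rough base R analogy of this contrast, on made-up toy data: ave() behaves like a grouped mutate() (one value broadcast to every row of a group), while tapply() behaves like summarize() (one value per group).

```r
toy <- data.frame(geo = c("A", "A", "B"), v = c(1, 3, 10))

# Like a grouped mutate(): the result has one value per row
ave(toy$v, toy$geo, FUN = mean)

# Like summarize(): the result has one value per group
tapply(toy$v, toy$geo, mean)
```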

Other summarizing transformations

  • Summarizing transformations go beyond calculating the number of observations.
  • We can calculate means, medians, standard deviations, and other summary statistics with similar approaches.
  • Using summarize() and collapsing each group into a single row.
  • Or using mutate() and keeping all rows.

Basic summary statistics

  • For example, we can calculate the average percentage of employment in the ICT sector for each country using the mean() function.

Basic summary statistics

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(avg = mean(ict_percentage))
# A tibble: 37 × 2
   geo                      avg
   <fct>                  <dbl>
 1 Austria                 3.90
 2 Bosnia and Herzegovina  1.73
 3 Belgium                 4.44
 4 Bulgaria                2.82
 5 Switzerland             5.08
 6 Cyprus                  2.94
 7 Czechia                 3.99
 8 Germany                 3.88
 9 Denmark                 4.84
10 Estonia                 4.51
# ℹ 27 more rows
  • Observe that summarize() works with base R functions like mean().
  • This is true for other base R functions like median(), sd(), min(), and max().
  • Thus, we can calculate summary statistics for each group in a dataset.

Basic summary statistics

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    nobs = dplyr::n(),
    min = min(ict_percentage),
    mean = mean(ict_percentage),
    median = median(ict_percentage),
    sd = sd(ict_percentage),
    max = max(ict_percentage)
  )
# A tibble: 37 × 7
   geo                     nobs   min  mean median    sd   max
   <fct>                  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
 1 Austria                   20   2.9  3.90   3.65 0.653   5.3
 2 Bosnia and Herzegovina     3   1.5  1.73   1.7  0.252   2  
 3 Belgium                   20   3.4  4.44   4.2  0.669   5.6
 4 Bulgaria                  20   2.2  2.82   2.5  0.587   4.3
 5 Switzerland               13   4.3  5.08   5    0.438   5.7
 6 Cyprus                    20   2.2  2.94   2.7  0.810   5.4
 7 Czechia                   20   3.4  3.99   3.95 0.439   4.6
 8 Germany                   20   3.1  3.88   3.7  0.550   5  
 9 Denmark                   20   4.1  4.84   4.8  0.481   5.9
10 Estonia                   20   2.5  4.51   4.1  1.29    6.7
# ℹ 27 more rows

Custom summary statistics

  • We can also use custom transformations.
ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    custom1 = sd(ict_percentage) / median(ict_percentage),
    custom2 = 2 * mean(ict_percentage) - 1
  )
# A tibble: 37 × 3
   geo                    custom1 custom2
   <fct>                    <dbl>   <dbl>
 1 Austria                 0.179     6.81
 2 Bosnia and Herzegovina  0.148     2.47
 3 Belgium                 0.159     7.89
 4 Bulgaria                0.235     4.65
 5 Switzerland             0.0876    9.15
 6 Cyprus                  0.300     4.87
 7 Czechia                 0.111     6.98
 8 Germany                 0.149     6.76
 9 Denmark                 0.100     8.68
10 Estonia                 0.314     8.03
# ℹ 27 more rows

Slicing data

  • Another useful data transformation operation is slicing.
  • Similar to filtering, slicing can be used to select a subset of the data.
  • Most slicing operations can also be written as filtering operations.
  • But slicing can be more convenient than filtering on some occasions.
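The correspondence can already be seen in base R indexing, on a toy data frame (made up for illustration): selecting rows by position and selecting rows by condition can pick out the same subset.

```r
toy <- data.frame(year = c(2004, 2005, 2006), v = c(1, 2, 3))

toy[1:2, ]               # slice-like: select rows by position
toy[toy$year <= 2005, ]  # filter-like: select rows by condition (same rows here)
```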

Slicing by position

  • For example, slicing can be used to get the first \(n\) rows within a group.
ict |>
  dplyr::group_by(geo) |>
  dplyr::arrange(year) |>
  dplyr::slice_head(n = 3)
# A tibble: 111 × 4
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons
   <fct>                  <int>          <dbl>                <dbl>
 1 Austria                 2004            2.9                106. 
 2 Austria                 2005            3.1                116  
 3 Austria                 2006            3.3                126  
 4 Bosnia and Herzegovina  2021            1.5                 17.6
 5 Bosnia and Herzegovina  2022            1.7                 19.3
 6 Bosnia and Herzegovina  2023            2                   24.3
 7 Belgium                 2004            3.4                143. 
 8 Belgium                 2005            3.5                147. 
 9 Belgium                 2006            3.7                156. 
10 Bulgaria                2004            2.2                 63.8
# ℹ 101 more rows

Slicing by position

  • Or to get the last \(n\) rows within a group.
ict |>
  dplyr::group_by(geo) |>
  dplyr::arrange(year) |>
  dplyr::slice_tail(n = 3)
# A tibble: 111 × 4
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons
   <fct>                  <int>          <dbl>                <dbl>
 1 Austria                 2021            4.5                192. 
 2 Austria                 2022            5                  221. 
 3 Austria                 2023            5.3                237. 
 4 Bosnia and Herzegovina  2021            1.5                 17.6
 5 Bosnia and Herzegovina  2022            1.7                 19.3
 6 Bosnia and Herzegovina  2023            2                   24.3
 7 Belgium                 2021            5.6                272. 
 8 Belgium                 2022            5.6                278. 
 9 Belgium                 2023            5.4                273. 
10 Bulgaria                2021            3.5                108  
# ℹ 101 more rows

Other slicing operations

  • Besides slice_head() and slice_tail(), there are other slicing operations available in dplyr.
  • slice_sample(): randomly selects \(n\) rows from each group (n = 1 by default).
  • slice_min(): returns the rows with the \(n\) smallest values of a variable in each group (n = 1 by default).
  • slice_max(): returns the rows with the \(n\) largest values of a variable in each group (n = 1 by default).

Slicing vs. filtering

ict |>
  dplyr::group_by(geo) |>
  dplyr::filter(ict_percentage == max(ict_percentage))
ict |>
  dplyr::group_by(geo) |>
  dplyr::slice_max(ict_percentage)
  • Both pipelines return, for each country, the row(s) attaining the maximum value of ict_percentage.

Accessing columns

  • On certain occasions, we want to access a column of a data frame and operate on it outside a dplyr pipeline.
  • There are multiple ways to access one or more columns of a data frame.
  1. Using the pull() function from the dplyr package.
  2. Using the $ operator.
  3. Using the [[ operator.
  4. Using the [ operator (for multiple columns).
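The last three options can be contrasted on a toy data frame (made up for illustration); note that $ and [[ return a vector, while [ returns a data frame:

```r
toy <- data.frame(geo = c("A", "B"), year = c(2022, 2023))

toy$geo                # vector
toy[["geo"]]           # vector, identical to toy$geo
toy["geo"]             # one-column data frame
toy[c("geo", "year")]  # two-column data frame
```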

Pulling a column

  • One approach, which we have already seen, is to use the pull() function from the dplyr package.
pulled_col <- gdp |>
  dplyr::pull(geo)
  • The variable pulled_col is a vector containing the values of the geo column of the gdp data frame.
  • Its length is equal to the number of rows of the gdp data frame.

Pulling a column

  • Its length is equal to the number of rows of the gdp data frame.
  • We can verify this by using the length() function from base R.
length(pulled_col) == nrow(gdp)
[1] TRUE
  • Verifying that pulled_col is identical to the geo column requires a bit more work.

Programming digression: all and any

  • In R, using one of the logical comparison operators, such as ==, to compare two vectors returns a logical vector.
v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

v1 <= v2
[1] TRUE TRUE TRUE
v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

v3 == v4
[1]  TRUE FALSE
  • Each element of the resulting vector is the result of the comparison of the corresponding elements of the input vectors.

Programming digression: all and any

  • Occasionally, we want to check whether a logical condition holds element-wise for all pairs of elements of two vectors.
  • The all() function from base R accepts a logical vector and returns TRUE if all elements are TRUE.
v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

all(v1 <= v2)
[1] TRUE
v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

all(v3 == v4)
[1] FALSE

Programming digression: all and any

  • The any() function from base R accepts a logical vector and returns TRUE if at least one element is TRUE.
v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

any(v1 <= v2)
[1] TRUE
v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

any(v3 == v4)
[1] TRUE
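One caveat worth keeping in mind: when the logical vector contains NA values, all() and any() may return NA rather than TRUE or FALSE, unless na.rm = TRUE is set.

```r
all(c(TRUE, NA))                 # NA: the answer is unknown
all(c(TRUE, NA), na.rm = TRUE)   # TRUE: NA values are dropped first
any(c(FALSE, NA), na.rm = TRUE)  # FALSE
```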

Accessing a column via $

  • The dollar operator $ is a special R operator acting on lists, data frames, and environments to extract or replace parts.
  • In the case of data frames, we can use $ to directly access the values of a column as a vector.
  • For example,
gdp$geo

Accessing a column via $

  • We can now verify that the pulled_col extracted with pull() is identical to the geo column of gdp.
all(pulled_col == gdp$geo)
[1] TRUE
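Base R also offers identical(), which performs this comparison in a single call. This is a sketch assuming pulled_col and gdp exist as defined above.

```r
# identical() is stricter than all(==): it also compares types
# and attributes, and, unlike ==, it does not propagate NA values.
identical(pulled_col, gdp$geo)
```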

Accessing a column via [[

  • Another way to access list or data frame parts is by using the double bracket [[ indexing operator.
  • For example,
gdp[["geo"]]

Accessing a column via [[

  • We can also verify that all three ways we have seen give equivalent results.
all(pulled_col == gdp[["geo"]])
[1] TRUE

Accessing columns via [

  • What if we want to access multiple columns at once?
  • We already know that we can use dplyr’s select() to select multiple columns.
  • Additionally, we can use the [ operator to access multiple columns of a data frame.

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
gdp[c("geo", "year")]
# A tibble: 36 × 2
   geo                                    year
   <chr>                                 <int>
 1 Austria                                2023
 2 Belgium                                2023
 3 Bulgaria                               2023
 4 Switzerland                            2023
 5 Cyprus                                 2023
 6 Czechia                                2023
 7 Germany                                2023
 8 Denmark                                2023
 9 Euro area - 19 countries  (2015-2022)  2023
10 Euro area – 20 countries (from 2023)   2023
# ℹ 26 more rows

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
  • We can verify that the result is equivalent to using select().
all(
  gdp[c("geo", "year")] == gdp |> dplyr::select(geo, year)
)
[1] TRUE

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
  • The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.
gdp["geo"]
# A tibble: 36 × 1
   geo                                  
   <chr>                                
 1 Austria                              
 2 Belgium                              
 3 Bulgaria                             
 4 Switzerland                          
 5 Cyprus                               
 6 Czechia                              
 7 Germany                              
 8 Denmark                              
 9 Euro area - 19 countries  (2015-2022)
10 Euro area – 20 countries (from 2023) 
# ℹ 26 more rows

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
  • The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.
  • Caution is advised when comparing against a vector:
all(gdp["geo"] == gdp$geo)
[1] TRUE
length(gdp["geo"]) == length(gdp$geo)
[1] FALSE
dim(gdp["geo"]) == dim(gdp$geo)
logical(0)

Programming digression: Set operations

  • We have already seen a method of getting the distinct values of a column via dplyr’s distinct().
  • We can achieve the same result using unique() from base R.
gdp_geo <- unique(gdp$geo)
  • And we can check that the two approaches give the same result.
y <- gdp |> dplyr::distinct(geo) |> dplyr::pull()
all(y == gdp_geo)
[1] TRUE

Programming digression: set operations

  • When working with vectors, it can be helpful to have some basic set operations in mind.
  • One of the commonly used operations is checking whether a vector contains an element.
  • This can be expressed syntactically using the %in% operator.
"USA" %in% gdp_geo
[1] FALSE
"Germany" %in% gdp_geo
[1] TRUE

Programming digression: set operations

  • We can take this one step further and examine whether all values in a vector belong to another vector.
  • In set jargon, we want to check whether a set \(S_{1}\) is a subset of another set \(S_{2}\).
  • This is true if and only if all elements of \(S_{1}\) belong to \(S_{2}\).

Programming digression: set operations

  • Using %in% with two vectors returns a new logical vector.
x <- c("USA", "Germany")
x %in% gdp_geo
[1] FALSE  TRUE
  • The resulting vector has a length equal to the length of the vector on the left-hand side of %in%.
  • Using %in% with a vector on the left-hand side checks elementwise if the values of the left vector can be found in the values of the right vector.

Programming digression: set operations

  • With this in mind, we can combine %in% with all to examine if a set is a subset of another.
x <- c("USA", "Germany")
all(x %in% gdp_geo)
[1] FALSE
  • How can we figure out which values of the left vector do not appear on the right?

Programming digression: set operations

  • How can we figure out which values of the left vector do not appear on the right?
  • Base R has a function setdiff() that does exactly this.
x <- c("USA", "Germany")
setdiff(x, gdp_geo)
[1] "USA"
  • Calling setdiff() returns the values of the first argument that do not appear in the second argument.

Programming digression: set operations

  • The order of the arguments is important in this case because the set difference operation is not symmetric.
x <- c("USA", "Germany")
y <- c("France", "Germany")
setdiff(x, y)
[1] "USA"
x <- c("USA", "Germany")
y <- c("France", "Germany")
setdiff(y, x)
[1] "France"

Programming digression: set operations

  • We use the intersect() function to find the common elements of two vectors.
  • Unlike setdiff(), the order of the arguments does not matter because the intersection operation is symmetric.
x <- c("USA", "Germany")
y <- c("France", "Germany")
intersect(x, y)
[1] "Germany"
x <- c("USA", "Germany")
y <- c("France", "Germany")
intersect(y, x)
[1] "Germany"

Programming digression: set operations

  • To get the combination of unique values, we use union().
  • Unions are symmetric, so any order of the arguments gives the same elements.
  • The order of the elements can be different, though.
x <- c("USA", "Germany")
y <- c("France", "Germany")
union(x, y)
[1] "USA"     "Germany" "France" 
x <- c("USA", "Germany")
y <- c("France", "Germany")
union(y, x)
[1] "France"  "Germany" "USA"    

Combining data

  • So far, we have created two data frames.
  • gdp: Contains growth rates and output data for EU and (some) non-EU countries.
  • ict: Contains ICT employment data for EU and (some) non-EU countries.

Data sources in the wild

  • We have arrived at one of the most important topics in data analysis.
  • While data in most educational settings is provided as a single unit, in real-world scenarios it is often scattered across multiple sources.
  • Much of the time in real-world data analysis work is spent on combining information from different sources.

Data sources in the wild

  • The process of combining data sources takes many names in the wild.
  • Sometimes combining data is referred to as merging, joining, or linking.
  • On some occasions, binding, appending, and concatenating are also used.
  • Even worse, the jargon is not unified across fields, and the same term can be used with different meanings.

Data sources in the wild

  • We will follow the join wording for two reasons:
    • First, it can precisely define the exact type of data combination we want to perform.
    • Second, it aligns with the semantics of Structured Query Language (SQL).
  • We examine a motivating problem.

Combining the gdp and ict data frames

  • We examine the columns of the gdp and ict data frames.
  • We can do this using the names() function from base R.
names(gdp)
[1] "geo"    "year"   "flag"   "output" "growth" "state"  "income"
names(ict)
[1] "geo"                  "year"                 "ict_percentage"      
[4] "ict_thousand_persons"
  • We observe that the columns geo and year are common in both data frames.

Combining the gdp and ict data frames

  • We observe that the columns geo and year are common in both data frames.
  • The columns geo and year are common, but their underlying values are not identical.

Combining the gdp and ict data frames

  • We observe that the columns geo and year are common in both data frames.
  • The columns geo and year are common, but their underlying values are not identical.
  • For instance, the two geo columns differences are:
setdiff(unique(gdp$geo), unique(ict$geo))
[1] "Euro area - 19 countries  (2015-2022)"
[2] "Euro area – 20 countries (from 2023)" 
setdiff(unique(ict$geo), unique(gdp$geo))
[1] "Bosnia and Herzegovina" "North Macedonia"        "United Kingdom"        

Combining the gdp and ict data frames

  • This already raises a few questions when combining the two data frames.
  1. Do we want to keep all the rows of the gdp data frame?
  2. Do we want to keep all the rows of the ict data frame?
  3. Do we want to keep only the matching rows of the two data frames?
  4. Do we want to keep all the rows of both data frames?

Combining gdp and ict: First approach

  1. We want to keep all the rows of the gdp data frame.
  • And add two new columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.
  • How can we handle the rows in gdp that do not have a corresponding row in ict?

Combining gdp and ict: First approach

  1. We want to keep all the rows of the gdp data frame.
  • And add two new columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.
  • If the geo and year values of a row in gdp are equal to the geo and year values of a row in ict, then copy the values.
  • If the geo and year values of a row in gdp cannot be found in ict, then assign NA.

Combining gdp and ict: First approach

  • We can achieve this using the left_join() function from the dplyr package.
gdp |>
  dplyr::left_join(ict)
  • In its most basic use case, left_join() takes two data frames, x and y.
  • It then returns a joined data frame that maintains all the rows and columns of x and adds the columns of y that do not exist in x.
  • The matching is done based on the common columns of the two data frames.
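By default, dplyr detects the join keys automatically from the common column names (and prints a message noting which keys it used). We can make the keys explicit with the by argument; this is a sketch assuming geo and year are the intended keys.

```r
# Explicit join keys: no automatic detection, no message.
gdp |>
  dplyr::left_join(ict, by = c("geo", "year"))
```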

Combining gdp and ict: First approach

  • We can achieve this using the left_join() function from the dplyr package.
gdp |>
  dplyr::left_join(ict)
# A tibble: 36 × 9
   geo                      year flag  output growth state income ict_percentage
   <chr>                   <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria                  2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium                  2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria                 2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland              2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus                   2023 p      29080    1   TRUE  middle            5.4
 6 Czechia                  2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany                  2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark                  2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Euro area - 19 countri…  2023 <NA>   32340   -0.3 FALSE middle           NA  
10 Euro area – 20 countri…  2023 <NA>   32150   -0.2 FALSE middle           NA  
# ℹ 26 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
  • First, we can check the dimensions of the resulting data frame using the dim() function.
gdp_ict_lj <- gdp |>
  dplyr::left_join(ict)
dim(gdp_ict_lj)
[1] 36  9
dim(gdp)
[1] 36  7

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
  • Second, we check that the resulting geo and year columns contain exactly the values of the gdp data frame.
all(gdp_ict_lj[c("geo", "year")] == gdp[c("geo", "year")])
[1] TRUE

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
  • Third, we check that the resulting ict_percentage and ict_thousand_persons columns are NA for the rows that do not have a corresponding row in the ict data frame.
  • To check this, we need to calculate the set difference between the unique combinations of geo and year in the gdp and ict data frames.
  • However, setdiff() works with vectors, not data frames.
  • An easy way to combine these columns is to use the paste() function.
  • Among other things, paste() accepts two or more vectors, converts them to strings, and concatenates them element-wise.
v1 <- c("Hello", "Hi")
v2 <- c("Earth", "Jupiter")
paste(v1, v2)
[1] "Hello Earth" "Hi Jupiter" 

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
gdp_ict_lj |>
  dplyr::filter(is.na(ict_thousand_persons))
# A tibble: 3 × 9
  geo                       year flag  output growth state income ict_percentage
  <chr>                    <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
1 Euro area - 19 countrie…  2023 <NA>   32340   -0.3 FALSE middle             NA
2 Euro area – 20 countrie…  2023 <NA>   32150   -0.2 FALSE middle             NA
3 Montenegro                2023 p       6900    3.7 TRUE  low                NA
# ℹ 1 more variable: ict_thousand_persons <dbl>
setdiff(paste(gdp$geo, gdp$year), paste(ict$geo, ict$year))
[1] "Euro area - 19 countries  (2015-2022) 2023"
[2] "Euro area – 20 countries (from 2023) 2023" 
[3] "Montenegro 2023"                           

Combining gdp and ict: Second approach

  2. We want to keep all the rows of the ict data frame.
  • And add five new columns flag, output, growth, state, and income with the values from the gdp data frame.

Combining gdp and ict: Second approach

  2. We want to keep all the rows of the ict data frame.
  • We can formulate this problem in terms of a left join by swapping gdp and ict in our previous approach.

Combining gdp and ict: Second approach

  2. We want to keep all the rows of the ict data frame.
ict |>
  dplyr::left_join(gdp)
# A tibble: 654 × 9
   geo      year ict_percentage ict_thousand_persons flag  output growth state
   <chr>   <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria  2004            2.9                 106. <NA>      NA     NA NA   
 2 Austria  2005            3.1                 116  <NA>      NA     NA NA   
 3 Austria  2006            3.3                 126  <NA>      NA     NA NA   
 4 Austria  2007            3.3                 128. <NA>      NA     NA NA   
 5 Austria  2008            3.3                 133. <NA>      NA     NA NA   
 6 Austria  2009            3.5                 139. <NA>      NA     NA NA   
 7 Austria  2010            3.5                 141. <NA>      NA     NA NA   
 8 Austria  2011            3.6                 145  <NA>      NA     NA NA   
 9 Austria  2012            3.6                 147  <NA>      NA     NA NA   
10 Austria  2013            3.7                 153. <NA>      NA     NA NA   
# ℹ 644 more rows
# ℹ 1 more variable: income <fct>

Combining gdp and ict: Second approach

  2. We want to keep all the rows of the ict data frame.
  • An equivalent way to achieve this is to use the right_join() function.

Combining gdp and ict: Second approach

  2. We want to keep all the rows of the ict data frame.
gdp |>
  dplyr::right_join(ict)
# A tibble: 654 × 9
   geo          year flag  output growth state income ict_percentage
   <chr>       <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria      2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium      2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria     2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland  2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus       2023 p      29080    1   TRUE  middle            5.4
 6 Czechia      2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany      2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark      2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Estonia      2023 <NA>   15250   -5.4 TRUE  middle            6.7
10 Greece       2023 p      19460    2.6 TRUE  middle            2.4
# ℹ 644 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining gdp and ict: Second approach

  2. We want to keep all the rows of the ict data frame.
  • Observe, however, that the row and column order resulting from a left_join() differs from that of a right_join().
  • Both left_join() and right_join() maintain the order of the rows and columns of the x argument.
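We can verify that the two formulations contain the same information, just in a different order. This is a sketch in the spirit of the earlier verification slides, assuming the geo-year pairs are unique in each data frame.

```r
# Same geo-year pairs in both results; only the ordering differs.
a <- ict |> dplyr::left_join(gdp)
b <- gdp |> dplyr::right_join(ict)
all(sort(paste(a$geo, a$year)) == sort(paste(b$geo, b$year)))
```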

Combining gdp and ict: Third approach

  3. We want to keep only the matching rows of the two data frames.
  • This approach is the most common type of join operation.
  • The left and right join operations introduce NA values in the resulting data frame if there are non-matching rows.
  • Statistical methods often do not work with NA values.
  • Thus, practically, we often want to keep only the matching rows.

Combining gdp and ict: Third approach

  3. We want to keep only the matching rows of the two data frames.
  • We can keep only the matching rows using the inner_join() function.

Combining gdp and ict: Third approach

  3. We want to keep only the matching rows of the two data frames.
gdp |>
  dplyr::inner_join(ict)
# A tibble: 33 × 9
   geo          year flag  output growth state income ict_percentage
   <chr>       <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria      2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium      2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria     2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland  2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus       2023 p      29080    1   TRUE  middle            5.4
 6 Czechia      2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany      2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark      2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Estonia      2023 <NA>   15250   -5.4 TRUE  middle            6.7
10 Greece       2023 p      19460    2.6 TRUE  middle            2.4
# ℹ 23 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining gdp and ict: Third approach

  3. We want to keep only the matching rows of the two data frames.
  • Similar to the left_join() and right_join() functions, the inner_join() function maintains the order of the rows and columns of the x argument.

Combining gdp and ict: Third approach

  3. We want to keep only the matching rows of the two data frames.
ict |>
  dplyr::inner_join(gdp)
# A tibble: 33 × 9
   geo        year ict_percentage ict_thousand_persons flag  output growth state
   <chr>     <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria    2023            5.3                237.  <NA>   37860   -1.8 TRUE 
 2 Belgium    2023            5.4                273.  p      37310    0.4 TRUE 
 3 Bulgaria   2023            4.3                126.  <NA>    7900    2.2 TRUE 
 4 Switzerl…  2023            5.7                273   p      63870   -0.8 TRUE 
 5 Cyprus     2023            5.4                 24.7 p      29080    1   TRUE 
 6 Czechia    2023            4.3                218.  <NA>   18480   -1.2 TRUE 
 7 Germany    2023            4.9               2108.  p      36290   -1.1 TRUE 
 8 Denmark    2023            5.9                177.  <NA>   52510    1.8 TRUE 
 9 Estonia    2023            6.7                 46.5 <NA>   15250   -5.4 TRUE 
10 Greece     2023            2.4                100.  p      19460    2.6 TRUE 
# ℹ 23 more rows
# ℹ 1 more variable: income <fct>

Combining gdp and ict: Fourth approach

  4. We want to keep all the rows of both data frames.
  • We can keep all the rows of both data frames using the full_join() function.

Combining gdp and ict: Fourth approach

  4. We want to keep all the rows of both data frames.
ict |>
  dplyr::full_join(gdp)
# A tibble: 657 × 9
   geo      year ict_percentage ict_thousand_persons flag  output growth state
   <chr>   <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria  2004            2.9                 106. <NA>      NA     NA NA   
 2 Austria  2005            3.1                 116  <NA>      NA     NA NA   
 3 Austria  2006            3.3                 126  <NA>      NA     NA NA   
 4 Austria  2007            3.3                 128. <NA>      NA     NA NA   
 5 Austria  2008            3.3                 133. <NA>      NA     NA NA   
 6 Austria  2009            3.5                 139. <NA>      NA     NA NA   
 7 Austria  2010            3.5                 141. <NA>      NA     NA NA   
 8 Austria  2011            3.6                 145  <NA>      NA     NA NA   
 9 Austria  2012            3.6                 147  <NA>      NA     NA NA   
10 Austria  2013            3.7                 153. <NA>      NA     NA NA   
# ℹ 647 more rows
# ℹ 1 more variable: income <fct>
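As a sanity check, the row counts of the joins satisfy the familiar inclusion-exclusion identity, assuming the geo-year keys are unique in both data frames.

```r
# |full| = |gdp| + |ict| - |matching rows|
nrow(dplyr::full_join(ict, gdp)) ==
  nrow(gdp) + nrow(ict) - nrow(dplyr::inner_join(gdp, ict))
```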

Rebel Talent

Why connect?

  • Gino, Kouchaki, and Casciaro (2023, Retracted) studied the effects of people’s mindset on their feelings about networking.
  • The investigated mindsets were:
    • Promotion focus: individuals primed to think about achieving gains in their career.
    • Prevention focus: individuals primed to think about avoiding losses in their career.

Why connect?

  1. Gino, Kouchaki, and Casciaro (2023, Retracted) split participants into three groups:
  • Promotion focus: participants asked to write about their hopes and aspirations
  • Prevention focus: participants asked to write about their duties and obligations
  • Control group: participants asked to write about their daily activities

Why connect?

  2. Then, participants were asked to imagine themselves at a networking event.
    • and rate and describe their feelings about networking.
    • e.g., “How dirty do you feel about attending this networking event?”

Why connect?

  • Gino, Kouchaki, and Casciaro (2023, Retracted) reported that
    • the prevention-primed participants felt (statistically) significantly dirtier about networking than the promotion-primed participants.

Rebel talent

  • In 2018, Zoe Ziani was a Ph.D. student at ESSEC Business School (Paris, France).
  • During her studies, she came across the article of Gino, Kouchaki, and Casciaro (2023, Retracted).
  • She noticed some irregularities in the data and the methodology used in the article.

Rebel talent

  • In her post-mortem blog post (Ziani 2023), Ziani describes her journey:

Rebel talent

  • In her post-mortem blog post (Ziani 2023), Ziani describes her journey:
  • Instead, Ziani included a 10-page critique of Gino, Kouchaki, and Casciaro (2023, Retracted) in her thesis, explaining why she chose not to rely on it in her research.
  • Two members of the thesis committee clarified that they would not sign off on her dissertation until all traces of the criticism of Gino, Kouchaki, and Casciaro (2023, Retracted) were removed!

Rebel talent

  • In her post-mortem blog post (Ziani 2023), Ziani describes her journey:
  • Ziani was unwavering and decided instead to approach Data Colada.

Data Colada

  • Data Colada is a blog that investigates the robustness of empirical findings in the social sciences.
  • Posts involve quantitative analyses and replications of published studies.

Data Colada

  • Data Colada investigated the article of Gino, Kouchaki, and Casciaro (2023, Retracted).

    • They replicated the study.
    • Looked at the data sources.
    • Gathered reports from whistleblowers.
    • And even checked the calcChain in the Excel file to find evidence of manipulation in different cells in the spreadsheet.

Data Colada

  • In 2021, they approached Harvard Business School (HBS), where Francesca Gino was a tenured professor.
  • The HBS dean, Srikant Datar, initiated an internal investigation to validate the findings.
  • The investigation (11 months and a confidential 1,300-page report) concluded that the data in the article was fabricated.
  • In addition, three other articles by Gino were found to show evidence of data fabrication.

Data Colada

Simonsohn, Nelson, and Simmons (2023)

Data Colada

Simonsohn, Nelson, and Simmons (2023)

The aftermath

  • Gino was placed on unpaid administrative leave by HBS.
  • Gino filed a lawsuit against both HBS and Data Colada for defamation.
  • The confidential report was made public after a court order.
  • Gino lost the case.

More on data visualization

Prerequisites

  • We examine some aspects of ggplot2 in more detail.
library(ggplot2)
  • We will use the eu_ict and ict data frames in our visualizations.
eu_ict <- readRDS("data/eu_ict.rds")
ict <- readRDS("data/ict.rds")

Programming digression: argument passing

  • So far, we have passed function arguments by specifying their names, but this is not a necessity.

Argument order and named arguments

  • There are three ways to pass arguments to functions in R.
  1. By exact matching.
  2. By partial matching.
  3. By position.

Exact matching

  1. By exact matching.
  • We specify the argument name and the value.
  • The names of the arguments are documented in the help files of the functions.
  • Documentation is available with ?function_name.

Exact matching

  1. By exact matching.
  • From ?ggplot, we see that the signature of ggplot() is:
Usage:

     ggplot(data = NULL, mapping = aes(), ..., 
            environment = parent.frame())
ggplot(
  data = eu_ict,
  mapping = aes(
    x = output, 
    y = ict_percentage
  )
) +
  geom_point()

Partial matching

  2. By partial matching.
  • We partially specify the argument name, together with the value.
  • We can provide the minimum number of initial characters (or more) that uniquely identify the argument.

Partial matching

  2. By partial matching.
ggplot(
  d = eu_ict,
  m = aes(
    x = output, 
    y = ict_percentage
  )
) +
  geom_point()

ggplot(
  d = eu_ict,
  map = aes(
    x = output, 
    y = ict_percentage
  )
) +
  geom_point()

Positional arguments

  3. By position
  • We specify only values.
  • Arguments are matched by their position in the function signature.

Positional arguments

  3. By position
Usage:

     ggplot(data = NULL, mapping = aes(), ..., 
            environment = parent.frame())
ggplot(
  eu_ict,
  aes(x = output, y = ict_percentage)
) +
  geom_point()

Positional arguments

  • The same applies to the aes() function.
Usage:

     aes(x, y, ...)
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point()

Why so many ways?

  • Each approach has pros and cons.

Why so many ways?

  • Exact matching is quite verbose.
  • But it is self-documenting and less error-prone.
  • It is a good practice to use it when readability is important. For example:
    • In scripts shared with others who are not familiar with the functions used.
    • When calling functions that you do not use frequently and might need to revisit after a long time.

Why so many ways?

  • Positional matching is concise.
  • It is a good practice to use it for commonly used functions where the risk of confusion is low. E.g.,

    ifelse(x > 10, "large", "small")

    instead of

    ifelse(test = x > 10, yes = "large", no = "small")
  • Using it in R’s command line for experimentation can be easier.

  • But it can make the code harder to read on its own.

Aesthetics mappings

Defining aesthetics

  • Aesthetics can be defined at various levels when creating a plot.
  • Globally in the ggplot() call.
ggplot(
  eu_ict,
  aes(
    output, ict_percentage, 
    color = income
  )
) +
  geom_point()

Defining aesthetics

  • Aesthetics can be defined at various levels when creating a plot.
  • Locally at the level of each layer.
ggplot(eu_ict) +
  geom_point(
    aes(
      output, ict_percentage, 
      color = income
    )
  )

Defining aesthetics

  • Aesthetics can be defined at various levels when creating a plot.
  • Mixed at the ggplot() and layer level.
  1. How does this work?
  2. Why does it work in this way?
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income))

Defining aesthetics: how

  1. How does this work?
  • Global assignments in ggplot() affect all layers in the plot.
  • Local assignments in geom_*() only apply to that layer.
  • Local assignments take precedence over (override) global assignments.
ggplot(
  eu_ict,
  aes(
    output, ict_percentage, 
    color = income
  )
) +
  geom_point(aes(color = EU))

Defining aesthetics: why

  2. Why does it work in this way?
  • We can specify the aesthetics of certain layers with attributes that we do not want to apply globally.

Defining aesthetics: why

  2. Why does it work in this way?
ggplot(
  eu_ict, 
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income)) +
  geom_smooth(method = "lm")

ggplot(
  eu_ict, 
  aes(
    output, ict_percentage, 
    color = income
  )
) +
  geom_point() +
  geom_smooth(method = "lm")

Defining aesthetics

  2. Why does it work in this way?
  • We also have the flexibility to extend the aesthetics of distinct geometric objects differently.
ggplot(
  eu_ict, aes(output, ict_percentage)
) +
  geom_point(
    aes(color = income, shape = EU)
  ) +
  geom_smooth(
    method="lm", aes(color = EU)
  )

Theming

  • Aesthetics modify the appearance of the plot’s data elements.
  • How can we modify the appearance of the plot’s non-data elements (e.g., axes, background, grid)?
  • The ggplot2 package provides a set of eight basic themes.

Basic themes

Basic themes

Basic themes

  • We can apply themes to the plot by adding them with the + operator.
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE
  ) +
  theme_classic()

ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE
  ) +
  theme_bw()

Basic themes

  • Themes only affect the non-data elements of the plot.
  • To modify the appearance of the data elements, we still need to use aesthetics.
ggplot(
  eu_ict,
  aes(output, ict_percentage)
) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE
  ) +
  theme_bw()

Basic themes

  • Themes only affect the non-data elements of the plot.
  • To modify the appearance of the data elements, we still need to use aesthetics.
ggplot(
  eu_ict,
  aes(output, ict_percentage, color = income, shape = EU)
) +
  geom_point() +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE,
    color = "black"
  ) +
  theme_bw() +
  scale_color_grey()

Basic themes

  • Themes only affect the non-data elements of the plot.
  • Here, we have combined the theme_bw() with scale_color_grey() to modify the appearance of the data elements.
  • In addition, we have explicitly specified the color of the geom_smooth() object to be black.
ggplot(
  eu_ict,
  aes(output, ict_percentage, color = income, shape = EU)
) +
  geom_point() +
  geom_smooth(
    aes(linetype = EU, group = EU),
    se = FALSE,
    color = "black"
  ) +
  theme_bw() +
  scale_color_grey()

Additional themes

  • If the basic themes do not meet the stylistic requirements that you want or are asked to follow, taking a look at the ggthemes package is a good idea.

Additional themes

  • The ggthemes package provides additional themes that might match the desired style.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  geom_smooth(aes(linetype = EU, group = EU), se = FALSE) +
  ggthemes::theme_excel_new()

Guides

  • Guides are reference lines, grids, or markers that assist in interpreting the geometric objects of a visualization.
  • Axes and legends are the two guides that are most commonly modified in a visualization to facilitate communication.

Axes

  • Axes are the typically horizontal and vertical lines that specify the coordinate system of the plot area.
  • Axes have breaks (ticks) and labels.
    • Breaks are the marked points of an axis.
    • Labels are the text accompanying the breaks, providing interpretation context for the axis.

Labels

Labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")

Labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")

Labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")
  • How can we modify the breaks of an axis?
  • How can we modify the labels of these breaks?

Labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")
  • Perhaps counterintuitively, the breaks and labels of an axis are not modified through labs().
  • This is because ggplot2 does some heavy lifting for us when drawing the axes of a plot.
  • Recall that we have used the same calling interface for creating plots with continuous and discrete axis variables (e.g., geom_point() and geom_bar()).

Scales

  • In the background, ggplot2 automatically adjusts the axes based on the type of the variable we provide.
  • It does so by using the scale_*() family of functions.
  • Scales are instructions controlling how certain aesthetic mappings are translated into visual properties.
  • For example, a continuous scale maps the values of an aesthetic to a continuous axis range.

Scales

  • In ggplot2, continuous variables in geom_point() objects are automatically assigned to a continuous scale, such as scale_x_continuous().
  • Discrete variables in geom_bar() objects are automatically assigned to a discrete scale, such as scale_x_discrete().
  • We can modify the default behavior and the appearance of axes by explicitly calling the scale_*() functions.
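To see that these defaults are just implicit function calls, the sketch below (using hypothetical stand-in data in place of the course's eu_ict) builds the same plot twice, once with and once without the explicit scales:

```r
library(ggplot2)

# Hypothetical stand-in data; the course uses eu_ict.
d <- data.frame(output = c(10, 20, 30), pct = c(3, 4, 5))

# Relying on the automatic scales...
p1 <- ggplot(d, aes(output, pct)) +
  geom_point()

# ...is equivalent to spelling them out explicitly.
p2 <- ggplot(d, aes(output, pct)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous()
```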

Breaks

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %")

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  labs(x = "Year", y = "ICT employment %") +
  scale_x_continuous(
    breaks = c(2004, 2014, 2023)
  )

  • We can pass the breaks we want on a continuous axis directly, using the breaks argument.
  • For instance, if we want to have all the years as breaks, we can pass the year column of the ict data frame.

Breaks

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line()  +
  labs(x = "Year", y = "ICT employment %") +
  scale_x_continuous(breaks = ict$year)

Breaks

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line()  +
  labs(x = "Year", y = "ICT employment %") +
  scale_x_continuous(breaks = ict$year)
  • In addition, if we want to modify the labels of the breaks, we can use the labels argument of scale_x_continuous().
  • Suppose, for example, that instead of having the years on the x-axis, we want to have labels formatted as Year YYYY, where YYYY is the year.

Breaks and their labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  scale_x_continuous(
    breaks = seq(2004, 2023, 2),
    labels = paste("Year", seq(2004, 2023, 2))
  )

Programming digression: creating sequences

  • We have used the seq() function to create the breaks and labels of the x-axis.
  • The seq() function creates sequences of numbers.
  • There are a few ways to create sequences in R.

Programming digression: creating sequences

  • The legacy way of creating sequences is to use the : operator.
  • The : operator is used with infix notation.
  • It takes two arguments, from and to, and creates a sequence of integers from from to to.
10:20
 [1] 10 11 12 13 14 15 16 17 18 19 20

Programming digression: creating sequences

  • The : operator has a few disadvantages.
  • First, it only creates sequences with a step of 1 (or -1, when from is greater than to).
  • Second, it can be error-prone when combined with arithmetic operations.
1:3 * 2
[1] 2 4 6
1:(3 * 2)
[1] 1 2 3 4 5 6

Programming digression: creating sequences

  • A safer and more flexible way to create sequences is to use the seq() function.
  • The seq() function can create sequences with an arbitrary step size.
seq(1, 3, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0

Programming digression: creating sequences

  • A safer and more flexible way to create sequences is to use the seq() function.
  • Or it can create sequences between two numbers with a specific length.
seq(1, 3, length.out = 10)
 [1] 1.000000 1.222222 1.444444 1.666667 1.888889 2.111111 2.333333 2.555556
 [9] 2.777778 3.000000

Programming digression: creating sequences

  • A safer and more flexible way to create sequences is to use the seq() function.
  • Further, compared to the : operator, there is less risk of confusion when combining seq() with arithmetic operations.
seq(1, 3) * 2
[1] 2 4 6
seq(1, 3 * 2)
[1] 1 2 3 4 5 6

Programming digression: creating sequences

  • There are two very useful siblings of seq(), named seq_along() and seq_len().

Programming digression: creating sequences

  • The seq_along() function creates a sequence of integers from 1 to the length of the input vector.
v1 <- c("a", "b", "c")
seq_along(v1)
[1] 1 2 3
  • This is useful when we want to enumerate the elements of a vector.
  • Compared to:
v1 <- c("a", "b", "c")
seq(length(v1))
[1] 1 2 3

Programming digression: creating sequences

  • The seq_len() function creates a sequence of integers from 1 to the input number.
seq_len(5)
[1] 1 2 3 4 5
  • It gives the same result as seq(1, 5).
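One practical reason to prefer these siblings is the empty-vector case: the : idiom silently counts downwards instead of producing an empty sequence. A base-R illustration:

```r
v_empty <- character(0)

# The : idiom expands to 1:0 and counts down -- a classic loop bug.
1:length(v_empty)   # [1] 1 0
# The siblings return the empty sequence we actually want.
seq_along(v_empty)  # integer(0)
seq_len(0)          # integer(0)
```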

Rotating breaks’ labels

  • When plotting high-frequency time-series data, the labels of the breaks on the horizontal axis can get crowded.
  • One common way to address this issue is to rotate the labels.
  • Rotating the labels does not affect the breaks or the labels themselves, only their orientation.
  • We can rotate the breaks’ labels using the theme() function.

Rotating breaks’ labels

  • The theme() function has (a lot of) options for modifying the plot’s theming.
  • We can modify the appearance of the axes using the axis.text.x and axis.text.y arguments.
  • The element_text() function is used to modify the appearance of the labels’ text.
  • Rotating the labels is done by setting the angle argument to the desired angle (in degrees).
  • The vjust and hjust arguments control the vertical and horizontal justification of the text.

Rotating breaks’ labels

ict |>
  dplyr::filter(grepl("Euro", geo)) |>
  ggplot(aes(year, ict_percentage)) +
  geom_line() +
  scale_x_continuous(breaks = ict$year, labels = ict$year) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5))

Legends

  • Another useful option exposed by theme() is the legend.position argument.
ict |>
  dplyr::filter(geo %in% sample(unique(ict$geo), size = 5)) |>
  ggplot(aes(year, ict_percentage, color = geo)) +
  geom_line() +
  theme(legend.position = "top")

Legends

  • The legend.position argument can take the following values:
    • "none": no legend is displayed.
    • "left", "right", "top", "bottom": the legend is displayed on the left, right, top, or bottom of the plot area.
    • "inside": the legend is displayed inside the plot area.

Legends

  • Besides customization via theme(), legends can be modified using the guides() function.
  • The guides() function offers more fine-grained control over the appearance of the legend.
  • For example, we can modify the number of rows or columns of the legend.
  • Or we can override the size and shape of the legend markers.

Legends

ict |>
  dplyr::filter(
    geo %in% sample(unique(ict$geo), size = 8)
  ) |>
  ggplot(
    aes(year, ict_percentage, color = geo)
  ) +
  geom_line() + 
  theme(legend.position = "top") +
  guides(
    color = guide_legend(
      title = "Country",
      nrow = 2,
      override.aes = list(linewidth = 4)
    )
  )

Statistical transformations

  • How are data mapped to geometric objects?
  • When creating a bar chart, we pass one column to geom_bar(), and the function automatically calculates the height of the bars.
  • When creating a density plot, we pass one column to geom_density(), and the function automatically calculates the density of the data.

Behind every geometric object

  • How are data mapped to geometric objects?
  • This pattern is common in ggplot2.
  • The passed data is transformed into a new form that is used to create the plot.
  • We examine some more details of the statistical transformations taking place behind the scenes when creating geometric objects.

The statistic behind geom_bar()

eu_ict |>
  ggplot(aes(income)) +
  geom_bar()

  • Where is the count variable of the vertical axis coming from?

The statistic behind geom_bar()

  • Where is the count variable of the vertical axis coming from?
eu_ict |>
  ggplot(aes(income)) +
  geom_bar()
  • We have never defined count as an aesthetic.
  • Even stranger, count is not among the columns of the eu_ict dataset.
names(eu_ict)
[1] "geo"            "EU"             "ict_percentage" "income"        
[5] "output"        

The statistic behind geom_bar()

  • Examining the documentation of geom_bar(), we observe that there is a stat argument that defaults to count.

Usage:

 geom_bar(
   mapping = NULL,
   data = NULL,
   stat = "count",
   position = "stack",
   ...,
   just = 0.5,
   width = NULL,
   na.rm = FALSE,
   orientation = NA,
   show.legend = NA,
   inherit.aes = TRUE
 )

The statistic behind geom_bar()

  • Behind the scenes, geom_bar() calculates the number of times each value of income is found in the data.
eu_ict |>
  dplyr::count(income)
# A tibble: 3 × 2
  income     n
  <fct>  <int>
1 low        3
2 middle    17
3 high      12
  • And then uses the new variable to set the heights.

The statistic behind geom_bar()

  • We can manually replicate the calculation and instruct geom_bar() not to perform any further transformation.
  • Instructing a geom_* function not to apply any statistical transformation to the input data is done by passing stat = "identity" to the function.
eu_ict |>
  dplyr::count(income) |>
  dplyr::rename(count = n) |>
  ggplot(aes(income, count)) +
  geom_bar(stat = "identity")
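As a side note, ggplot2 also provides geom_col() as a shorthand for geom_bar(stat = "identity"). A sketch with hypothetical stand-in counts:

```r
library(ggplot2)

# Hypothetical stand-in for the counted eu_ict data.
counts <- data.frame(
  income = c("low", "middle", "high"),
  count = c(3, 17, 12)
)

# geom_col() takes the bar heights from the data as-is,
# exactly like geom_bar(stat = "identity").
p <- ggplot(counts, aes(income, count)) +
  geom_col()
```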

The statistics behind geom_smooth()

  • Other geom_* functions calculate different statistics by default.
  • For instance, geom_smooth() calculates fitted values, standard errors, and confidence intervals.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  stat_smooth(method = "lm")

Programming digression: formulas

  • How can we replicate the geom_smooth()’s statistics?
  • The "lm" part of the method argument stands for linear model.
  • Linear models are statistical models in which the dependent variable is a linear function of the independent variables.
  • For example, classic linear regressions are linear models.

Programming digression: formulas

  • In R, we have a neat way to define statistical models using formulas.
  • Using formulas with ggplot2 and statistical functions allows us to focus on relationships in the data and leave the details of the statistical calculations to the functions.

Programming digression: formulas

  • A basic formula in R has two main parts:

    • The left-hand side (LHS) of the formula is the dependent variable.
    • The right-hand side (RHS) of the formula is the independent variable(s).
    • The two parts are separated by a tilde ~ symbol.
  • For example:
ict_percentage ~ output
ict_percentage ~ output

Programming digression: formulas

  • Had we had more than one independent variable, we would have written:
ict_percentage ~ output + ind_var2 + ind_var3
ict_percentage ~ output + ind_var2 + ind_var3
  • Note that we used the variables ind_var2 and ind_var3 in the formula, which are neither defined nor present in any of our datasets.
  • And R does not complain about it.

Programming digression: formulas

  • Had we had more than one independent variable, we would have written:
ict_percentage ~ output + ind_var2 + ind_var3
ict_percentage ~ output + ind_var2 + ind_var3
  • This is because the formula does not actually calculate anything.
  • It is an unevaluated expression that explains the logic of the model.
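Since a formula is just an unevaluated expression, we can inspect it like any other R object; a short base-R sketch:

```r
# A formula is an ordinary object of class "formula"; nothing is evaluated,
# so referencing the undefined ind_var2 causes no error.
f <- ict_percentage ~ output + ind_var2

class(f)     # "formula"
all.vars(f)  # "ict_percentage" "output" "ind_var2"
```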

Linear regressions

  • We can fit a linear model using formulas with the lm() function.
fit <- lm(ict_percentage ~ output, eu_ict)
  • The first argument of lm() is the formula we want to estimate.
  • The second argument is the dataset.
  • The lm() function automatically searches for the formula variables in the dataset and fits the model.

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • Symbolically, we estimated the model,

    \[ y_{i} = \beta_{0} + \beta_{1} x_{i} + \varepsilon_{i}, \]

    • \(y_{i}\) is the ict_percentage,
    • \(x_{i}\) is the output,
    • \(\beta_{0}\) is the intercept,
    • \(\beta_{1}\) is the slope, and
    • \(\varepsilon_{i}\) is the error term.

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the predicted values from the model?

    \[ \hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i} \]

pred_y <- predict(fit)
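We can confirm that predict() returns exactly \(\hat{\beta}_{0} + \hat{\beta}_{1} x_{i}\); a minimal sketch with synthetic data (hypothetical x and y, in place of eu_ict):

```r
set.seed(1)
# Hypothetical synthetic data; the course uses eu_ict.
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20)

fit <- lm(y ~ x, d)

# Rebuild the fitted values from the estimated coefficients.
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * d$x
all.equal(unname(predict(fit)), manual)  # TRUE
```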

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the confidence intervals?

\[ \mathrm{ci}(\hat{y}_{i}) = [\hat{y}_{i} - t_{\alpha/2} \, \sigma(\hat{y}_{i}),\ \hat{y}_{i} + t_{\alpha/2} \, \sigma(\hat{y}_{i})] \]

where

\[ \sigma(\hat{y}_{i}) = \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_{i} - \bar{x})^{2}}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}} \]
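Both formulas can be replicated by hand; a sketch with synthetic data (hypothetical x and y), assuming the default 95% confidence level, where \(t_{\alpha/2}\) has \(n - 2\) degrees of freedom:

```r
set.seed(1)
# Hypothetical synthetic data; the course uses eu_ict.
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20)

fit <- lm(y ~ x, d)
n <- nrow(d)

# sigma(y_hat_i) = sigma_hat * sqrt(1/n + (x_i - xbar)^2 / Sxx)
se_fit <- summary(fit)$sigma *
  sqrt(1 / n + (d$x - mean(d$x))^2 / sum((d$x - mean(d$x))^2))

# y_hat_i -/+ t_{alpha/2} * sigma(y_hat_i), default 95% level
t_crit <- qt(0.975, df = n - 2)
ci <- predict(fit, interval = "confidence")

all.equal(unname(ci[, "lwr"]), unname(predict(fit)) - t_crit * se_fit)  # TRUE
```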

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the confidence intervals?
pred_y <- predict(fit, interval = "confidence")
  • When passing interval = "confidence", the predict() function returns a matrix with three columns:
    • fit: the predicted values,
    • lwr: the lower bound of the confidence interval, and
    • upr: the upper bound of the confidence interval.
  • We can examine the first few rows with head().

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
  • How can we extract the confidence intervals?
pred_y <- predict(fit, interval = "confidence")
  • We can examine the first few rows with head().
head(pred_y)
       fit      lwr      upr
1 5.277672 4.814736 5.740609
2 5.251859 4.792837 5.710880
3 3.871531 3.201614 4.541448
4 6.498425 5.648143 7.348707
5 4.865592 4.427362 5.303822
6 4.368092 3.852660 4.883525
  • We can now replicate the geom_smooth()’s statistics.

Linear regressions

fit <- lm(ict_percentage ~ output, eu_ict)
pred_y <- predict(fit, interval = "confidence")
eu_ict |>
  dplyr::mutate(
    pred = pred_y[, "fit"],
    ymin = pred_y[, "lwr"],
    ymax = pred_y[, "upr"]
  ) |>
  ggplot(aes(output)) +
  geom_line(aes(y = pred), color = "blue", linewidth = 1) +
  geom_ribbon(
    aes(ymin = ymin, ymax = ymax),
    fill = "darkgray",
    alpha = 0.5
  )

eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  stat_smooth(method = "lm")

Text and annotations

  • A common pattern in data science visualizations is the use of annotations and text.
  • Text and annotations are commonly used to:
    • Provide context
    • Highlight specific data points
    • Explain the data
    • Add captions

Captions

  • Captions can be effortlessly added to a figure with labs().

Captions

ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = EU)) +
  geom_smooth(method = "lm") +
  labs(caption = "Data Source: Eurostat")

Captions

ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = EU)) +
  geom_smooth(method = "lm") +
  labs(caption = "Data Source: Eurostat")

Labels with Formulas

  • On some occasions, data scientists want to include the underlying mathematical details of their work in a visualization.
  • For example, suppose we are working with the function \[ f(x) = \sin(2x) + e^{-x/10} \cdot \cos(x) \]
  • We can create a plot of \(f\) with geom_line().

Labels with Formulas

\[ f(x) = \sin(2x) + e^{-x/10} \cdot \cos(x) \]

data <- data.frame(x = seq(0, 50, .01))
ggplot(data, aes(x = x)) +
  geom_line(aes(y = sin(2*x) + exp(-x/10) * cos(x)), color = "red")

Labels with Formulas

  • Notice that ggplot2 writes the formula expression as a string in the vertical axis label.
  • However, the formatting used is rather unusual for human readers.
  • For instance, multiplication is denoted with *, whereas in mathematical typography it is usually omitted.
  • We can use quote() in labs() to instruct ggplot2 to render the expression in more customary mathematical notation.

Labels with Formulas

data <- data.frame(x = seq(0, 50, .01))
ggplot(data, aes(x = x)) +
  geom_line(aes(y = sin(2*x) + exp(-x/10) * cos(x)), color = "red") +
  labs(x = quote(x), y = quote(sin(2 * x) + exp(-x/10) * cos(x)))

Annotations

  • Besides captions and labels, we can add text and markers directly into the main body of the plot.
  • These types of additions to a plot are called annotations.
  • An annotation is an additional piece of information that is added to a plot and facilitates the interpretation of its data elements.
  • In ggplot2, annotations and text can be added with annotate() and geom_text().

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point()

  • We start with the usual geom_point() plot.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(aes(label = geo))

  • We pass the label = geo aesthetic to geom_text() to create a text object using country names.
  • However, this creates a text object for all data points.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo)
  )

  • We override the data argument of geom_text() to filter only the non-EU countries.
  • This looks more like what we want to achieve.
  • Still, the text and the points are overlapping.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo),
    hjust = "left",
    vjust = "top"
  )

  • We use hjust = "left" and vjust = "top" to align the text to the top-left corner.
  • Better, but some extra spacing could improve the aesthetics.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo),
    hjust = "left",
    vjust = "top",
    nudge_y = -0.1
  )

  • We use nudge_y = -0.1 to move the text slightly below its data point.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.
eu_ict |>
  ggplot(aes(output, ict_percentage)) +
  geom_point() +
  geom_text(
    data = eu_ict |> 
      dplyr::filter(EU == "NON-EU"),
    aes(label = geo),
    hjust = "left",
    vjust = "top",
    nudge_y = -0.1,
    size = 4
  )

  • Finally, we can adjust the text size with the size argument.

Using geom_text(): Example 1

  • We want to textually highlight the non-EU countries in the eu_ict’s scatter plot.

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  )

  • Suppose we want to add the country names at the end of each regression line.
  • We can pick the maximum output value per group and use it as the x aesthetic in geom_text().

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  ) +
  geom_text(
    data = eu_ict |>
      dplyr::group_by(EU) |>
      dplyr::slice_max(output, n = 1)
  )
  • We use slice_max() to pick the maximum output value per group.
  • And override the data argument of geom_text() to use the sliced data.

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  ) +
  geom_text(
    data = eu_ict |>
      dplyr::group_by(EU) |>
      dplyr::slice_max(output, n = 1),
    aes(output, ict_percentage, label = EU)
  )

  • We pass the aesthetics we want to use in the mapping argument of geom_text().

Using geom_text(): Example 2

  • We want to textually highlight regression lines per group.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income)) +
  geom_smooth(
    aes(group = EU), method = "lm", se = FALSE
  ) +
  geom_text(
    data = eu_ict |>
      dplyr::group_by(EU) |>
      dplyr::slice_max(output, n = 1),
    aes(output, ict_percentage, label = EU),
    hjust = "left",
    vjust = "bottom",
    nudge_x = 1000
  )

  • And fine-tune the appearance of the text with hjust, vjust, and nudge_x.

Annotating

  • Another way to add text to a plot is with annotate().
  • In contrast to geom_text(), which is a geometric object, annotate() does not act on data points.
  • This means that annotate() does not require a data argument.
  • And it is more useful for adding small, data-independent elements to a plot.

Using annotate() for text

  • We want to add a label next to the richest EU country.
  • We start once more with a geom_point() scatter plot of the eu_ict’s income data.

Using annotate() for text

  • We want to add a label next to the richest EU country.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU))

Using annotate() for text

  • We want to add a label next to the richest EU country.
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(geom = "label")
  • And add an annotate() layer of geometric type "label".
  • On its own, this call raises an error when executed.
  • We need to provide the label text for the annotation.
  • And specify the position for the label using the x and y arguments.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(geom = "label")
  • We can do some preliminary data transformations to find the richest EU country.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output,
    y = richest$ict_percentage
  )

  • And use the calculated richest data to pass more information to annotate().

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output,
    y = richest$ict_percentage,
    hjust = "right"
  )

  • We can adjust the label’s text horizontal alignment.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output,
    y = richest$ict_percentage,
    hjust = "right"
  )

  • The annotate() function does not have nudge_* arguments (why?).
  • We can directly adjust the x and y positions to move the label around.

Using annotate() for text

  • We want to add a label next to the richest EU country.
richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 1000,
    y = richest$ict_percentage,
    hjust = "right"
  )

Using annotate() for text

Using annotate() for segments

  • The annotate() function is also very useful for adding segments and arrows to a plot.
  • The calling interface is mostly the same as for text annotations.
  • Instead of geom = "label", we use geom = "segment" to create annotations with segments and arrows.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  )

  • We nudge the label of the richest country a bit more to the left.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(geom = "segment")
  • And add a new annotate() layer with geom = "segment" to create a segment.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right"
  ) +
  annotate(geom = "segment")
  • Specifying a segment on a plane is equivalent to specifying two points.
  • The annotate() function expects two points to draw the segment.
  • The points are specified by the x, y, xend, and yend arguments.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right",
  ) +
  annotate(
    geom = "segment",
    x = richest$output - 8000,
    xend = richest$output - 500,
    y = richest$ict_percentage,
    yend = richest$ict_percentage
  )

  • This creates a segment connecting the two points, but without an arrowhead.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right",
  ) +
  annotate(
    geom = "segment",
    x = richest$output - 8000,
    xend = richest$output - 500,
    y = richest$ict_percentage,
    yend = richest$ict_percentage,
    arrow = arrow(type = "closed", length = unit(0.3, "cm"))
  )

  • We can add an arrowhead to the plot using the arrow argument.

Using annotate() for segments

richest <- eu_ict |>
  dplyr::filter(EU == "EU") |>
  dplyr::slice_max(output, n = 1)
ggplot(eu_ict, aes(output, ict_percentage)) +
  geom_point(aes(color = income, shape = EU)) +
  annotate(
    geom = "label",
    label = richest$geo,
    x = richest$output - 8000,
    y = richest$ict_percentage,
    hjust = "right",
  ) +
  annotate(
    geom = "segment",
    x = richest$output - 8000,
    xend = richest$output - 500,
    y = richest$ict_percentage,
    yend = richest$ict_percentage,
    arrow = arrow(type = "closed", length = unit(0.3, "cm")),
    color = "red"
  )

  • Finally, the color of the annotation can be adjusted with the color argument.

References

Aragão, Roberto, and Lukas Linsi. 2022. “Many Shades of Wrong: What Governments Do When They Manipulate Statistics.” Review of International Political Economy 29 (1): 88–113. https://doi.org/10.1080/09692290.2020.1769704.
Baldick, Chris. 2008. “Ekphrasis.” In The Oxford Dictionary of Literary Terms. Oxford University Press. https://www.oxfordreference.com/view/10.1093/acref/9780199208272.001.0001/acref-9780199208272-e-366.
Cui, Zheyuan, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. 2024. “The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers.” SSRN. https://doi.org/10.2139/ssrn.4945566.
Eurostat. 2025a. “Employed ICT Specialists by Sex.” https://doi.org/10.2908/isoc_sks_itsps.
———. 2025b. “Employment by Sex, Age and Citizenship.” https://doi.org/10.2908/isoc_sks_itsps.
———. 2025c. “Real GDP Per Capita.” https://doi.org/10.2908/sdg_08_10.
Gino, Francesca, Maryam Kouchaki, and Tiziana Casciaro. 2023. “Retracted Article: Why Connect? Moral Consequences of Networking with a Promotion or Prevention Focus.” Journal of Personality and Social Psychology 125 (3): 648–48. https://doi.org/10.1037/pspa0000351.
Simonsohn, Uri, Leif Nelson, and Joe Simmons. 2023. “[112] Data Falsificada (Part 4): Forgetting The Words.” Data Colada. https://datacolada.org/112.
Van Zee, Art. 2009. “The Promotion and Marketing of Oxycontin: Commercial Triumph, Public Health Tragedy.” American Journal of Public Health 99 (2): 221–27. https://doi.org/10.2105/AJPH.2007.131714.
Ziani, Zoé. 2023. “A Post Mortem on the Gino Case.” The Organizational Plumber. https://www.theorgplumber.com/posts/statement/.