An introduction to data visualization

A top-down introduction to the basic concepts of data visualization. Layers, aesthetics and geometric objects. Coloring. Scatter plots. Bars, histograms, and density plots. Box plots.

Published

May 1st, 2025

1 A Jump Start

  • We use a prepared dataset so that we can skip directly to the visualization part.
eu_ict <- readRDS("data/eu_ict.rds")

A Jump Start

  • In R, datasets are stored as data frames. A data frame is a rectangular arrangement of data. The columns correspond to data variables, and the rows correspond to observations.
  • In this example, we use an enhanced type of data frame called a tibble. Tibbles are a modern re-implementation of data frames that comes with the tidyverse ecosystem.

A Jump Start

  • The eu_ict data frame contains GDP values and ICT-sector employment percentages for 32 EU and non-EU countries in 2023.

A Jump Start

  • A variable (column in the data frame) is a characteristic of an object or entity that can be measured.

A Jump Start

  • An observation (row in the data frame) is a set of measurements for an object or entity collected under similar conditions.

A Jump Start

  • A value (cell in the data frame) is an instance of a variable for a particular observation.
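
The three notions above (variable, observation, value) can be illustrated with a small base R data frame. This is only a sketch: the `toy` data frame and its numbers are made up for illustration and are not taken from eu_ict.

```r
# Toy data frame with made-up values (NOT the real eu_ict data).
toy <- data.frame(
  geo = c("A", "B", "C"),
  ict_percentage = c(5.1, 4.2, 3.8),
  output = c(30000, 25000, 12000)
)

# A variable is a column of the data frame.
toy$ict_percentage

# An observation is a row of the data frame.
toy[2, ]

# A value is a single cell: one variable for one observation.
toy[2, "output"]
```

The same indexing also works on tibbles, although printing and some subsetting defaults differ slightly.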

2 A first scatter plot

  • We use ggplot2 to visualize the data.
  • The ggplot2 package is a flexible plotting system for R based on the grammar of graphics.
library(ggplot2)

2.1 An empty canvas

  • We start by creating an empty canvas for the visualization.
ggplot(data = eu_ict)

2.2 Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)
  • We specify the axes via the mapping argument.
  • The mapping is defined with the aes() function.
  • The name aes stands for aesthetics.

Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)
  • The name aes stands for aesthetics.
  • Aesthetics are visual properties of the visualization, such as position, color, shape, and size.

Adding Axes

ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
)
  • The name aes stands for aesthetics.
  • The x and y arguments specify the variables to be plotted on the horizontal and vertical axes, respectively.


2.3 Adding a layer

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  • Doing so requires adding a layer to the canvas.
  • Adding a layer is done via the + operator.

Adding a layer

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  • This code snippet, however, is incomplete.

Adding a layer

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  • Executing it in an R terminal changes the prompt from > to +.
> ggplot(
+   data = eu_ict,
+   mapping = aes(x = output, y = ict_percentage)
+ ) +

  • This indicates that the R interpreter is waiting for more input.
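
A complete version of the idea can be sketched by storing the canvas in a variable and then adding a layer with +. The `toy` data frame and the variable names `canvas`, `scatter`, and `points_drawn` are hypothetical and only serve the illustration; `layer_data()` is used here merely to inspect what a layer will draw.

```r
library(ggplot2)

# Toy data with made-up values (NOT the real eu_ict data).
toy <- data.frame(
  output = c(10, 20, 30, 40),
  ict_percentage = c(2.0, 3.1, 4.4, 5.2)
)

# A complete expression: the canvas with its axes, stored in a variable.
canvas <- ggplot(data = toy, mapping = aes(x = output, y = ict_percentage))

# The + operator needs a layer on its right-hand side to be complete.
scatter <- canvas + geom_point()

# layer_data() returns the data that the first layer will draw.
points_drawn <- layer_data(scatter, 1)
nrow(points_drawn)
```

Ending a line with + tells R that the expression continues, which is why the interpreter keeps waiting until a layer follows.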

2.4 Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • There are many functions in ggplot2, starting with geom_, that add different types of layers.
  • For example, geom_line(), geom_bar(), geom_boxplot(), etc.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • The geom prefix stands for geometric object.
  • A geometric object is a visual representation of (a subset of) the data.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • The geom prefix stands for geometric object.
  • The point suffix specifies we want to represent the data as points.

Adding a scatter plot

  • We aim to add a scatter plot on the canvas.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point()
  • We use the geom_point() function to add a scatter plot.
  • The pattern is similar for other functions in the geom_ family.
  • For example, geom_line gives geometric representations of the data as lines.


2.5 Coloring

  • We aim to colorize the points based on the EU membership.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point()
  • The EU variable of the eu_ict data frame is a categorical variable.

2.6 Programming digression: factor variables

  • The EU variable of the eu_ict data frame is a categorical variable.
  • Categorical variables in R are stored as factors.
print(eu_ict, n = 3)
# A tibble: 32 × 5
  geo      EU    ict_percentage income output
  <fct>    <fct>          <dbl> <fct>   <dbl>
1 Austria  EU               5.3 high    37860
2 Belgium  EU               5.4 high    37310
3 Bulgaria EU               4.3 low      7900
# ℹ 29 more rows


Programming digression: factor variables

  • The EU variable of the eu_ict data frame is a categorical variable.
  • Categorical variables in R are stored as factors.
  • Factor variables have levels that represent the different categories.
  • The EU factor variable has two levels: EU and non-EU.
  • We can colorize the points based on these levels.
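
As a small sketch of how factors and their levels behave, the following uses a made-up vector of membership labels (the `membership` name and its values are hypothetical, not the real EU column):

```r
# Made-up membership labels (NOT the real eu_ict data).
membership <- factor(c("EU", "non-EU", "EU", "EU"))

levels(membership)   # the distinct categories
nlevels(membership)  # how many categories there are
table(membership)    # number of observations per category
```

When such a factor is mapped to the color aesthetic, ggplot2 assigns one color per level.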

Coloring the points

  • We aim to colorize the points based on the EU membership.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point()

2.7 Adding curves

  • We aim to add linearly fitted lines to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point() +
  geom_smooth(method = "lm")
  • We add another layer using the + operator.
  • We use geom_smooth to add fitted lines.

Adding curves

  • We aim to add linearly fitted lines to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point() +
  geom_smooth(method = "lm")
  • We use geom_smooth to add fitted lines.
  • The geom_smooth() function can add different types of fitted lines.
  • The method argument specifies the type of fit.
  • In this case, we use method = "lm" to fit a linear model.

Adding curves

  • We aim to add linearly fitted lines to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage, color = EU)
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

2.8 Adding a single curve

  • We aim to add a single linearly fitted line to the scatter plot.
  • The last code chunk respected the coloring aesthetics we defined earlier.
  • What if we want to add a single fitted line to the scatter plot?

Adding a single curve

  • We aim to add a single linearly fitted line to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU)) +
  geom_smooth(method = "lm")
  • Aesthetics need not be defined globally for all layers.
  • We can define them locally for each layer.
  • We can use aes() to define aesthetics for each geometric object.
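
The difference between global and local aesthetics can be checked on a toy example. This is a sketch under assumptions: the `toy` data frame is made up, and `formula = y ~ x` and `se = FALSE` are set only to keep the example quiet and small. The group counts are read from the smoothing layer with `layer_data()`.

```r
library(ggplot2)

# Toy data with made-up values (NOT the real eu_ict data).
toy <- data.frame(
  output = c(10, 20, 30, 40, 15, 25, 35, 45),
  ict_percentage = c(2.0, 3.0, 4.5, 5.0, 3.0, 3.5, 5.0, 6.0),
  EU = factor(rep(c("EU", "non-EU"), each = 4))
)

# Global color mapping: every layer inherits it, so geom_smooth() fits per group.
p_global <- ggplot(toy, aes(x = output, y = ict_percentage, color = EU)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local color mapping: only the points are grouped, so a single line is fitted.
p_local <- ggplot(toy, aes(x = output, y = ict_percentage)) +
  geom_point(mapping = aes(color = EU)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the groups that the smoothing layer (layer 2) works with.
groups_global <- length(unique(layer_data(p_global, 2)$group))
groups_local <- length(unique(layer_data(p_local, 2)$group))
```

With the global mapping the smoothing layer sees two groups, whereas with the local mapping it sees one.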

Adding a single curve

  • We aim to add a single linearly fitted line to the scatter plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

2.9 Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
  • On some occasions, differentiating points based on colors might not be enough (e.g., printing in black and white).

Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
  • Instead, we can use different shapes to represent different categories.
  • Or, we can combine both color and shape aesthetics.

Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm")

Adding shape aesthetics

  • We aim to reshape point markers based on a categorical variable.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

2.10 Adding titles and labels

  • We aim to display some text information on the plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  )
  • We use the labs() function to add titles and labels.
  • We can additionally modify the legend via labs().

Adding titles and labels

  • We aim to display some text information on the plot.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  )
`geom_smooth()` using formula = 'y ~ x'


2.11 Color scaling

  • We aim to recolor the figure in grayscale.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  ) +
  scale_color_grey()
  • We can use the scale_color_grey() function.
  • There are other color scaling functions available, such as scale_color_brewer() and scale_color_continuous() in ggplot2, and scale_color_colorblind() in the ggthemes package.

Color scaling

  • We aim to recolor the figure in grayscale.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  ) +
  scale_color_grey()
`geom_smooth()` using formula = 'y ~ x'

  • The scale_color_grey() function did not recolor the fitted line (why?).
  • We can manually set the color of the fitted line.

Color scaling

  • We aim to recolor the figure in grayscale.
ggplot(
  data = eu_ict,
  mapping = aes(x = output, y = ict_percentage)
) +
  geom_point(mapping = aes(color = EU, shape = EU)) +
  geom_smooth(method = "lm", color = "darkgray") +
  labs(
    title = "ICT employment and Output",
    subtitle = "EU27 vs. non-EU27 countries",
    x = "Output per capita (EUR)",
    y = "ICT employment (percentage of total employment)",
    color = "Membership",
    shape = "Membership"
  ) +
  scale_color_grey()
`geom_smooth()` using formula = 'y ~ x'
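
One way to reason about the behavior above: a color scale only affects layers in which color is mapped to a variable inside aes(), while a color set as a constant outside aes() is used as is. The sketch below checks this on made-up data; the `toy` data frame and the helper names `point_colors` and `line_colors` are hypothetical.

```r
library(ggplot2)

# Toy data with made-up values (NOT the real eu_ict data).
toy <- data.frame(
  output = c(10, 20, 30, 40, 15, 25, 35, 45),
  ict_percentage = c(2.0, 3.0, 4.5, 5.0, 3.0, 3.5, 5.0, 6.0),
  EU = factor(rep(c("EU", "non-EU"), each = 4))
)

p <- ggplot(toy, aes(x = output, y = ict_percentage)) +
  # color is mapped to a variable, so the scale below applies to the points
  geom_point(mapping = aes(color = EU)) +
  # color is a constant here, so the scale does not touch the fitted line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "darkgray") +
  scale_color_grey()

# Colors actually used by each layer.
point_colors <- unique(layer_data(p, 1)$colour)
line_colors <- unique(layer_data(p, 2)$colour)
```

The points receive one gray shade per level of EU, and the fitted line keeps the single color that was set by hand.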

3 Visualizing empirical distributions

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.

3.1 Visualizing empirical distributions discretely

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.
  • Histograms and bar plots visualize empirical distributions as a series of bars.
  • Each bar represents a bin of data points.
  • The height of the bar represents the frequency of the data points in the corresponding bin.

3.2 Visualizing empirical distributions continuously

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.
  • Density plots visualize empirical distributions as a continuous curve.
  • The curve is an estimate of the population’s probability density function from which the sample was drawn.

3.3 Bars or densities?

  • We can visualize the empirical distribution of variables using histograms, bar plots, and density plots.
  • Bars can be used with both continuous (histogram) and categorical (bar plot) variables.
  • Density plots are more suitable for continuous variables.

3.4 A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()
  • We inform ggplot2 that we want to use the income variable via aes().
  • Since we only use one variable in the bar plot, we do not need to specify both the x and y aesthetics.

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()
  • We inform ggplot2 that we wish to create a bar plot using the geom_bar() function.

A first bar plot

  • We wish to create a bar plot of the income variable in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = income)) +
  geom_bar()

  • The income variable is categorical and has three levels: low, middle, and high.
  • The levels are depicted on the axis used in the aes() call.
  • The other axis depicts the number of observations in each level.

3.5 Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram()
  • We use the function geom_histogram() to create a histogram of a continuous variable.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • In contrast to categorical variables, for which the number of bins is determined by the number of levels, histograms of continuous variables can be created with different numbers of bins.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • By default, the geom_histogram() function uses 30 bins.
  • This might not be what we want.

Histograms of continuous variables

  • We aim to create a histogram of the (continuous) output variable.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_histogram(bins = 5)
  • We can use the bins argument of the geom_histogram() function to change the default behavior.
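
As a sketch of the two usual ways to control the binning, the example below sets either the number of bins (bins) or their width (binwidth), which is the alternative the ggplot2 message suggests. The `toy` values are made up, and `layer_data()` is used only to inspect the computed bins.

```r
library(ggplot2)

# Toy data with made-up values (NOT the real eu_ict data).
toy <- data.frame(output = c(8, 12, 15, 21, 24, 30, 33, 38, 41, 48))

# Fix the number of bins.
p_bins <- ggplot(toy, aes(x = output)) +
  geom_histogram(bins = 5)

# Alternatively, fix the width of each bin (in units of the x variable).
p_width <- ggplot(toy, aes(x = output)) +
  geom_histogram(binwidth = 10)

# One row per bin, with the number of observations in the count column.
bins_data <- layer_data(p_bins, 1)
width_data <- layer_data(p_width, 1)
```

Whichever argument is used, every observation falls into exactly one bin, so the counts add up to the number of rows.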


3.6 Density plots

  • An alternative way to visualize the distribution of a continuous variable is to use a density plot.
ggplot(data = eu_ict, mapping = aes(x = output)) +
  geom_density()
  • We use the function geom_density() to create a density plot.


Density plots

  • Unlike the histogram case, the vertical axis of the density plot does not measure frequencies of observations.
  • Very roughly, one can think of a density plot as showing smoothed-out counts of observations over bins with tiny widths, rescaled so that the area under the curve equals 1.
  • These values can be greater or smaller than 1.
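
The following sketch uses base R's density() as a stand-in for the kernel density estimate that geom_density() draws. The numbers in `x` are made up. The example approximates the area under the curve and looks at its highest value.

```r
# Made-up values on a small scale (NOT the real eu_ict data).
x <- c(0.043, 0.053, 0.054, 0.038, 0.061, 0.047, 0.050, 0.045)

# Kernel density estimate evaluated on a grid of x values.
d <- density(x)

# Trapezoidal approximation of the area under the estimated curve.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Highest point of the estimated curve.
peak <- max(d$y)
```

The area is close to 1, while the peak is far above 1 because the data are concentrated on a very narrow range.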

3.7 Ordering bar plots levels

  • The bar plot of the income variable we created earlier displays the levels in the order in which they are defined in the factor.
  • In some cases, we may want to have the bars ordered by their frequency.

Ordering bar plots levels

  • We can change the order in which the levels are displayed using the function fct_infreq.
  • The fct_infreq function reorders the levels of a factor based on their frequency.
  • Other options are fct_inorder, which orders levels in the order they appear in the data, and fct_inseq, which orders levels by the numeric value of their levels.
  • The fct_ family of functions is part of the forcats package (part of the tidyverse).

Ordering bar plots levels

  • We need to load the forcats package to use these functions.
library(forcats)
ggplot(data = eu_ict, mapping = aes(x = fct_infreq(income))) +
  geom_bar()
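
For intuition, ordering levels by frequency can also be sketched in base R, without forcats. This is not how fct_infreq is implemented; it is only a rough equivalent on made-up data, and the names `income`, `by_frequency`, and `income_by_freq` are hypothetical.

```r
# Made-up income levels (NOT the real eu_ict data).
income <- factor(
  c("low", "high", "middle", "high", "high", "middle"),
  levels = c("low", "middle", "high")
)

# Level names sorted from most to least frequent.
by_frequency <- names(sort(table(income), decreasing = TRUE))

# Rebuild the factor with the new level order; the values stay the same.
income_by_freq <- factor(income, levels = by_frequency)
levels(income_by_freq)
```

Only the order of the levels changes, and that order is what the bar plot uses on its axis.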

3.8 Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
  • How can we achieve this?
  • Creating a bar plot with the EU categorical variable does not provide income information.

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU)) +
  geom_bar()

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU)) +
  geom_bar()
  • We can ask ggplot2 to color the bars by the income variable.

Coloring bar plots

  • Let us plot the distribution of income levels by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()
  • Setting the fill aesthetic to income splits each bar into colored segments, one per income level, with heights given by the number of countries in that level.


Coloring bar plots

  • Let us plot the distribution of income level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()

  • One issue with the last plot is that the split of countries by EU membership is very unbalanced.

Coloring bar plots

  • Let us plot the distribution of income level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()

  • There are many more EU countries than non-EU countries in the dataset, making comparisons challenging.

Coloring bar plots

  • Let us plot the distribution of income level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar()

  • We can ask ggplot2 to normalize the bar heights and color by the income shares within each EU membership category.

Coloring bar plots

  • Let us plot the distribution of income level shares by EU membership in the eu_ict dataset.
ggplot(data = eu_ict, mapping = aes(x = EU, fill = income)) +
  geom_bar(position = "fill")
  • We can achieve this by setting the position argument of the geom_bar() function to fill.
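
The shares that the normalized bars display can be sketched by hand with base R tables. The `toy` data frame below is made up and much smaller than eu_ict; `counts` and `shares` are hypothetical helper names.

```r
# Toy data with made-up values (NOT the real eu_ict data).
toy <- data.frame(
  EU = factor(c("EU", "EU", "EU", "EU", "non-EU", "non-EU")),
  income = factor(
    c("high", "high", "middle", "low", "high", "middle"),
    levels = c("low", "middle", "high")
  )
)

# Number of countries per membership category (rows) and income level (columns).
counts <- table(toy$EU, toy$income)

# Shares within each membership category: every row sums to 1.
shares <- prop.table(counts, margin = 1)
shares
```

Normalizing within each category is what makes the two bars equally tall, so that their compositions can be compared despite the unequal group sizes.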


3.9 Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
  • We can plot the density of ict_percentage using the geom_density() function.
  • One idea is to instruct ggplot2 to colorize based on income using the color aesthetic.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income
  )
) +
  geom_density()

  • We can further highlight the plot’s densities by using a fill aesthetic instead of or alongside color.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income,
    fill = income
  )
) +
  geom_density()

  • We can make the plot easier to read by using a fill aesthetic instead of or alongside color.
  • However, with overlapping densities, it is difficult to distinguish the density shape of each group.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income,
    fill = income
  )
) +
  geom_density()

  • We can pass an alpha (transparency) value to the geom_density() function to make the plot more readable.

Coloring density plots

  • We would like to plot the distributions of ict_percentage per income group.
ggplot(
  data = eu_ict,
  mapping = aes(
    x = ict_percentage,
    color = income,
    fill = income
  )
) +
  geom_density(alpha = 0.5)

  • We can pass an alpha (transparency) value to the geom_density() function to make the plot more readable.
  • The alpha value ranges from 0 (completely transparent) to 1 (completely opaque).

4 A first box plot

  • We would like to concisely visualize the basic statistics of ict_percentage per income group.
  • The median and the first and third quartiles are common statistics of interest.

4.1 Box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
  • The box plot is a visualization method that demonstrates the location, spread, skewness, and outliers of a variable.
  • Compared to density plots, box plots explicitly display the median, quartiles, and outliers of a variable.

Box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
  • The first quartile (or the 25th percentile) is the value below which 25% of the data falls.
  • The second quartile (or the median) is the value below which 50% of the data falls.
  • The third quartile (or the 75th percentile) is the value below which 75% of the data falls.
  • The difference between the third and first quartiles is called the interquartile range (IQR).

Box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
  • An outlier is a value that is very different from the remaining values of a variable.
  • There are different thresholds to define outliers, e.g., the values outside the 1st and 99th percentiles, but no single definition is appropriate for every dataset.
  • By default, box plots in ggplot2 define outliers as values that are more than 1.5 times the IQR below the first quartile or above the third quartile.
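
The 1.5 times IQR rule can be worked through by hand on a small made-up vector. This is a sketch: it relies on the default quartile definition of quantile(), and other quartile definitions can give slightly different thresholds.

```r
# Made-up values (NOT the real eu_ict data).
x <- c(2.1, 3.0, 3.4, 3.9, 4.2, 4.6, 5.0, 5.3, 9.8)

# First quartile, median, and third quartile.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range: third quartile minus first quartile.
iqr <- q[["75%"]] - q[["25%"]]

# Thresholds beyond which a value is flagged as an outlier.
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr

# Values falling outside the thresholds.
outliers <- x[x < lower | x > upper]
outliers
```

Here the quartiles are 3.4, 4.2, and 5.0, so the IQR is 1.6 and the thresholds are roughly 1.0 and 7.4; only the value 9.8 is flagged.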

4.2 Visualizing box plots

  • We want to concisely visualize the basic statistics of ict_percentage per income group.
ggplot(data = eu_ict, mapping = aes(x = income, y = ict_percentage)) +
  geom_boxplot()
  • Creating a box plot in ggplot2 uses the geom_boxplot function.
  • We need to specify the x and y aesthetics to define the variables to be plotted.


Visualizing box plots

  • The median is displayed as a solid horizontal line inside each box.
  • The first and third quartiles are displayed as the lower and upper edges of each box.
  • Whiskers extend from the box to the most extreme values that are not classified as outliers.
  • Outliers are displayed as individual points.
  • Skewness is indicated by the position of the median relative to the quartiles.