More on data transformation

A follow-up look on data transformation topics. Boolean columns. Discretizing continuous variables. Counting observations. Groupwise operations. Summarizing. Slicing operations. Accessing columns. Combining data with joins.

Published

May 29th, 2025

1 Prerequisites

  • We examine some aspects of dplyr in more detail.
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Prerequisites

sdg <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

1.1 Programming digression: R namespaces

  • The snippet for importing the sdg data slightly differs from the code we used in the data import overview topic.

Programming digression: R namespaces

sdg <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • vs:
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

Programming digression: R namespaces

  • We have modified all calls to the readr package’s functions!
    • read_csv \(\rightarrow\) readr::read_csv
    • col_factor \(\rightarrow\) readr::col_factor
    • col_integer \(\rightarrow\) readr::col_integer
    • col_datetime \(\rightarrow\) readr::col_datetime
  • However, we have never loaded the readr package!
  • And the two snippets execute exactly the same underlying operations!

Programming digression: R namespaces

  • Many programming languages, but not all, have a namespace feature.
  • A namespace is a collection of identifiers, such as function names, variable names, and other symbols, within a single scope (or context).

Programming digression: R namespaces

  • Organizing and managing function and object names in namespaces is a way to avoid conflicts between different modules.

Programming digression: R namespaces

  • Each package in R has its own namespace.
  • When importing a package with library(), all the functions and objects in the package’s namespace become available globally.
  • This is why, after library(dplyr), we can use select(), filter(), etc., without specifying the package.

Programming digression: R namespaces

  • Why would one want to use readr::read_csv instead of read_csv?
    • After all, read_csv is shorter and easier to type.
  1. Readability: It is easier to understand the code.
  2. Avoiding conflicts: If two packages have functions with the same name, you can specify which one you want to use.

Programming digression: R namespaces

  1. Readability: It is easier to understand the code.
  • First, readr::read_csv informs directly the reader where the read_csv function comes from.
  • This is relevant both for your coworkers when you share your code with them and for your future self reading the code.
  • There is no need to scroll up to see which packages are loaded.

Programming digression: R namespaces

  1. Avoiding conflicts: If two packages have functions with the same name, you can specify which one you want to use.
  • Let’s revisit what happens when we load the dplyr package for the first time in an R session.

Programming digression: R namespaces

  • Let’s revisit what happens when we load the dplyr package for the first time in an R session.
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Programming digression: R namespaces

  • If instead of loading dplyr, we use its functions with the dplyr:: prefix, we avoid masking functions from other namespaces.
  • This is less error-prone because we avoid scenarios where we loaded a package and forgot it masks a function from another package we also use.
  • However, it makes dependency management a bit more cumbersome because the loaded packages are not listed at a single point in the code.

2 Creating boolean variables

  • A brief reminder of where we left off:
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)
  • We have used mutate() to modify the values of the unit column.
  • Now, we will use mutate() to create the new columns.

2.1 Storing an intermediate result

  • A brief reminder of where we left off:
sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)
# A tibble: 36 × 5
   geo                                    year flag  output growth
   <chr>                                 <int> <chr>  <dbl>  <dbl>
 1 Austria                                2023 <NA>   37860   -1.8
 2 Belgium                                2023 p      37310    0.4
 3 Bulgaria                               2023 <NA>    7900    2.2
 4 Switzerland                            2023 p      63870   -0.8
 5 Cyprus                                 2023 p      29080    1  
 6 Czechia                                2023 <NA>   18480   -1.2
 7 Germany                                2023 p      36290   -1.1
 8 Denmark                                2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2
# ℹ 26 more rows

Storing an intermediate result

  • To facilitate the presentation, we assign the intermediate result to a new variable.
sdg_temp <- sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

2.2 Adding the state boolean variable

  • We observe that the geo column contains both states and coalitions of states.
  • We can use the function pull() from dplyr to examine the values of geo.
  • There are other ways to access the values of a data frame column, but pull() works well with the pipe operator and, hence, stacks well together with other dplyr transformations, such as distinct() and arrange().

Adding the state boolean variable

sdg_temp |>
  distinct(geo) |>
  arrange(geo) |>
  pull()
 [1] "Austria"                                  
 [2] "Belgium"                                  
 [3] "Bulgaria"                                 
 [4] "Croatia"                                  
 [5] "Cyprus"                                   
 [6] "Czechia"                                  
 [7] "Denmark"                                  
 [8] "Estonia"                                  
 [9] "Euro area - 19 countries  (2015-2022)"    
[10] "Euro area – 20 countries (from 2023)"     
[11] "European Union - 27 countries (from 2020)"
[12] "Finland"                                  
[13] "France"                                   
[14] "Germany"                                  
[15] "Greece"                                   
[16] "Hungary"                                  
[17] "Iceland"                                  
[18] "Ireland"                                  
[19] "Italy"                                    
[20] "Latvia"                                   
[21] "Lithuania"                                
[22] "Luxembourg"                               
[23] "Malta"                                    
[24] "Montenegro"                               
[25] "Netherlands"                              
[26] "Norway"                                   
[27] "Poland"                                   
[28] "Portugal"                                 
[29] "Romania"                                  
[30] "Serbia"                                   
[31] "Slovakia"                                 
[32] "Slovenia"                                 
[33] "Spain"                                    
[34] "Sweden"                                   
[35] "Switzerland"                              
[36] "Türkiye"                                  

Adding the state boolean variable

  • We want to create a state binary variable with the value TRUE if the observation corresponds to a state, and the value FALSE otherwise.
  • Since all coalition observations include the string "Euro", we can easily create the new boolean variable using grepl.

Adding the state boolean variable

  • Since all coalition observations include the string "Euro", we can easily create the new boolean variable using grepl.
sdg_temp |>
  mutate(state = !grepl("Euro", geo))
# A tibble: 36 × 6
   geo                                    year flag  output growth state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>
 1 Austria                                2023 <NA>   37860   -1.8 TRUE 
 2 Belgium                                2023 p      37310    0.4 TRUE 
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE 
 4 Switzerland                            2023 p      63870   -0.8 TRUE 
 5 Cyprus                                 2023 p      29080    1   TRUE 
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE 
 7 Germany                                2023 p      36290   -1.1 TRUE 
 8 Denmark                                2023 <NA>   52510    1.8 TRUE 
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE
# ℹ 26 more rows

Adding the state boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo))
# A tibble: 36 × 6
   geo                                    year flag  output growth state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>
 1 Austria                                2023 <NA>   37860   -1.8 TRUE 
 2 Belgium                                2023 p      37310    0.4 TRUE 
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE 
 4 Switzerland                            2023 p      63870   -0.8 TRUE 
 5 Cyprus                                 2023 p      29080    1   TRUE 
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE 
 7 Germany                                2023 p      36290   -1.1 TRUE 
 8 Denmark                                2023 <NA>   52510    1.8 TRUE 
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE
# ℹ 26 more rows
  • The new variable is placed by default as the last column of the data frame.

Adding the state boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo), .after = "geo")
# A tibble: 36 × 6
   geo                                   state  year flag  output growth
   <chr>                                 <lgl> <int> <chr>  <dbl>  <dbl>
 1 Austria                               TRUE   2023 <NA>   37860   -1.8
 2 Belgium                               TRUE   2023 p      37310    0.4
 3 Bulgaria                              TRUE   2023 <NA>    7900    2.2
 4 Switzerland                           TRUE   2023 p      63870   -0.8
 5 Cyprus                                TRUE   2023 p      29080    1  
 6 Czechia                               TRUE   2023 <NA>   18480   -1.2
 7 Germany                               TRUE   2023 p      36290   -1.1
 8 Denmark                               TRUE   2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022) FALSE  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)  FALSE  2023 <NA>   32150   -0.2
# ℹ 26 more rows
  • We can modify this default behavior by passing the argument .after to mutate().

Adding the state boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo), .after = "geo")
# A tibble: 36 × 6
   geo                                   state  year flag  output growth
   <chr>                                 <lgl> <int> <chr>  <dbl>  <dbl>
 1 Austria                               TRUE   2023 <NA>   37860   -1.8
 2 Belgium                               TRUE   2023 p      37310    0.4
 3 Bulgaria                              TRUE   2023 <NA>    7900    2.2
 4 Switzerland                           TRUE   2023 p      63870   -0.8
 5 Cyprus                                TRUE   2023 p      29080    1  
 6 Czechia                               TRUE   2023 <NA>   18480   -1.2
 7 Germany                               TRUE   2023 p      36290   -1.1
 8 Denmark                               TRUE   2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022) FALSE  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)  FALSE  2023 <NA>   32150   -0.2
# ℹ 26 more rows
  • Observe that we have used the exclamation mark operator ! before grepl().

2.3 Programming digression: Combining logical conditions

  • Observe that we have used the exclamation mark operator ! before grepl().
  • Logical conditions in R (as in most, if not all, programming languages) can be combined or negated.

Programming digression: Combining logical conditions

  • The ! operator negates the logical value of the condition.
!TRUE
[1] FALSE
!FALSE
[1] TRUE
!(2 > 1)
[1] FALSE
!grepl("Sanna", "Sanna is here")
[1] FALSE

Programming digression: Combining logical conditions

  • Observe that we have used the exclamation mark operator ! before grepl().
  • The call grepl("Euro", geo) returns TRUE if the string "Euro" is found in the geo column and FALSE otherwise.
  • We instead want a condition that is TRUE when geo is a state, i.e., when "Euro" is not found in the geo column.
  • Thus, we have to negate the result of grepl("Euro", geo).

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • For example, suppose we want to create a new variable that is TRUE when geo is a state and output is above 50,000.
  • We already know how to check if geo is a state with !grepl("Euro", geo).
  • We can check if output is above 50,000 with output > 50000.

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • For example, suppose we want to create a new variable that is TRUE when geo is a state and output is above 50,000.
  • Combining these with a logical AND operator gives us the desired condition.
  • In R, the logical AND operator is &.

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
sdg_temp |>
  mutate(rich_state = !grepl("Euro", geo) & output > 50000)
# A tibble: 36 × 6
   geo                                    year flag  output growth rich_state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>     
 1 Austria                                2023 <NA>   37860   -1.8 FALSE     
 2 Belgium                                2023 p      37310    0.4 FALSE     
 3 Bulgaria                               2023 <NA>    7900    2.2 FALSE     
 4 Switzerland                            2023 p      63870   -0.8 TRUE      
 5 Cyprus                                 2023 p      29080    1   FALSE     
 6 Czechia                                2023 <NA>   18480   -1.2 FALSE     
 7 Germany                                2023 p      36290   -1.1 FALSE     
 8 Denmark                                2023 <NA>   52510    1.8 TRUE      
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE     
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE     
# ℹ 26 more rows

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • We can also combine conditions using logical OR operators.
  • In R, logical OR is denoted with |.

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
  • For example, we can create a logical condition that is TRUE when geo is either Germany or Cyprus.
  • We check if geo is Germany with geo == "Germany".
  • We check if geo is Cyprus with geo == "Cyprus".
  • Combined, we write geo == "Germany" | geo == "Cyprus".

Programming digression: Combining logical conditions

  • Besides negation, logical conditions can be combined to form more complex conditions.
sdg_temp |>
  mutate(ger_or_cy = geo == "Germany" | geo == "Cyprus")
# A tibble: 36 × 6
   geo                                    year flag  output growth ger_or_cy
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>    
 1 Austria                                2023 <NA>   37860   -1.8 FALSE    
 2 Belgium                                2023 p      37310    0.4 FALSE    
 3 Bulgaria                               2023 <NA>    7900    2.2 FALSE    
 4 Switzerland                            2023 p      63870   -0.8 FALSE    
 5 Cyprus                                 2023 p      29080    1   TRUE     
 6 Czechia                                2023 <NA>   18480   -1.2 FALSE    
 7 Germany                                2023 p      36290   -1.1 TRUE     
 8 Denmark                                2023 <NA>   52510    1.8 FALSE    
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE    
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE    
# ℹ 26 more rows

3 Discretizing continuous variables

  • We have extensively used the income variable in the visualization overview topic to color and shape various geometric objects.
  • The income variable was a categorical variable with three levels: low, middle, and high.
  • How can we create it?

3.1 Adding the income categorical variable: Attempt 1 (non-reusable)

  • We want to create a factor variable income with three levels: low, middle, and high.
  • One idea is to pick two output thresholds.
    • A low threshold, below which we consider the output as low.
    • A high threshold, above which we consider the output as high.
    • And use nested ifelse() statements to create the income variable.

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We want to create a factor variable income with three levels: low, middle, and high.
  • Let us pick 15,200 and 37,400 as the low and high output thresholds, respectively.
if (output > 37400) {
  "high"
} else if (output > 15200) {
  "middle"
} else {
  "low"
}

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We want to create a factor variable income with three levels: low, middle, and high.
  • Let us pick 15,200 and 37,400 as the low and high output thresholds, respectively.
  • Or, with inline code:
ifelse(output > 37400, "high", ifelse(output > 15200, "middle", "low"))

Adding the income categorical variable: Attempt 1 (non-reusable)

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = ifelse(
      output > 37400,
      "high",
      ifelse(output > 15200, "middle", "low")
    )
  )
# A tibble: 36 × 7
   geo                                    year flag  output growth state income
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl> <chr> 
 1 Austria                                2023 <NA>   37860   -1.8 TRUE  high  
 2 Belgium                                2023 p      37310    0.4 TRUE  middle
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE  low   
 4 Switzerland                            2023 p      63870   -0.8 TRUE  high  
 5 Cyprus                                 2023 p      29080    1   TRUE  middle
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE  middle
 7 Germany                                2023 p      36290   -1.1 TRUE  middle
 8 Denmark                                2023 <NA>   52510    1.8 TRUE  high  
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE middle
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE middle
# ℹ 26 more rows

Adding the income categorical variable: Attempt 1 (non-reusable)

  • We have successfully created the income variable.
  • Nonetheless, this approach is not scalable.
  • First, we have hard-coded the thresholds.
  • Second, we have apparently picked the thresholds arbitrarily.
  • Third, what if we want to create a variable with more levels?
  • Fourth, what if we want to use the solution with another dataset?

Adding the income categorical variable: Attempt 1 (non-reusable)

  • A better approach is to use the cut() and quantile() functions.
  • A country is considered low income if its output is below the 25th percentile.
  • A country is considered high income if its output is above the 75th percentile.
  • And a country is considered middle income otherwise.

3.2 Using quantile()

  • We can take a look at how to use the quantile() function by accessing its documentation.
?quantile

Using quantile()

?quantile

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • The x argument in our case is the output variable.

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • We need to use the probs argument to specify the percentiles we want to use.

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • The default value of the probs argument is to set the quartile probabilities \([0, 0.25, 0.5, 0.75, 1]\) (why?).

Using quantile()

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)
  • We want only the 25th and 75th percentiles as thresholds. Thus, we set probs = c(0, 0.25, 0.75, 1).

Using quantile()

  • This gives the two output thresholds we are after.
quantile(sdg_temp$output, c(0, 0.25, 0.75, 1))
     0%     25%     75%    100% 
 6500.0 15192.5 37447.5 83320.0 
  • We use these thresholds to classify observations according to whether their output belongs to the three induced bins:
  • \([0, 15192.5]\): low income
  • \((15192.5, 37447.5]\): middle income
  • \((37447.5, 83320]\): high income

3.3 Using cut()

  • We can construct this classification using the cut() function.
  • The cut() function expects a numeric vector and a set of breaks.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • The result is less than stellar.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • We have gotten three levels… but they are not named low, middle, and high.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • Moreover, we have gotten an NA value.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • Relatedly, one of the levels (the lowest) is open on the left, which is not exactly how we wanted it.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] 
[25] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04]  (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • We can instruct cut() to include the lower bound in the first bin by setting include.lowest = TRUE.

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)
 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] 
[25] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04]  (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
  • This also fixes the NA issue (why?).

Using cut()

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE,
  labels = c("low", "middle", "high")
)
 [1] high   middle low    high   middle middle middle high   middle middle
[11] middle middle middle middle middle middle low    low    high   high  
[21] middle middle high   low    low    middle high   high   low    middle
[31] low    low    high   middle middle low   
Levels: low middle high
  • Finally, we can instruct cut() to label the bins as low, middle, and high by setting the labels argument.

3.4 Adding the income categorical variable: Attempt 2 (reusable)

  • Combining the cut() function with the quantile() function, we can create the income variable.
  • In this approach, we do not arbitrarily choose and hard-code the thresholds.
  • And the solution we programmed can be reused if the scale of the output variable changes.

Adding the income categorical variable: Attempt 2 (reusable)

  • Combining the cut() function with the quantile() function, we can create the income variable.
  • More generally, the combination of cut() and quantile() can be used to discretize continuous variables into categorical ones based on empirical percentiles.

Adding the income categorical variable: Attempt 2 (reusable)

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
# A tibble: 36 × 7
   geo                                    year flag  output growth state income
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl> <fct> 
 1 Austria                                2023 <NA>   37860   -1.8 TRUE  high  
 2 Belgium                                2023 p      37310    0.4 TRUE  middle
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE  low   
 4 Switzerland                            2023 p      63870   -0.8 TRUE  high  
 5 Cyprus                                 2023 p      29080    1   TRUE  middle
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE  middle
 7 Germany                                2023 p      36290   -1.1 TRUE  middle
 8 Denmark                                2023 <NA>   52510    1.8 TRUE  high  
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE middle
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE middle
# ℹ 26 more rows

4 A summary of the GDP data frame transformations

  • We summarize here the transformations we applied to the estat_sdg_08_10_en data to create the gdp data frame.
  • We want to have the code constructing the gdp data frame, from import to final structure, in a single place.
  • Furthermore, using our current understanding, we apply some improvements to the code to make it more concise, self-contained, and beautiful!

4.1 The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
  • First, we observe that LAST UPDATE is not used in the final or any intermediate structure.
  • Having its type definition in line read_csv adds only noise when reading the code.

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
  • Second, the unit column type is defined in read_csv, but, because it is used as the pivoting variable in pivot_wider, it does not appear in the final structure.
  • Instead, the categorical variable flag (originally OBS_FLAG), which is used in the final structure, is never assigned a type.

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )
  • Third, we can use dplyr:: to call the dplyr functions directly, informing the reader where the functions come from and avoiding loading the entire dplyr namespace and masking other functions.

The gdp data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  dplyr::select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  dplyr::rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  dplyr::mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  dplyr::filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  dplyr::mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

5 The ict data frame

  • Starting from estat_isoc_sks_itspt_en.csv, we can apply similar transformations to create the ict data frame (How?).

5.1 A view at the ict data

5.2 The ict times series

  • When working on creating the gdp data frame, we have filtered the data to keep only the year 2023.
  • In the ict data frame, we have kept the time dimension of the data.
  • For each country, we have multiple observations.
  • How many observations do we have for each country?

5.3 Counting observations

  • How many observations do we have for each country?
  • We can use the count() function to answer the question.
  • Similar to other dplyr functions, count() takes the data frame as its first argument.
  • We can then specify the variables in the data frame that dictate our counting logic.
  • The result of count() gives the number of observations for each unique combination of values in the specified variables.

Counting observations

  • How many observations do we have for each country?
ict |>
  dplyr::count(geo)
# A tibble: 37 × 2
   geo                        n
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows
  • Count selects the geo variable.
  • Filters out the duplicate rows of the selected variables.
  • Creates a new column n of integer type.
  • And fills it with the number of observations for each value of geo.

Counting observations

  • If we call count() without specifying any variables, we get the total number of observations in the data frame.
ict |>
  dplyr::count()
# A tibble: 1 × 1
      n
  <int>
1   654

Counting observations

  • We can also use count() with more than one variable.
ict |>
  dplyr::mutate(after_2020 = year > 2020) |>
  dplyr::count(geo, after_2020)
# A tibble: 70 × 3
   geo                    after_2020     n
   <fct>                  <lgl>      <int>
 1 Austria                FALSE         17
 2 Austria                TRUE           3
 3 Bosnia and Herzegovina TRUE           3
 4 Belgium                FALSE         17
 5 Belgium                TRUE           3
 6 Bulgaria               FALSE         17
 7 Bulgaria               TRUE           3
 8 Switzerland            FALSE         10
 9 Switzerland            TRUE           3
10 Cyprus                 FALSE         17
# ℹ 60 more rows

Counting observations

  • The transformation of count() is a group-wise operation.
  • Behind the scenes, count() groups the data frame by the specified variables.
  • Counting is a group-wise operation that often appears in data science applications, but is not the only one.

6 Group-Wise operations

  • We take a closer look at group-wise transformations.
  • We start with manually replicating the behavior of count() to get a first idea of how grouping works.
  • Subsequently, we will examine more group-wise transformations.

6.1 Manually replicating count()

  • We want to group the data by country and count the number of observations for each country.
  • First, we need to instruct dplyr that we want to group the data by country.
  • We can use the group_by() function.

6.2 Using group_by()

  • We can use the group_by() function.
  • Similar to other dplyr functions, group_by() takes the data frame as its first argument.
  • We can then specify the variables in the data frame based on which we want to create groups.

Using group_by()

ict |>
  dplyr::group_by(geo)
# A tibble: 654 × 4
# Groups:   geo [37]
   geo      year ict_percentage ict_thousand_persons
   <fct>   <int>          <dbl>                <dbl>
 1 Austria  2004            2.9                 106.
 2 Austria  2005            3.1                 116 
 3 Austria  2006            3.3                 126 
 4 Austria  2007            3.3                 128.
 5 Austria  2008            3.3                 133.
 6 Austria  2009            3.5                 139.
 7 Austria  2010            3.5                 141.
 8 Austria  2011            3.6                 145 
 9 Austria  2012            3.6                 147 
10 Austria  2013            3.7                 153.
# ℹ 644 more rows
  • We observe a new line in the data frame standard output, informing us that the data frame is grouped.
  • The grouping variable is geo.
  • And there are in total 37 groups in the data.
  • Other than grouping, the data frame did not change.

6.3 Transformations per group

  • As long as a data frame is grouped, we can execute transformation operations within each group instead of on the complete data frame at once.
  • If we instruct dplyr to count observations on a grouped data frame, it counts the number of observations in each group.

Transformations per group

  • The most usual grouped operation is to create a summary.
  1. A summary calculates a summary statistic per group.
  2. It reduces the data frame such that we get a single row for each group.
  3. And assigns the calculated statistics to groups.

6.4 Using summarize()

  • The most usual grouped operation is to create a summary.
  • In dplyr, summarizing operations are performed via summarize().
  • The basic calling interface of summarize() is similar to that of mutate().
  • The first argument is a data frame.
  • Then, we specify a series of summarizing transformations that create new columns.

Using summarize()

  • The most usual grouped operation is to create a summary.
  • In dplyr, summarizing operations are performed via summarize().
  • In contrast to mutate(), summarize() returns a new data frame containing
    • the grouping variables,
    • the newly created summarizing variables, and
    • only one row per group.

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())
# A tibble: 37 × 2
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows
  • We use dplyr’s function n() as the summarizing statistic in this example.
  • The function n() counts the number of observations in each group.
  • We assign the result of n() to a new variable nobs.
  • Observe that, after calling summarize(), the resulting data frame is not grouped.

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())
  • In most cases, the default behavior of summarize() is to ungroup the most nested grouping variable.
  • Since, in this example, the only grouping variable is geo, removing grouping based on geo results in an ungrouped data frame.
  • We can override the default behavior by passing the argument .groups = "keep" to summarize().

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")
# A tibble: 37 × 2
# Groups:   geo [37]
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows

Using summarize()

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")
# A tibble: 37 × 2
# Groups:   geo [37]
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows
  • In addition, summarize() returns only the grouping and newly created variables.

Using summarize()

  • In addition, summarize() returns only the grouping and newly created variables.
  • What if we want to all retain variables and assign the corresponding number of observations within each group in a new column?

6.5 Using mutate() with groups

  • We want to retain all variables and assign the corresponding number of observations within each group in a new column.
  • In such a case, we can combine group_by(), mutate(), and n() to achieve the desired result.

Using mutate() with groups

  • We want to retain all variables and assign the corresponding number of observations within each group in a new column.
  • In a grouped data frame, mutate() applies its transformations within each group.
  • It creates the corresponding new variables.
  • Assigns the transformation results for each group to the new variables.
  • And retains all other variables in the data frame.

Using mutate() with groups

ict |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::arrange(nobs)
# A tibble: 654 × 5
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons  nobs
   <fct>                  <int>          <dbl>                <dbl> <int>
 1 Bosnia and Herzegovina  2021            1.5                 17.6     3
 2 Bosnia and Herzegovina  2022            1.7                 19.3     3
 3 Bosnia and Herzegovina  2023            2                   24.3     3
 4 United Kingdom          2011            4.8               1392.      9
 5 United Kingdom          2012            5                 1461.      9
 6 United Kingdom          2013            4.9               1477.      9
 7 United Kingdom          2014            5                 1517.      9
 8 United Kingdom          2015            5.2               1624.      9
 9 United Kingdom          2016            5.3               1674       9
10 United Kingdom          2017            5.2               1657.      9
# ℹ 644 more rows
  • We observe that mutate() keeps all the rows of the original data frame.
  • It assigns the same value (the number of observations) to all group rows.
  • And, unlike summarize(), maintains the grouping.

Using mutate() with groups

ict |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::ungroup()
# A tibble: 654 × 5
   geo      year ict_percentage ict_thousand_persons  nobs
   <fct>   <int>          <dbl>                <dbl> <int>
 1 Austria  2004            2.9                 106.    20
 2 Austria  2005            3.1                 116     20
 3 Austria  2006            3.3                 126     20
 4 Austria  2007            3.3                 128.    20
 5 Austria  2008            3.3                 133.    20
 6 Austria  2009            3.5                 139.    20
 7 Austria  2010            3.5                 141.    20
 8 Austria  2011            3.6                 145     20
 9 Austria  2012            3.6                 147     20
10 Austria  2013            3.7                 153.    20
# ℹ 644 more rows
  • To drop grouping, we need to manually call ungroup().

6.6 Grouped mutate() vs. summarize()

Action mutate() with grouping summarize()
Transformation Per group Per group
Assignment One value per group One value per group
Result rows All rows One row per group
Result columns All columns Grouped columns
Automated ungrouping No Yes

6.7 Other summarizing transformations

  • Summarizing transformations go beyond calculating the number of observations.
  • We can calculate means, medians, standard deviations, and other summary statistics with similar approaches.
  • Using summarize() and collapsing each group into a single row.
  • Or using mutate() and keeping all rows.

6.8 Basic summary statistics

  • For example, we can calculate the average percentage of employment in the ICT sector for each country using the mean() function.

Basic summary statistics

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(avg = mean(ict_percentage))
# A tibble: 37 × 2
   geo                      avg
   <fct>                  <dbl>
 1 Austria                 3.90
 2 Bosnia and Herzegovina  1.73
 3 Belgium                 4.44
 4 Bulgaria                2.82
 5 Switzerland             5.08
 6 Cyprus                  2.94
 7 Czechia                 3.99
 8 Germany                 3.88
 9 Denmark                 4.84
10 Estonia                 4.51
# ℹ 27 more rows
  • Observe that summarize works with base R functions like mean().
  • This is true for other base R functions like median(), sd(), min(), and max().
  • Thus, we can calculate summary statistics for each group in a dataset.

Basic summary statistics

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    nobs = dplyr::n(),
    min = min(ict_percentage),
    mean = mean(ict_percentage),
    median = median(ict_percentage),
    sd = sd(ict_percentage),
    max = max(ict_percentage)
  )
# A tibble: 37 × 7
   geo                     nobs   min  mean median    sd   max
   <fct>                  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
 1 Austria                   20   2.9  3.90   3.65 0.653   5.3
 2 Bosnia and Herzegovina     3   1.5  1.73   1.7  0.252   2  
 3 Belgium                   20   3.4  4.44   4.2  0.669   5.6
 4 Bulgaria                  20   2.2  2.82   2.5  0.587   4.3
 5 Switzerland               13   4.3  5.08   5    0.438   5.7
 6 Cyprus                    20   2.2  2.94   2.7  0.810   5.4
 7 Czechia                   20   3.4  3.99   3.95 0.439   4.6
 8 Germany                   20   3.1  3.88   3.7  0.550   5  
 9 Denmark                   20   4.1  4.84   4.8  0.481   5.9
10 Estonia                   20   2.5  4.51   4.1  1.29    6.7
# ℹ 27 more rows

6.9 Custom summary statistics

  • We can also use custom transformations.
ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    custom1 = sd(ict_percentage) / median(ict_percentage),
    custom2 = 2 * mean(ict_percentage) - 1
  )
# A tibble: 37 × 3
   geo                    custom1 custom2
   <fct>                    <dbl>   <dbl>
 1 Austria                 0.179     6.81
 2 Bosnia and Herzegovina  0.148     2.47
 3 Belgium                 0.159     7.89
 4 Bulgaria                0.235     4.65
 5 Switzerland             0.0876    9.15
 6 Cyprus                  0.300     4.87
 7 Czechia                 0.111     6.98
 8 Germany                 0.149     6.76
 9 Denmark                 0.100     8.68
10 Estonia                 0.314     8.03
# ℹ 27 more rows

7 Slicing data

  • Another useful data transformation operation is slicing.
  • Similar to filtering, slicing can be used to select a subset of the data.
  • Most slicing operations can also be written as filtering operations.
  • But slicing can be more convenient than filtering on some occasions.

7.1 Slicing by position

  • For example, slicing can be used to get the first \(n\) rows within a group.
ict |>
  dplyr::group_by(geo) |>
  dplyr::arrange(year) |>
  dplyr::slice_head(n=3)
# A tibble: 111 × 4
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons
   <fct>                  <int>          <dbl>                <dbl>
 1 Austria                 2004            2.9                106. 
 2 Austria                 2005            3.1                116  
 3 Austria                 2006            3.3                126  
 4 Bosnia and Herzegovina  2021            1.5                 17.6
 5 Bosnia and Herzegovina  2022            1.7                 19.3
 6 Bosnia and Herzegovina  2023            2                   24.3
 7 Belgium                 2004            3.4                143. 
 8 Belgium                 2005            3.5                147. 
 9 Belgium                 2006            3.7                156. 
10 Bulgaria                2004            2.2                 63.8
# ℹ 101 more rows

Slicing by position

  • Or to get the last \(n\) rows within a group.
ict |>
  dplyr::group_by(geo) |>
  dplyr::arrange(year) |>
  dplyr::slice_tail(n=3)
# A tibble: 111 × 4
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons
   <fct>                  <int>          <dbl>                <dbl>
 1 Austria                 2021            4.5                192. 
 2 Austria                 2022            5                  221. 
 3 Austria                 2023            5.3                237. 
 4 Bosnia and Herzegovina  2021            1.5                 17.6
 5 Bosnia and Herzegovina  2022            1.7                 19.3
 6 Bosnia and Herzegovina  2023            2                   24.3
 7 Belgium                 2021            5.6                272. 
 8 Belgium                 2022            5.6                278. 
 9 Belgium                 2023            5.4                273. 
10 Bulgaria                2021            3.5                108  
# ℹ 101 more rows

7.2 Other slicing operations

  • Besides slice_head() and slice_tail(), there are other slicing operations available in dplyr.
  • slice_sample(n = 1): Randomly selects \(n\) rows from each group.
  • slice_min(n = 1): Returns the rows with the \(n\) smallest values of a variable in a group.
  • slice_max(n = 1): Returns the rows with the \(n\) largest values of a variable in a group.

7.3 Slicing vs. filtering

ict |>
  dplyr::group_by(geo) |>
  dplyr::filter(ict_percentage == max(ict_percentage))
ict |>
  dplyr::group_by(geo) |>
  dplyr::slice_max(ict_percentage)

8 Accessing columns

  • On certain occasions, we want to access a column of a data frame and perform some operations on it beyond the dplyr pipeline.
  • There are multiple ways we can access a column(s) of a data frame.
  1. Using the pull() function from the dplyr package.
  2. Using the $ operator.
  3. Using the [[ operator.
  4. Using the [ operator (for multiple columns).

8.1 Pulling a column

  • One approach, which we have already seen, is to use the pull() function from the dplyr package.
pulled_col <- gdp |>
  dplyr::pull(geo)
  • The variable pulled_col is a vector containing the values of the geo column of the gdp data frame.
  • Its length is equal to the number of rows of the gdp data frame.

Pulling a column

  • Its length is equal to the number of rows of the gdp data frame.
  • We can verify this by using the length() function from base R.
length(pulled_col) == nrow(gdp)
[1] TRUE
  • Verifying that pulled_col is identical to the geo column requires a bit more work.

8.2 Programming digression: all and any

  • In R, using one of the logical comparison operators, such as ==, to compare two vectors returns a logical vector.
v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

v1 <= v2
[1] TRUE TRUE TRUE
v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

v3 == v4
[1]  TRUE FALSE
  • Each element of the resulting vector is the result of the comparison of the corresponding elements of the input vectors.

Programming digression: all and any

  • Occasionally, we want to check if a logical condition is elementwise satisfied for all elements of two vectors.
  • The all() function from base R accepts a logical vector and returns TRUE if all elements are TRUE.
v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

all(v1 <= v2)
[1] TRUE
v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

all(v3 == v4)
[1] FALSE

Programming digression: all and any

  • The any() function from base R accepts a logical vector and returns TRUE if at least one element is TRUE.
v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

any(v1 <= v2)
[1] TRUE
v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

any(v3 == v4)
[1] TRUE

8.3 Accessing a column via $

  • The dollar operator $ is a special R operator acting on vectors, lists, and data frames to extract or replace parts.
  • In the case of data frames, we can use $ to directly access the values of a column as a vector.
  • For example,
gdp$geo

Accessing a column via $

  • We can now verify that the pulled_col extracted with pull() is identical to the geo column of gdp.
all(pulled_col == gdp$geo)
[1] TRUE

8.4 Accessing a column via [[

  • Another way to access list or data frame parts is by using the double bracket [[ indexing operator.
  • For example,
gdp[["geo"]]

Accessing a column via [[

  • We can also verify that all three ways we have seen give equivalent results.
all(pulled_col == gdp[["geo"]])
[1] TRUE

8.5 Accessing columns via [

  • What if we want to access multiple columns at once?
  • We already know that we can use dplyr’s select() to select multiple columns.
  • Additionally, we can use the [ operator to access multiple columns of a data frame.

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
gdp[c("geo", "year")]
# A tibble: 36 × 2
   geo                                    year
   <chr>                                 <int>
 1 Austria                                2023
 2 Belgium                                2023
 3 Bulgaria                               2023
 4 Switzerland                            2023
 5 Cyprus                                 2023
 6 Czechia                                2023
 7 Germany                                2023
 8 Denmark                                2023
 9 Euro area - 19 countries  (2015-2022)  2023
10 Euro area – 20 countries (from 2023)   2023
# ℹ 26 more rows

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
  • We can verify that the result is equivalent to using select().
all(
  gdp[c("geo", "year")] == gdp |> dplyr::select(geo, year)
)
[1] TRUE

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
  • The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.
gdp["geo"]
# A tibble: 36 × 1
   geo                                  
   <chr>                                
 1 Austria                              
 2 Belgium                              
 3 Bulgaria                             
 4 Switzerland                          
 5 Cyprus                               
 6 Czechia                              
 7 Germany                              
 8 Denmark                              
 9 Euro area - 19 countries  (2015-2022)
10 Euro area – 20 countries (from 2023) 
# ℹ 26 more rows

Accessing columns via [

  • Additionally, we can use the [ operator to access multiple columns of a data frame.
  • The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.
  • Caution is advised when comparing against a vector:
all(gdp["geo"] == gdp$geo)
[1] TRUE
length(gdp["geo"]) == length(gdp$geo)
[1] FALSE
dim(gdp["geo"]) == dim(gdp$geo)
logical(0)

8.6 Programming digression: Set operations

  • We have already seen a method of getting the distinct values of a column via dplyr’s distinct().
  • We can achieve the same result using unique() from base R.
gdp_geo <- unique(gdp$geo)
  • And we can check that the two approaches give the same result.
y <- gdp |> dplyr::distinct(geo) |> dplyr::pull()
all(y == gdp_geo)
[1] TRUE

Programming digression: set operations

  • When working with vectors, it can be helpful to have some basic set operations in mind.
  • One of the commonly used operations is checking whether a vector contains an element.
  • This can be syntactically expressed using the %in% operation.
"USA" %in% gdp_geo
[1] FALSE
"Germany" %in% gdp_geo
[1] TRUE

Programming digression: set operations

  • We can take this one step further and examine whether all values in a vector belong in another vector.
  • In terms of set jargon, we want to check if a set \(S_{1}\) is a subset of another set \(S_{2}\).
  • This is true if and only if all elements of \(S_{1}\) belong in \(S_{2}\).

Programming digression: set operations

  • Using %in% with two vectors returns a new boolean vector.
x <- c("USA", "Germany")
x %in% gdp_geo
[1] FALSE  TRUE
  • The resulting vector has a length equal to the length of the vector on the left-hand side of %in%.
  • Using %in% with a vector on the left-hand side checks elementwise if the values of the left vector can be found in the values of the right vector.

Programming digression: set operations

  • With this in mind, we can combine %in% with all to examine if a set is a subset of another.
x <- c("USA", "Germany")
all(x %in% gdp_geo)
[1] FALSE
  • How can we figure out which values of the left vector do not appear on the right?

Programming digression: set operations

  • How can we figure out which values of the left vector do not appear on the right?
  • Base R has a function setdiff() that does exactly this.
x <- c("USA", "Germany")
setdiff(x, gdp_geo)
[1] "USA"
  • Calling setdiff() returns the values of the first argument that do not appear in the second argument.

Programming digression: set operations

  • The order of the arguments is important in this case because the set difference operation is not symmetric.
x <- c("USA", "Germany")
y <- c("France", "Germany")
setdiff(x, y)
[1] "USA"
x <- c("USA", "Germany")
y <- c("France", "Germany")
setdiff(y, x)
[1] "France"

Programming digression: set operations

  • We use the intersect() function to find the common elements of two vectors.
  • Unlike setdiff(), the order of the arguments does not matter because the intersection operation is symmetric.
x <- c("USA", "Germany")
y <- c("France", "Germany")
intersect(x, y)
[1] "Germany"
x <- c("USA", "Germany")
y <- c("France", "Germany")
intersect(y, x)
[1] "Germany"

Programming digression: set operations

  • To get the combination of unique values, we use union().
  • Unions are symmetric, so any order of the arguments gives the same elements.
  • The order of the elements can be different, though.
x <- c("USA", "Germany")
y <- c("France", "Germany")
union(x, y)
[1] "USA"     "Germany" "France" 
x <- c("USA", "Germany")
y <- c("France", "Germany")
union(y, x)
[1] "France"  "Germany" "USA"    

9 Combining data

  • So far, we have created two data frames.
  • gdp: Contains growth rates and output data for EU and (some) non-EU countries.
  • ict: Contains ICT employment data for EU and (some) non-EU countries.

9.1 Data sources in the wild

  • We have arrived at one of the most important topics in data analysis.
  • While in most educational settings, data is provided in a single unit, in real-world scenarios, data is often scattered across multiple sources.
  • Much of the time in real-world data analysis work is spent on combining information from different sources.

Data sources in the wild

  • The process of combining data sources takes many names in the wild.
  • Sometimes combining data is referred to as merging, joining, or linking.
  • On some occasions, binding, appending, and concatenating are also used.
  • Even worse, the jargon is not unified across fields, and the same term can be used with different meanings.

Data sources in the wild

  • We will follow the join wording for two reasons:
    • First, it can precisely define the exact type of data combination we want to perform.
    • Second, it is synergetic with the semantics of Structured Query Language (SQL).
  • We examine a motivating problem.

9.2 Combining the gdp and ict data frames

  • We examine the columns of the gdp and ict data frames.
  • We can do this using the names() function from base R.
names(gdp)
[1] "geo"    "year"   "flag"   "output" "growth" "state"  "income"
names(ict)
[1] "geo"                  "year"                 "ict_percentage"      
[4] "ict_thousand_persons"
  • We observe that the columns geo and year are common in both data frames.

Combining the gdp and ict data frames

  • We observe that the columns geo and year are common in both data frames.
  • The columns geo and year are common, but their underlying values are not identical.

Combining the gdp and ict data frames

  • We observe that the columns geo and year are common in both data frames.
  • The columns geo and year are common, but their underlying values are not identical.
  • For instance, the two geo columns differences are:
setdiff(unique(gdp$geo), unique(ict$geo))
[1] "Euro area - 19 countries  (2015-2022)"
[2] "Euro area – 20 countries (from 2023)" 
setdiff(unique(ict$geo), unique(gdp$geo))
[1] "Bosnia and Herzegovina" "North Macedonia"        "United Kingdom"        

Combining the gdp and ict data frames

  • This already raises a few questions when combining the two data frames.
  1. Do we want to keep all the rows of the gdp data frame?
  2. Do we want to keep all the rows of the ict data frame?
  3. Do we want to keep only the matching rows of the two data frames?
  4. Do we want to keep all the rows of both data frames?

9.3 Combining gdp and ict: First approach

  1. We want to keep all the rows of the gdp data frame.
  • And add two new columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.
  • How can we handle the rows in gdp that do not have a corresponding row in ict?

Combining gdp and ict: First approach

  1. We want to keep all the rows of the gdp data frame.
  • And add two new columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.
  • If the geo and year values of a row in gdp are equal to the geo and year values of a row in ict, then copy the values.
  • If the geo and year values of a row in gdp cannot be found in ict, then assign NA.

Combining gdp and ict: First approach

  • We can achieve this using the left_join() function from the dplyr package.
gdp |>
  dplyr::left_join(ict)
  • In its most basic use case, left_join() takes two data frames, x and y.
  • It then returns a joined data frame that maintains all the rows and columns of x and adds the columns of y that do not exist in x.
  • The matching is done based on the common columns of the two data frames.

Combining gdp and ict: First approach

  • We can achieve this using the left_join() function from the dplyr package.
gdp |>
  dplyr::left_join(ict)
Joining with `by = join_by(geo, year)`
# A tibble: 36 × 9
   geo                      year flag  output growth state income ict_percentage
   <chr>                   <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria                  2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium                  2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria                 2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland              2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus                   2023 p      29080    1   TRUE  middle            5.4
 6 Czechia                  2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany                  2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark                  2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Euro area - 19 countri…  2023 <NA>   32340   -0.3 FALSE middle           NA  
10 Euro area – 20 countri…  2023 <NA>   32150   -0.2 FALSE middle           NA  
# ℹ 26 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

9.4 Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
  • First, we can check the dimensions of the resulting data frame using the dim() function.
gdp_ict_lj <- gdp |>
  dplyr::left_join(ict)
Joining with `by = join_by(geo, year)`
dim(gdp_ict_lj)
[1] 36  9
dim(gdp)
[1] 36  7

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
  • Second, we check that the resulting geo and year columns contain exactly the values of the gdp data frame.
all(gdp_ict_lj[c("geo", "year")] == gdp[c("geo", "year")])
[1] TRUE

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
  • Third, we check that the resulting ict_percentage and ict_thousand_persons columns are NA for the rows that do not have a corresponding row in the ict data frame.
  • To check this, we need to calculate the set difference between the unique combinations of geo and year in the gdp and ict data frames.
  • However, setdiff() works with vectors, not data frames.
  • An easy way to combine these columns is to use the paste() function.
  • Among other things, paste() accepts two or more vectors, converts them to strings, and concatenates them element-wise.
v1 <- c("Hello", "Hi")
v2 <- c("Earth", "Jupiter")
paste(v1, v2)
[1] "Hello Earth" "Hi Jupiter" 

Combining gdp and ict: First approach verification

  • Let us verify that the left_join() function works as expected.
gdp_ict_lj |>
  dplyr::filter(is.na(ict_thousand_persons))
# A tibble: 3 × 9
  geo                       year flag  output growth state income ict_percentage
  <chr>                    <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
1 Euro area - 19 countrie…  2023 <NA>   32340   -0.3 FALSE middle             NA
2 Euro area – 20 countrie…  2023 <NA>   32150   -0.2 FALSE middle             NA
3 Montenegro                2023 p       6900    3.7 TRUE  low                NA
# ℹ 1 more variable: ict_thousand_persons <dbl>
setdiff(paste(gdp$geo, gdp$year), paste(ict$geo, ict$year))
[1] "Euro area - 19 countries  (2015-2022) 2023"
[2] "Euro area – 20 countries (from 2023) 2023" 
[3] "Montenegro 2023"                           

9.5 Combining gdp and ict: Second approach

  1. We want to keep all the rows of the ict data frame.
  • And add five new columns flag, output, growth, state, and income with the values from the gdp data frame.

Combining gdp and ict: Second approach

  1. We want to keep all the rows of the ict data frame.
  • We can formulate this problem in terms of a left join by swapping gdp and ict in our previous approach.

Combining gdp and ict: Second approach

  1. We want to keep all the rows of the ict data frame.
ict |>
  dplyr::left_join(gdp)
Joining with `by = join_by(geo, year)`
# A tibble: 654 × 9
   geo      year ict_percentage ict_thousand_persons flag  output growth state
   <chr>   <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria  2004            2.9                 106. <NA>      NA     NA NA   
 2 Austria  2005            3.1                 116  <NA>      NA     NA NA   
 3 Austria  2006            3.3                 126  <NA>      NA     NA NA   
 4 Austria  2007            3.3                 128. <NA>      NA     NA NA   
 5 Austria  2008            3.3                 133. <NA>      NA     NA NA   
 6 Austria  2009            3.5                 139. <NA>      NA     NA NA   
 7 Austria  2010            3.5                 141. <NA>      NA     NA NA   
 8 Austria  2011            3.6                 145  <NA>      NA     NA NA   
 9 Austria  2012            3.6                 147  <NA>      NA     NA NA   
10 Austria  2013            3.7                 153. <NA>      NA     NA NA   
# ℹ 644 more rows
# ℹ 1 more variable: income <fct>

Combining gdp and ict: Second approach

  1. We want to keep all the rows of the ict data frame.
  • An equivalent way to achieve this is to use the right_join() function.

Combining gdp and ict: Second approach

  1. We want to keep all the rows of the ict data frame.
gdp |>
  dplyr::right_join(ict)
Joining with `by = join_by(geo, year)`
# A tibble: 654 × 9
   geo          year flag  output growth state income ict_percentage
   <chr>       <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria      2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium      2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria     2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland  2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus       2023 p      29080    1   TRUE  middle            5.4
 6 Czechia      2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany      2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark      2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Estonia      2023 <NA>   15250   -5.4 TRUE  middle            6.7
10 Greece       2023 p      19460    2.6 TRUE  middle            2.4
# ℹ 644 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining gdp and ict: Second approach

  1. We want to keep all the rows of the ict data frame.
  • Observe, however, that the order of the rows and columns resulting from a left_join() and a right_join() are different.
  • Both left_join() and right_join() maintain the order of the rows and columns of the x argument.

Combining gdp and ict: Third approach

  1. We want to keep only the matching rows of the two data frames.
  • This approach is the most common type of join operation.
  • The left and right join operations introduce NA values in the resulting data frame if there are non-matching rows.
  • Statistical methods often do not work with NA values.
  • Thus, practically, we often want to keep only the matching rows.

Combining gdp and ict: Third approach

  1. We want to keep only the matching rows of the two data frames.
  • We can keep only the matching rows using the inner_join() function.

Combining gdp and ict: Third approach

  1. We want to keep only the matching rows of the two data frames.
gdp |>
  dplyr::inner_join(ict)
Joining with `by = join_by(geo, year)`
# A tibble: 33 × 9
   geo          year flag  output growth state income ict_percentage
   <chr>       <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria      2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium      2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria     2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland  2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus       2023 p      29080    1   TRUE  middle            5.4
 6 Czechia      2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany      2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark      2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Estonia      2023 <NA>   15250   -5.4 TRUE  middle            6.7
10 Greece       2023 p      19460    2.6 TRUE  middle            2.4
# ℹ 23 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining gdp and ict: Third approach

  1. We want to keep only the matching rows of the two data frames.
  • Similar to the left_join() and right_join() functions, the inner_join() function maintains the order of the rows and columns of the x argument.

Combining gdp and ict: Third approach

  1. We want to keep only the matching rows of the two data frames.
ict |>
  dplyr::inner_join(gdp)
Joining with `by = join_by(geo, year)`
# A tibble: 33 × 9
   geo        year ict_percentage ict_thousand_persons flag  output growth state
   <chr>     <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria    2023            5.3                237.  <NA>   37860   -1.8 TRUE 
 2 Belgium    2023            5.4                273.  p      37310    0.4 TRUE 
 3 Bulgaria   2023            4.3                126.  <NA>    7900    2.2 TRUE 
 4 Switzerl…  2023            5.7                273   p      63870   -0.8 TRUE 
 5 Cyprus     2023            5.4                 24.7 p      29080    1   TRUE 
 6 Czechia    2023            4.3                218.  <NA>   18480   -1.2 TRUE 
 7 Germany    2023            4.9               2108.  p      36290   -1.1 TRUE 
 8 Denmark    2023            5.9                177.  <NA>   52510    1.8 TRUE 
 9 Estonia    2023            6.7                 46.5 <NA>   15250   -5.4 TRUE 
10 Greece     2023            2.4                100.  p      19460    2.6 TRUE 
# ℹ 23 more rows
# ℹ 1 more variable: income <fct>

Combining gdp and ict: Fourth approach

  1. We want to keep all the rows of both data frames.
  • We can keep all the rows of both data frames using the full_join() function.

Combining gdp and ict: Fourth approach

  1. We want to keep all the rows of both data frames.
ict |>
  dplyr::full_join(gdp)
Joining with `by = join_by(geo, year)`
# A tibble: 657 × 9
   geo      year ict_percentage ict_thousand_persons flag  output growth state
   <chr>   <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria  2004            2.9                 106. <NA>      NA     NA NA   
 2 Austria  2005            3.1                 116  <NA>      NA     NA NA   
 3 Austria  2006            3.3                 126  <NA>      NA     NA NA   
 4 Austria  2007            3.3                 128. <NA>      NA     NA NA   
 5 Austria  2008            3.3                 133. <NA>      NA     NA NA   
 6 Austria  2009            3.5                 139. <NA>      NA     NA NA   
 7 Austria  2010            3.5                 141. <NA>      NA     NA NA   
 8 Austria  2011            3.6                 145  <NA>      NA     NA NA   
 9 Austria  2012            3.6                 147  <NA>      NA     NA NA   
10 Austria  2013            3.7                 153. <NA>      NA     NA NA   
# ℹ 647 more rows
# ℹ 1 more variable: income <fct>