More on data transformation

A follow-up look on data transformation topics. Boolean columns. Discretizing continuous variables. Counting observations. Groupwise operations. Summarizing. Slicing operations. Accessing columns. Combining data with joins.

Published

April 19th, 2026

1 Prerequisites

We examine some aspects of dplyr in more detail.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Prerequisites

We pick up using the data from the data import overview topic.

sdg <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

1.1 Programming digression: `R` namespaces

The snippet for importing the sdg data slightly differs from the code we used in the data import overview topic.

Programming digression: `R` namespaces

sdg <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)

Programming digression: `R` namespaces

We have modified all calls to the readr package’s functions!
- read_csv $\rightarrow$ readr::read_csv
- col_factor $\rightarrow$ readr::col_factor
- col_integer $\rightarrow$ readr::col_integer
- col_datetime $\rightarrow$ readr::col_datetime

However, we have never loaded the readr package!
And the two snippets execute exactly the same underlying operations!

Programming digression: `R` namespaces

Many programming languages, but not all, have a namespace feature.

A namespace is a collection of identifiers, such as function names, variable names, and other symbols, within a single scope (or context).

Programming digression: `R` namespaces

Organizing and managing function and object names in namespaces is a way to avoid conflicts between different modules.

Programming digression: `R` namespaces

Each package in R has its own namespace.
When importing a package with library(), all the functions and objects in the package’s namespace become available globally.
This is why, after library(dplyr), we can use select(), filter(), etc., without specifying the package.

Programming digression: `R` namespaces

Why would one want to use readr::read_csv instead of read_csv?
- After all, read_csv is shorter and easier to type.

Readability: It is easier to understand the code.
Avoiding conflicts: If two packages have functions with the same name, you can specify which one you want to use.

Programming digression: `R` namespaces

Readability: It is easier to understand the code.

First, readr::read_csv informs directly the reader where the read_csv function comes from.
This is relevant both for your coworkers when you share your code with them and for your future self reading the code.
There is no need to scroll up to see which packages are loaded.

Programming digression: `R` namespaces

Avoiding conflicts: If two packages have functions with the same name, you can specify which one you want to use.

Let’s revisit what happens when we load the dplyr package for the first time in an R session.

Programming digression: `R` namespaces

Let’s revisit what happens when we load the dplyr package for the first time in an R session.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Programming digression: `R` namespaces

If instead of loading dplyr, we use its functions with the dplyr:: prefix, we avoid masking functions from other namespaces.

This is less error-prone because we avoid scenarios where we loaded a package and forgot it masks a function from another package we also use.
However, it makes dependency management a bit more cumbersome because the loaded packages are not listed at a single point in the code.

2 Creating boolean variables

We continue from the data transformation overview towards constructing the eu_ict dataset.

A brief reminder of where we left off:

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

We have used mutate() to modify the values of the unit column.
Now, we will use mutate() to create the new columns.

2.1 Storing an intermediate result

A brief reminder of where we left off:

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

# A tibble: 36 × 5
   geo                                    year flag  output growth
   <chr>                                 <int> <chr>  <dbl>  <dbl>
 1 Austria                                2023 <NA>   37860   -1.8
 2 Belgium                                2023 p      37310    0.4
 3 Bulgaria                               2023 <NA>    7900    2.2
 4 Switzerland                            2023 p      63870   -0.8
 5 Cyprus                                 2023 p      29080    1  
 6 Czechia                                2023 <NA>   18480   -1.2
 7 Germany                                2023 p      36290   -1.1
 8 Denmark                                2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2
# ℹ 26 more rows

Storing an intermediate result

To facilitate the presentation, we assign the intermediate result to a new variable.

sdg_temp <- sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

2.2 Adding the `state` boolean variable

We observe that the geo column contains both states and coalitions of states.

We can use the function pull() from dplyr to examine the values of geo.
There are other ways to access the values of a data frame column, but pull() works well with the pipe operator and, hence, stacks well together with other dplyr transformations, such as distinct() and arrange().

Adding the `state` boolean variable

sdg_temp |>
  distinct(geo) |>
  arrange(geo) |>
  pull()

 [1] "Austria"                                  
 [2] "Belgium"                                  
 [3] "Bulgaria"                                 
 [4] "Croatia"                                  
 [5] "Cyprus"                                   
 [6] "Czechia"                                  
 [7] "Denmark"                                  
 [8] "Estonia"                                  
 [9] "Euro area - 19 countries  (2015-2022)"    
[10] "Euro area – 20 countries (from 2023)"     
[11] "European Union - 27 countries (from 2020)"
[12] "Finland"                                  
[13] "France"                                   
[14] "Germany"                                  
[15] "Greece"                                   
[16] "Hungary"                                  
[17] "Iceland"                                  
[18] "Ireland"                                  
[19] "Italy"                                    
[20] "Latvia"                                   
[21] "Lithuania"                                
[22] "Luxembourg"                               
[23] "Malta"                                    
[24] "Montenegro"                               
[25] "Netherlands"                              
[26] "Norway"                                   
[27] "Poland"                                   
[28] "Portugal"                                 
[29] "Romania"                                  
[30] "Serbia"                                   
[31] "Slovakia"                                 
[32] "Slovenia"                                 
[33] "Spain"                                    
[34] "Sweden"                                   
[35] "Switzerland"                              
[36] "Türkiye"

Adding the `state` boolean variable

We want to create a state binary variable with the value TRUE if the observation corresponds to a state, and the value FALSE otherwise.

Since all coalition observations include the string "Euro", we can easily create the new boolean variable using grepl.

Adding the `state` boolean variable

Since all coalition observations include the string "Euro", we can easily create the new boolean variable using grepl.

sdg_temp |>
  mutate(state = !grepl("Euro", geo))

# A tibble: 36 × 6
   geo                                    year flag  output growth state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>
 1 Austria                                2023 <NA>   37860   -1.8 TRUE 
 2 Belgium                                2023 p      37310    0.4 TRUE 
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE 
 4 Switzerland                            2023 p      63870   -0.8 TRUE 
 5 Cyprus                                 2023 p      29080    1   TRUE 
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE 
 7 Germany                                2023 p      36290   -1.1 TRUE 
 8 Denmark                                2023 <NA>   52510    1.8 TRUE 
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE
# ℹ 26 more rows

Adding the `state` boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo))

# A tibble: 36 × 6
   geo                                    year flag  output growth state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>
 1 Austria                                2023 <NA>   37860   -1.8 TRUE 
 2 Belgium                                2023 p      37310    0.4 TRUE 
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE 
 4 Switzerland                            2023 p      63870   -0.8 TRUE 
 5 Cyprus                                 2023 p      29080    1   TRUE 
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE 
 7 Germany                                2023 p      36290   -1.1 TRUE 
 8 Denmark                                2023 <NA>   52510    1.8 TRUE 
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE
# ℹ 26 more rows

The new variable is placed by default as the last column of the data frame.

Adding the `state` boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo), .after = "geo")

# A tibble: 36 × 6
   geo                                   state  year flag  output growth
   <chr>                                 <lgl> <int> <chr>  <dbl>  <dbl>
 1 Austria                               TRUE   2023 <NA>   37860   -1.8
 2 Belgium                               TRUE   2023 p      37310    0.4
 3 Bulgaria                              TRUE   2023 <NA>    7900    2.2
 4 Switzerland                           TRUE   2023 p      63870   -0.8
 5 Cyprus                                TRUE   2023 p      29080    1  
 6 Czechia                               TRUE   2023 <NA>   18480   -1.2
 7 Germany                               TRUE   2023 p      36290   -1.1
 8 Denmark                               TRUE   2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022) FALSE  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)  FALSE  2023 <NA>   32150   -0.2
# ℹ 26 more rows

We can modify this default behavior by passing the argument .after to mutate().

Adding the `state` boolean variable

sdg_temp |>
  mutate(state = !grepl("Euro", geo), .after = "geo")

# A tibble: 36 × 6
   geo                                   state  year flag  output growth
   <chr>                                 <lgl> <int> <chr>  <dbl>  <dbl>
 1 Austria                               TRUE   2023 <NA>   37860   -1.8
 2 Belgium                               TRUE   2023 p      37310    0.4
 3 Bulgaria                              TRUE   2023 <NA>    7900    2.2
 4 Switzerland                           TRUE   2023 p      63870   -0.8
 5 Cyprus                                TRUE   2023 p      29080    1  
 6 Czechia                               TRUE   2023 <NA>   18480   -1.2
 7 Germany                               TRUE   2023 p      36290   -1.1
 8 Denmark                               TRUE   2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022) FALSE  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)  FALSE  2023 <NA>   32150   -0.2
# ℹ 26 more rows

Observe that we have used the exclamation mark operator ! before grepl().

2.3 Programming digression: Combining logical conditions

Observe that we have used the exclamation mark operator ! before grepl().

Logical conditions in R (as in most, if not all, programming languages) can be combined or negated.

Programming digression: Combining logical conditions

The ! operator negates the logical value of the condition.

!TRUE

[1] FALSE

!FALSE

[1] TRUE

!(2 > 1)

[1] FALSE

!grepl("Sanna", "Sanna is here")

[1] FALSE

Programming digression: Combining logical conditions

Observe that we have used the exclamation mark operator ! before grepl().

The call grepl("Euro", geo) returns TRUE if the string "Euro" is found in the geo column and FALSE otherwise.

We instead want a condition that is TRUE when geo is a state, i.e., when "Euro" is not found in the geo column.
Thus, we have to negate the result of grepl("Euro", geo).

Programming digression: Combining logical conditions

Besides negation, logical conditions can be combined to form more complex conditions.

For example, suppose we want to create a new variable that is TRUE when geo is a state and output is above 50,000.
We already know how to check if geo is a state with !grepl("Euro", geo).
We can check if output is above 50,000 with output > 50000.

Programming digression: Combining logical conditions

Besides negation, logical conditions can be combined to form more complex conditions.
For example, suppose we want to create a new variable that is TRUE when geo is a state and output is above 50,000.

Combining these with a logical AND operator gives us the desired condition.

In R, the logical AND operator is &.

Programming digression: Combining logical conditions

Besides negation, logical conditions can be combined to form more complex conditions.

sdg_temp |>
  mutate(rich_state = !grepl("Euro", geo) & output > 50000)

# A tibble: 36 × 6
   geo                                    year flag  output growth rich_state
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>     
 1 Austria                                2023 <NA>   37860   -1.8 FALSE     
 2 Belgium                                2023 p      37310    0.4 FALSE     
 3 Bulgaria                               2023 <NA>    7900    2.2 FALSE     
 4 Switzerland                            2023 p      63870   -0.8 TRUE      
 5 Cyprus                                 2023 p      29080    1   FALSE     
 6 Czechia                                2023 <NA>   18480   -1.2 FALSE     
 7 Germany                                2023 p      36290   -1.1 FALSE     
 8 Denmark                                2023 <NA>   52510    1.8 TRUE      
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE     
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE     
# ℹ 26 more rows

Programming digression: Combining logical conditions

Besides negation, logical conditions can be combined to form more complex conditions.

We can also combine conditions using logical OR operators.
In R, logical OR is denoted with |.

Programming digression: Combining logical conditions

Besides negation, logical conditions can be combined to form more complex conditions.

For example, we can create a logical condition that is TRUE when geo is either Germany or Cyprus.

We check if geo is Germany with geo == "Germany".
We check if geo is Cyprus with geo == "Cyprus".
Combined, we write geo == "Germany" | geo == "Cyprus".

Programming digression: Combining logical conditions

Besides negation, logical conditions can be combined to form more complex conditions.

sdg_temp |>
  mutate(ger_or_cy = geo == "Germany" | geo == "Cyprus")

# A tibble: 36 × 6
   geo                                    year flag  output growth ger_or_cy
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl>    
 1 Austria                                2023 <NA>   37860   -1.8 FALSE    
 2 Belgium                                2023 p      37310    0.4 FALSE    
 3 Bulgaria                               2023 <NA>    7900    2.2 FALSE    
 4 Switzerland                            2023 p      63870   -0.8 FALSE    
 5 Cyprus                                 2023 p      29080    1   TRUE     
 6 Czechia                                2023 <NA>   18480   -1.2 FALSE    
 7 Germany                                2023 p      36290   -1.1 TRUE     
 8 Denmark                                2023 <NA>   52510    1.8 FALSE    
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE    
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE    
# ℹ 26 more rows

3 Discretizing continuous variables

We have extensively used the income variable in the visualization overview topic to color and shape various geometric objects.

The income variable was a categorical variable with three levels: low, middle, and high.
How can we create it?

3.1 Adding the `income` categorical variable: Attempt 1 (non-reusable)

We want to create a factor variable income with three levels: low, middle, and high.

One idea is to pick two output thresholds.
- A low threshold, below which we consider the output as low.
- A high threshold, above which we consider the output as high.
- And use nested ifelse() statements to create the income variable.

Adding the `income` categorical variable: Attempt 1 (non-reusable)

We want to create a factor variable income with three levels: low, middle, and high.

Let us pick 15,200 and 37,400 as the low and high output thresholds, respectively.

if (output > 37400) {
  "high"
} else if (output > 15200) {
  "middle"
} else {
  "low"
}

Adding the `income` categorical variable: Attempt 1 (non-reusable)

We want to create a factor variable income with three levels: low, middle, and high.
Let us pick 15,200 and 37,400 as the low and high output thresholds, respectively.

Or, with inline code:

ifelse(output > 37400, "high", ifelse(output > 15200, "middle", "low"))

Adding the `income` categorical variable: Attempt 1 (non-reusable)

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = ifelse(
      output > 37400,
      "high",
      ifelse(output > 15200, "middle", "low")
    )
  )

# A tibble: 36 × 7
   geo                                    year flag  output growth state income
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl> <chr> 
 1 Austria                                2023 <NA>   37860   -1.8 TRUE  high  
 2 Belgium                                2023 p      37310    0.4 TRUE  middle
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE  low   
 4 Switzerland                            2023 p      63870   -0.8 TRUE  high  
 5 Cyprus                                 2023 p      29080    1   TRUE  middle
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE  middle
 7 Germany                                2023 p      36290   -1.1 TRUE  middle
 8 Denmark                                2023 <NA>   52510    1.8 TRUE  high  
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE middle
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE middle
# ℹ 26 more rows

Adding the `income` categorical variable: Attempt 1 (non-reusable)

We have successfully created the income variable.

Nonetheless, this approach is not scalable.

First, we have hard-coded the thresholds.
Second, we have apparently picked the thresholds arbitrarily.
Third, what if we want to create a variable with more levels?
Fourth, what if we want to use the solution with another dataset?

Adding the `income` categorical variable: Attempt 1 (non-reusable)

A better approach is to use the cut() and quantile() functions.

A country is considered low income if its output is below the 25th percentile.
A country is considered high income if its output is above the 75th percentile.
And a country is considered middle income otherwise.

3.2 Using `quantile()`

We can take a look at how to use the quantile() function by accessing its documentation.

?quantile

Using `quantile()`

?quantile

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

Using `quantile()`

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

The x argument in our case is the output variable.

Using `quantile()`

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

We need to use the probs argument to specify the percentiles we want to use.

Using `quantile()`

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

The default value of the probs argument is to set the quartile probabilities $[0, 0.25, 0.5, 0.75, 1]$ (why?).

Using `quantile()`

Usage:

 quantile(x, ...)

 ## Default S3 method:
 quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
          names = TRUE, type = 7, digits = 7, ...)

Arguments:

   x: numeric vector whose sample quantiles are wanted, or an
      object of a class for which a method has been defined (see
      also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
      numeric vectors unless ‘na.rm’ is ‘TRUE’.

   probs: numeric vector of probabilities with values in [0,1].
      (Values up to ‘2e-14’ outside that range are accepted and
      moved to the nearby endpoint.)

We want only the 25th and 75th percentiles as thresholds. Thus, we set probs = c(0, 0.25, 0.75, 1).

Using `quantile()`

This gives the two output thresholds we are after.

quantile(sdg_temp$output, c(0, 0.25, 0.75, 1))

     0%     25%     75%    100% 
 6500.0 15192.5 37447.5 83320.0

We use these thresholds to classify observations according to whether their output belongs to the three induced bins:

$[0, 15192.5]$: low income
$(15192.5, 37447.5]$: middle income
$(37447.5, 83320]$: high income

3.3 Using `cut()`

We can construct this classification using the cut() function.

The cut() function expects a numeric vector and a set of breaks.

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]

The result is less than stellar.

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]

We have gotten three levels… but they are not named low, middle, and high.

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]

Moreover, we have gotten an NA value.

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]  (6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] 
[25] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04]  <NA>                (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] 
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]

Relatedly, one of the levels (the lowest) is open on the left, which is not exactly how we wanted it.

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] 
[25] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04]  (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]

We can instruct cut() to include the lower bound in the first bin by setting include.lowest = TRUE.

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
 [4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
 [7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04] 
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] 
[25] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]  (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04]  [6.5e+03,1.52e+04]  (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] 
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]

This also fixes the NA issue (why?).

Using `cut()`

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE,
  labels = c("low", "middle", "high")
)

 [1] high   middle low    high   middle middle middle high   middle middle
[11] middle middle middle middle middle middle low    low    high   high  
[21] middle middle high   low    low    middle high   high   low    middle
[31] low    low    high   middle middle low   
Levels: low middle high

Finally, we can instruct cut() to label the bins as low, middle, and high by setting the labels argument.

3.4 Adding the `income` categorical variable: Attempt 2 (reusable)

Combining the cut() function with the quantile() function, we can create the income variable.

In this approach, we do not arbitrarily choose and hard-code the thresholds.
And the solution we programmed can be reused if the scale of the output variable changes.

Adding the `income` categorical variable: Attempt 2 (reusable)

Combining the cut() function with the quantile() function, we can create the income variable.

More generally, the combination of cut() and quantile() can be used to discretize continuous variables into categorical ones based on empirical percentiles.

Adding the `income` categorical variable: Attempt 2 (reusable)

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

# A tibble: 36 × 7
   geo                                    year flag  output growth state income
   <chr>                                 <int> <chr>  <dbl>  <dbl> <lgl> <fct> 
 1 Austria                                2023 <NA>   37860   -1.8 TRUE  high  
 2 Belgium                                2023 p      37310    0.4 TRUE  middle
 3 Bulgaria                               2023 <NA>    7900    2.2 TRUE  low   
 4 Switzerland                            2023 p      63870   -0.8 TRUE  high  
 5 Cyprus                                 2023 p      29080    1   TRUE  middle
 6 Czechia                                2023 <NA>   18480   -1.2 TRUE  middle
 7 Germany                                2023 p      36290   -1.1 TRUE  middle
 8 Denmark                                2023 <NA>   52510    1.8 TRUE  high  
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3 FALSE middle
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2 FALSE middle
# ℹ 26 more rows

4 A summary of the GDP data frame transformations

We summarize here the transformations we applied to the estat_sdg_08_10_en data to create the gdp data frame.

We want to have the code constructing the gdp data frame, from import to final structure, in a single place.
Furthermore, using our current understanding, we apply some improvements to the code to make it more concise, self-contained, and beautiful!

4.1 The `gdp` data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

The `gdp` data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer(),
    `LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

First, we observe that LAST UPDATE is not used in the final or any intermediate structure.

Having its type definition in line read_csv adds only noise when reading the code.

The `gdp` data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

Second, the unit column type is defined in read_csv, but, because it is used as the pivoting variable in pivot_wider, it does not appear in the final structure.

Instead, the categorical variable flag (originally OBS_FLAG), which is used in the final structure, is never assigned a type.

The `gdp` data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

Third, we can use dplyr:: to call the dplyr functions directly, informing the reader where the functions come from and avoiding loading the entire dplyr namespace and masking other functions.

The `gdp` data frame

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  dplyr::select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  dplyr::rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  dplyr::mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  dplyr::filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  dplyr::mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

5 The `ict` data frame

Starting from estat_isoc_sks_itspt_en.csv, we can apply similar transformations to create the ict data frame (How?).

5.1 A view at the `ict` data

5.2 The `ict` times series

When working on creating the gdp data frame, we have filtered the data to keep only the year 2023.

In the ict data frame, we have kept the time dimension of the data.
For each country, we have multiple observations.
How many observations do we have for each country?

5.3 Counting observations

How many observations do we have for each country?

We can use the count() function to answer the question.

Similar to other dplyr functions, count() takes the data frame as its first argument.
We can then specify the variables in the data frame that dictate our counting logic.
The result of count() gives the number of observations for each unique combination of values in the specified variables.

Counting observations

How many observations do we have for each country?

ict |>
  dplyr::count(geo)

# A tibble: 37 × 2
   geo                        n
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows

Count selects the geo variable.

Filters out the duplicate rows of the selected variables.

Creates a new column n of integer type.

And fills it with the number of observations for each value of geo.

Counting observations

If we call count() without specifying any variables, we get the total number of observations in the data frame.

ict |>
  dplyr::count()

# A tibble: 1 × 1
      n
  <int>
1   654

Counting observations

We can also use count() with more than one variable.

ict |>
  dplyr::mutate(after_2020 = year > 2020) |>
  dplyr::count(geo, after_2020)

# A tibble: 70 × 3
   geo                    after_2020     n
   <fct>                  <lgl>      <int>
 1 Austria                FALSE         17
 2 Austria                TRUE           3
 3 Bosnia and Herzegovina TRUE           3
 4 Belgium                FALSE         17
 5 Belgium                TRUE           3
 6 Bulgaria               FALSE         17
 7 Bulgaria               TRUE           3
 8 Switzerland            FALSE         10
 9 Switzerland            TRUE           3
10 Cyprus                 FALSE         17
# ℹ 60 more rows

Counting observations

The transformation of count() is a group-wise operation.

Behind the scenes, count() groups the data frame by the specified variables.
Counting is a group-wise operation that often appears in data science applications, but is not the only one.

6 Group-Wise operations

We take a closer look at group-wise transformations.

We start with manually replicating the behavior of count() to get a first idea of how grouping works.
Subsequently, we will examine more group-wise transformations.

6.1 Manually replicating `count()`

We want to group the data by country and count the number of observations for each country.

First, we need to instruct dplyr that we want to group the data by country.
We can use the group_by() function.

6.2 Using `group_by()`

We can use the group_by() function.

Similar to other dplyr functions, group_by() takes the data frame as its first argument.

We can then specify the variables in the data frame based on which we want to create groups.

Using `group_by()`

ict |>
  dplyr::group_by(geo)

# A tibble: 654 × 4
# Groups:   geo [37]
   geo      year ict_percentage ict_thousand_persons
   <fct>   <int>          <dbl>                <dbl>
 1 Austria  2004            2.9                 106.
 2 Austria  2005            3.1                 116 
 3 Austria  2006            3.3                 126 
 4 Austria  2007            3.3                 128.
 5 Austria  2008            3.3                 133.
 6 Austria  2009            3.5                 139.
 7 Austria  2010            3.5                 141.
 8 Austria  2011            3.6                 145 
 9 Austria  2012            3.6                 147 
10 Austria  2013            3.7                 153.
# ℹ 644 more rows

We observe a new line in the data frame standard output, informing us that the data frame is grouped.

The grouping variable is geo.

And there are in total 37 groups in the data.

Other than grouping, the data frame did not change.

6.3 Transformations per group

As long as a data frame is grouped, we can execute transformation operations within each group instead of on the complete data frame at once.

If we instruct dplyr to count observations on a grouped data frame, it counts the number of observations in each group.

Transformations per group

The most usual grouped operation is to create a summary.

A summary calculates a summary statistic per group.
It reduces the data frame such that we get a single row for each group.
And assigns the calculated statistics to groups.

6.4 Using `summarize()`

The most usual grouped operation is to create a summary.

In dplyr, summarizing operations are performed via summarize().

The basic calling interface of summarize() is similar to that of mutate().
The first argument is a data frame.
Then, we specify a series of summarizing transformations that create new columns.

Using `summarize()`

The most usual grouped operation is to create a summary.
In dplyr, summarizing operations are performed via summarize().

In contrast to mutate(), summarize() returns a new data frame containing
- the grouping variables,
- the newly created summarizing variables, and
- only one row per group.

Using `summarize()`

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())

# A tibble: 37 × 2
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows

We use dplyr’s function n() as the summarizing statistic in this example.

The function n() counts the number of observations in each group.

We assign the result of n() to a new variable nobs.

Observe that, after calling summarize(), the resulting data frame is not grouped.

Using `summarize()`

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())

In most cases, the default behavior of summarize() is to ungroup the most nested grouping variable.

Since, in this example, the only grouping variable is geo, removing grouping based on geo results in an ungrouped data frame.
We can override the default behavior by passing the argument .groups = "keep" to summarize().

Using `summarize()`

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")

# A tibble: 37 × 2
# Groups:   geo [37]
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows

Using `summarize()`

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")

# A tibble: 37 × 2
# Groups:   geo [37]
   geo                     nobs
   <fct>                  <int>
 1 Austria                   20
 2 Bosnia and Herzegovina     3
 3 Belgium                   20
 4 Bulgaria                  20
 5 Switzerland               13
 6 Cyprus                    20
 7 Czechia                   20
 8 Germany                   20
 9 Denmark                   20
10 Estonia                   20
# ℹ 27 more rows

In addition, summarize() returns only the grouping and newly created variables.

Using `summarize()`

In addition, summarize() returns only the grouping and newly created variables.

What if we want to all retain variables and assign the corresponding number of observations within each group in a new column?

6.5 Using `mutate()` with groups

We want to retain all variables and assign the corresponding number of observations within each group in a new column.

In such a case, we can combine group_by(), mutate(), and n() to achieve the desired result.

Using `mutate()` with groups

We want to retain all variables and assign the corresponding number of observations within each group in a new column.

In a grouped data frame, mutate() applies its transformations within each group.
It creates the corresponding new variables.
Assigns the transformation results for each group to the new variables.
And retains all other variables in the data frame.

Using `mutate()` with groups

ict |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::arrange(nobs)

# A tibble: 654 × 5
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons  nobs
   <fct>                  <int>          <dbl>                <dbl> <int>
 1 Bosnia and Herzegovina  2021            1.5                 17.6     3
 2 Bosnia and Herzegovina  2022            1.7                 19.3     3
 3 Bosnia and Herzegovina  2023            2                   24.3     3
 4 United Kingdom          2011            4.8               1392.      9
 5 United Kingdom          2012            5                 1461.      9
 6 United Kingdom          2013            4.9               1477.      9
 7 United Kingdom          2014            5                 1517.      9
 8 United Kingdom          2015            5.2               1624.      9
 9 United Kingdom          2016            5.3               1674       9
10 United Kingdom          2017            5.2               1657.      9
# ℹ 644 more rows

We observe that mutate() keeps all the rows of the original data frame.

It assigns the same value (the number of observations) to all group rows.

And, unlike summarize(), maintains the grouping.

Using `mutate()` with groups

ict |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::ungroup()

# A tibble: 654 × 5
   geo      year ict_percentage ict_thousand_persons  nobs
   <fct>   <int>          <dbl>                <dbl> <int>
 1 Austria  2004            2.9                 106.    20
 2 Austria  2005            3.1                 116     20
 3 Austria  2006            3.3                 126     20
 4 Austria  2007            3.3                 128.    20
 5 Austria  2008            3.3                 133.    20
 6 Austria  2009            3.5                 139.    20
 7 Austria  2010            3.5                 141.    20
 8 Austria  2011            3.6                 145     20
 9 Austria  2012            3.6                 147     20
10 Austria  2013            3.7                 153.    20
# ℹ 644 more rows

To drop grouping, we need to manually call ungroup().

6.6 Grouped `mutate()` vs. `summarize()`

Action	`mutate()` with grouping	`summarize()`
Transformation	Per group	Per group
Assignment	One value per group	One value per group
Result rows	All rows	One row per group
Result columns	All columns	Grouped columns
Automated ungrouping	No	Yes

6.7 Other summarizing transformations

Summarizing transformations go beyond calculating the number of observations.

We can calculate means, medians, standard deviations, and other summary statistics with similar approaches.
Using summarize() and collapsing each group into a single row.
Or using mutate() and keeping all rows.

6.8 Basic summary statistics

For example, we can calculate the average percentage of employment in the ICT sector for each country using the mean() function.

Basic summary statistics

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(avg = mean(ict_percentage))

# A tibble: 37 × 2
   geo                      avg
   <fct>                  <dbl>
 1 Austria                 3.90
 2 Bosnia and Herzegovina  1.73
 3 Belgium                 4.44
 4 Bulgaria                2.82
 5 Switzerland             5.08
 6 Cyprus                  2.94
 7 Czechia                 3.99
 8 Germany                 3.88
 9 Denmark                 4.84
10 Estonia                 4.51
# ℹ 27 more rows

Observe that summarize works with base R functions like mean().

This is true for other base R functions like median(), sd(), min(), and max().

Thus, we can calculate summary statistics for each group in a dataset.

Basic summary statistics

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    nobs = dplyr::n(),
    min = min(ict_percentage),
    mean = mean(ict_percentage),
    median = median(ict_percentage),
    sd = sd(ict_percentage),
    max = max(ict_percentage)
  )

# A tibble: 37 × 7
   geo                     nobs   min  mean median    sd   max
   <fct>                  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
 1 Austria                   20   2.9  3.90   3.65 0.653   5.3
 2 Bosnia and Herzegovina     3   1.5  1.73   1.7  0.252   2  
 3 Belgium                   20   3.4  4.44   4.2  0.669   5.6
 4 Bulgaria                  20   2.2  2.82   2.5  0.587   4.3
 5 Switzerland               13   4.3  5.08   5    0.438   5.7
 6 Cyprus                    20   2.2  2.94   2.7  0.810   5.4
 7 Czechia                   20   3.4  3.99   3.95 0.439   4.6
 8 Germany                   20   3.1  3.88   3.7  0.550   5  
 9 Denmark                   20   4.1  4.84   4.8  0.481   5.9
10 Estonia                   20   2.5  4.51   4.1  1.29    6.7
# ℹ 27 more rows

6.9 Custom summary statistics

We can also use custom transformations.

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    custom1 = sd(ict_percentage) / median(ict_percentage),
    custom2 = 2 * mean(ict_percentage) - 1
  )

# A tibble: 37 × 3
   geo                    custom1 custom2
   <fct>                    <dbl>   <dbl>
 1 Austria                 0.179     6.81
 2 Bosnia and Herzegovina  0.148     2.47
 3 Belgium                 0.159     7.89
 4 Bulgaria                0.235     4.65
 5 Switzerland             0.0876    9.15
 6 Cyprus                  0.300     4.87
 7 Czechia                 0.111     6.98
 8 Germany                 0.149     6.76
 9 Denmark                 0.100     8.68
10 Estonia                 0.314     8.03
# ℹ 27 more rows

7 Slicing data

Another useful data transformation operation is slicing.

Similar to filtering, slicing can be used to select a subset of the data.
Most slicing operations can also be written as filtering operations.
But slicing can be more convenient than filtering on some occasions.

7.1 Slicing by position

For example, slicing can be used to get the first $n$ rows within a group.

ict |>
  dplyr::group_by(geo) |>
  dplyr::arrange(year) |>
  dplyr::slice_head(n=3)

# A tibble: 111 × 4
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons
   <fct>                  <int>          <dbl>                <dbl>
 1 Austria                 2004            2.9                106. 
 2 Austria                 2005            3.1                116  
 3 Austria                 2006            3.3                126  
 4 Bosnia and Herzegovina  2021            1.5                 17.6
 5 Bosnia and Herzegovina  2022            1.7                 19.3
 6 Bosnia and Herzegovina  2023            2                   24.3
 7 Belgium                 2004            3.4                143. 
 8 Belgium                 2005            3.5                147. 
 9 Belgium                 2006            3.7                156. 
10 Bulgaria                2004            2.2                 63.8
# ℹ 101 more rows

Slicing by position

Or to get the last $n$ rows within a group.

ict |>
  dplyr::group_by(geo) |>
  dplyr::arrange(year) |>
  dplyr::slice_tail(n=3)

# A tibble: 111 × 4
# Groups:   geo [37]
   geo                     year ict_percentage ict_thousand_persons
   <fct>                  <int>          <dbl>                <dbl>
 1 Austria                 2021            4.5                192. 
 2 Austria                 2022            5                  221. 
 3 Austria                 2023            5.3                237. 
 4 Bosnia and Herzegovina  2021            1.5                 17.6
 5 Bosnia and Herzegovina  2022            1.7                 19.3
 6 Bosnia and Herzegovina  2023            2                   24.3
 7 Belgium                 2021            5.6                272. 
 8 Belgium                 2022            5.6                278. 
 9 Belgium                 2023            5.4                273. 
10 Bulgaria                2021            3.5                108  
# ℹ 101 more rows

7.2 Other slicing operations

Besides slice_head() and slice_tail(), there are other slicing operations available in dplyr.

slice_sample(n = 1): Randomly selects $n$ rows from each group.
slice_min(n = 1): Returns the rows with the $n$ smallest values of a variable in a group.
slice_max(n = 1): Returns the rows with the $n$ largest values of a variable in a group.

7.3 Slicing vs. filtering

ict |>
  dplyr::group_by(geo) |>
  dplyr::filter(ict_percentage == max(ict_percentage))

ict |>
  dplyr::group_by(geo) |>
  dplyr::slice_max(ict_percentage)

8 Accessing columns

On certain occasions, we want to access a column of a data frame and perform some operations on it beyond the dplyr pipeline.

There are multiple ways we can access a column(s) of a data frame.

Using the pull() function from the dplyr package.
Using the $ operator.
Using the [[ operator.
Using the [ operator (for multiple columns).

8.1 Pulling a column

One approach, which we have already seen, is to use the pull() function from the dplyr package.

pulled_col <- gdp |>
  dplyr::pull(geo)

The variable pulled_col is a vector containing the values of the geo column of the gdp data frame.
Its length is equal to the number of rows of the gdp data frame.

Pulling a column

Its length is equal to the number of rows of the gdp data frame.

We can verify this by using the length() function from base R.

length(pulled_col) == nrow(gdp)

[1] TRUE

Verifying that pulled_col is identical to the geo column requires a bit more work.

8.2 Programming digression: `all` and `any`

In R, using one of the logical comparison operators, such as ==, to compare two vectors returns a logical vector.

v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

v1 <= v2

[1] TRUE TRUE TRUE

v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

v3 == v4

[1]  TRUE FALSE

Each element of the resulting vector is the result of the comparison of the corresponding elements of the input vectors.

Programming digression: `all` and `any`

Occasionally, we want to check if a logical condition is elementwise satisfied for all elements of two vectors.

The all() function from base R accepts a logical vector and returns TRUE if all elements are TRUE.

v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

all(v1 <= v2)

[1] TRUE

v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

all(v3 == v4)

[1] FALSE

Programming digression: `all` and `any`

The any() function from base R accepts a logical vector and returns TRUE if at least one element is TRUE.

v1 <- c(1, 2, 3)
v2 <- c(2, 4, 10)

any(v1 <= v2)

[1] TRUE

v3 <- c("hello", "earth")
v4 <- c("hello", "mars")

any(v3 == v4)

[1] TRUE

8.3 Accessing a column via `$`

The dollar operator $ is a special R operator acting on vectors, lists, and data frames to extract or replace parts.

In the case of data frames, we can use $ to directly access the values of a column as a vector.

For example,

gdp$geo

Accessing a column via `$`

We can now verify that the pulled_col extracted with pull() is identical to the geo column of gdp.

all(pulled_col == gdp$geo)

[1] TRUE

8.4 Accessing a column via `[[`

Another way to access list or data frame parts is by using the double bracket [[ indexing operator.

For example,

gdp[["geo"]]

Accessing a column via `[[`

We can also verify that all three ways we have seen give equivalent results.

all(pulled_col == gdp[["geo"]])

[1] TRUE

8.5 Accessing columns via `[`

What if we want to access multiple columns at once?

We already know that we can use dplyr’s select() to select multiple columns.
Additionally, we can use the [ operator to access multiple columns of a data frame.

Accessing columns via `[`

Additionally, we can use the [ operator to access multiple columns of a data frame.

gdp[c("geo", "year")]

# A tibble: 36 × 2
   geo                                    year
   <chr>                                 <int>
 1 Austria                                2023
 2 Belgium                                2023
 3 Bulgaria                               2023
 4 Switzerland                            2023
 5 Cyprus                                 2023
 6 Czechia                                2023
 7 Germany                                2023
 8 Denmark                                2023
 9 Euro area - 19 countries  (2015-2022)  2023
10 Euro area – 20 countries (from 2023)   2023
# ℹ 26 more rows

Accessing columns via `[`

Additionally, we can use the [ operator to access multiple columns of a data frame.

We can verify that the result is equivalent to using select().

all(
  gdp[c("geo", "year")] == gdp |> dplyr::select(geo, year)
)

[1] TRUE

Accessing columns via `[`

Additionally, we can use the [ operator to access multiple columns of a data frame.

The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.

gdp["geo"]

# A tibble: 36 × 1
   geo                                  
   <chr>                                
 1 Austria                              
 2 Belgium                              
 3 Bulgaria                             
 4 Switzerland                          
 5 Cyprus                               
 6 Czechia                              
 7 Germany                              
 8 Denmark                              
 9 Euro area - 19 countries  (2015-2022)
10 Euro area – 20 countries (from 2023) 
# ℹ 26 more rows

Accessing columns via `[`

Additionally, we can use the [ operator to access multiple columns of a data frame.
The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.

Caution is advised when comparing against a vector:

all(gdp["geo"] == gdp$geo)

[1] TRUE

length(gdp["geo"]) == length(gdp$geo)

[1] FALSE

dim(gdp["geo"]) == dim(gdp$geo)

logical(0)

8.6 Programming digression: set operations

We have already seen a method of getting the distinct values of a column via dplyr’s distinct().

We can achieve the same result using unique() from base R.

gdp_geo <- unique(gdp$geo)

And we can check that the two approaches give the same result.

y <- gdp |> dplyr::distinct(geo) |> dplyr::pull()
all(y == gdp_geo)

[1] TRUE

Programming digression: set operations

When working with vectors, it can be helpful to have some basic set operations in mind.

One of the commonly used operations is checking whether a vector contains an element.
This can be syntactically expressed using the %in% operation.

"USA" %in% gdp_geo

[1] FALSE

"Germany" %in% gdp_geo

[1] TRUE

Programming digression: set operations

We can take this one step further and examine whether all values in a vector belong in another vector.

In terms of set jargon, we want to check if a set $S_{1}$ is a subset of another set $S_{2}$.
This is true if and only if all elements of $S_{1}$ belong in $S_{2}$.

Programming digression: set operations

Using %in% with two vectors returns a new boolean vector.

x <- c("USA", "Germany")
x %in% gdp_geo

[1] FALSE  TRUE

The resulting vector has a length equal to the length of the vector on the left-hand side of %in%.
Using %in% with a vector on the left-hand side checks elementwise if the values of the left vector can be found in the values of the right vector.

Programming digression: set operations

With this in mind, we can combine %in% with all to examine if a set is a subset of another.

x <- c("USA", "Germany")
all(x %in% gdp_geo)

[1] FALSE

How can we figure out which values of the left vector do not appear on the right?

Programming digression: set operations

How can we figure out which values of the left vector do not appear on the right?

Base R has a function setdiff() that does exactly this.

x <- c("USA", "Germany")
setdiff(x, gdp_geo)

[1] "USA"

Calling setdiff() returns the values of the first argument that do not appear in the second argument.

Programming digression: set operations

The order of the arguments is important in this case because the set difference operation is not symmetric.

x <- c("USA", "Germany")
y <- c("France", "Germany")
setdiff(x, y)

[1] "USA"

x <- c("USA", "Germany")
y <- c("France", "Germany")
setdiff(y, x)

[1] "France"

Programming digression: set operations

We use the intersect() function to find the common elements of two vectors.

Unlike setdiff(), the order of the arguments does not matter because the intersection operation is symmetric.

x <- c("USA", "Germany")
y <- c("France", "Germany")
intersect(x, y)

[1] "Germany"

x <- c("USA", "Germany")
y <- c("France", "Germany")
intersect(y, x)

[1] "Germany"

Programming digression: set operations

To get the combination of unique values, we use union().

Unions are symmetric, so any order of the arguments gives the same elements.
The order of the elements can be different, though.

x <- c("USA", "Germany")
y <- c("France", "Germany")
union(x, y)

[1] "USA"     "Germany" "France"

x <- c("USA", "Germany")
y <- c("France", "Germany")
union(y, x)

[1] "France"  "Germany" "USA"

9 Combining data

So far, we have created two data frames.

gdp: Contains growth rates and output data for EU and (some) non-EU countries.
ict: Contains ICT employment data for EU and (some) non-EU countries.

9.1 Data sources in the wild

We have arrived at one of the most important topics in data analysis.

While in most educational settings, data is provided in a single unit, in real-world scenarios, data is often scattered across multiple sources.
Much of the time in real-world data analysis work is spent on combining information from different sources.

Data sources in the wild

The process of combining data sources takes many names in the wild.

Sometimes combining data is referred to as merging, joining, or linking.
On some occasions, binding, appending, and concatenating are also used.
Even worse, the jargon is not unified across fields, and the same term can be used with different meanings.

Data sources in the wild

We will follow the join wording for two reasons:
- First, it can precisely define the exact type of data combination we want to perform.
- Second, it is synergetic with the semantics of Structured Query Language (SQL).

We examine a motivating problem.

9.2 Combining the `gdp` and `ict` data frames

We examine the columns of the gdp and ict data frames.

We can do this using the names() function from base R.

names(gdp)

[1] "geo"    "year"   "flag"   "output" "growth" "state"  "income"

names(ict)

[1] "geo"                  "year"                 "ict_percentage"      
[4] "ict_thousand_persons"

We observe that the columns geo and year are common in both data frames.

Combining the `gdp` and `ict` data frames

We observe that the columns geo and year are common in both data frames.

The columns geo and year are common, but their underlying values are not identical.

Combining the `gdp` and `ict` data frames

We observe that the columns geo and year are common in both data frames.
The columns geo and year are common, but their underlying values are not identical.

For instance, the two geo columns differences are:

setdiff(unique(gdp$geo), unique(ict$geo))

[1] "Euro area - 19 countries  (2015-2022)"
[2] "Euro area – 20 countries (from 2023)"

setdiff(unique(ict$geo), unique(gdp$geo))

[1] "Bosnia and Herzegovina" "North Macedonia"        "United Kingdom"

Combining the `gdp` and `ict` data frames

This already raises a few questions when combining the two data frames.

Do we want to keep all the rows of the gdp data frame?
Do we want to keep all the rows of the ict data frame?
Do we want to keep only the matching rows of the two data frames?
Do we want to keep all the rows of both data frames?

9.3 Combining `gdp` and `ict`: First approach

We want to keep all the rows of the gdp data frame.

And add two new columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.
How can we handle the rows in gdp that do not have a corresponding row in ict?

Combining `gdp` and `ict`: First approach

We want to keep all the rows of the gdp data frame.

And add two new columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.

If the geo and year values of a row in gdp are equal to the geo and year values of a row in ict, then copy the values.

If the geo and year values of a row in gdp cannot be found in ict, then assign NA.

Combining `gdp` and `ict`: First approach

We can achieve this using the left_join() function from the dplyr package.

gdp |>
  dplyr::left_join(ict)

In its most basic use case, left_join() takes two data frames, x and y.
It then returns a joined data frame that maintains all the rows and columns of x and adds the columns of y that do not exist in x.
The matching is done based on the common columns of the two data frames.

Combining `gdp` and `ict`: First approach

We can achieve this using the left_join() function from the dplyr package.

gdp |>
  dplyr::left_join(ict)

Joining with `by = join_by(geo, year)`

# A tibble: 36 × 9
   geo                      year flag  output growth state income ict_percentage
   <chr>                   <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria                  2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium                  2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria                 2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland              2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus                   2023 p      29080    1   TRUE  middle            5.4
 6 Czechia                  2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany                  2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark                  2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Euro area - 19 countri…  2023 <NA>   32340   -0.3 FALSE middle           NA  
10 Euro area – 20 countri…  2023 <NA>   32150   -0.2 FALSE middle           NA  
# ℹ 26 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

9.4 Combining `gdp` and `ict`: First approach verification

Let us verify that the left_join() function works as expected.

First, we can check the dimensions of the resulting data frame using the dim() function.

gdp_ict_lj <- gdp |>
  dplyr::left_join(ict)

Joining with `by = join_by(geo, year)`

dim(gdp_ict_lj)

[1] 36  9

dim(gdp)

[1] 36  7

Combining `gdp` and `ict`: First approach verification

Let us verify that the left_join() function works as expected.

Second, we check that the resulting geo and year columns contain exactly the values of the gdp data frame.

all(gdp_ict_lj[c("geo", "year")] == gdp[c("geo", "year")])

[1] TRUE

Combining `gdp` and `ict`: First approach verification

Let us verify that the left_join() function works as expected.

Third, we check that the resulting ict_percentage and ict_thousand_persons columns are NA for the rows that do not have a corresponding row in the ict data frame.

To check this, we need to calculate the set difference between the unique combinations of geo and year in the gdp and ict data frames.

However, setdiff() works with vectors, not data frames.

An easy way to combine these columns is to use the paste() function.

Among other things, paste() accepts two or more vectors, converts them to strings, and concatenates them element-wise.

v1 <- c("Hello", "Hi")
v2 <- c("Earth", "Jupiter")
paste(v1, v2)

[1] "Hello Earth" "Hi Jupiter"

Combining `gdp` and `ict`: First approach verification

Let us verify that the left_join() function works as expected.

gdp_ict_lj |>
  dplyr::filter(is.na(ict_thousand_persons))

# A tibble: 3 × 9
  geo                       year flag  output growth state income ict_percentage
  <chr>                    <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
1 Euro area - 19 countrie…  2023 <NA>   32340   -0.3 FALSE middle             NA
2 Euro area – 20 countrie…  2023 <NA>   32150   -0.2 FALSE middle             NA
3 Montenegro                2023 p       6900    3.7 TRUE  low                NA
# ℹ 1 more variable: ict_thousand_persons <dbl>

setdiff(paste(gdp$geo, gdp$year), paste(ict$geo, ict$year))

[1] "Euro area - 19 countries  (2015-2022) 2023"
[2] "Euro area – 20 countries (from 2023) 2023" 
[3] "Montenegro 2023"

9.5 Combining `gdp` and `ict`: Second approach

We want to keep all the rows of the ict data frame.

And add five new columns flag, output, growth, state, and income with the values from the gdp data frame.

Combining `gdp` and `ict`: Second approach

We want to keep all the rows of the ict data frame.

We can formulate this problem in terms of a left join by swapping gdp and ict in our previous approach.

Combining `gdp` and `ict`: Second approach

We want to keep all the rows of the ict data frame.

ict |>
  dplyr::left_join(gdp)

Joining with `by = join_by(geo, year)`

# A tibble: 654 × 9
   geo      year ict_percentage ict_thousand_persons flag  output growth state
   <chr>   <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria  2004            2.9                 106. <NA>      NA     NA NA   
 2 Austria  2005            3.1                 116  <NA>      NA     NA NA   
 3 Austria  2006            3.3                 126  <NA>      NA     NA NA   
 4 Austria  2007            3.3                 128. <NA>      NA     NA NA   
 5 Austria  2008            3.3                 133. <NA>      NA     NA NA   
 6 Austria  2009            3.5                 139. <NA>      NA     NA NA   
 7 Austria  2010            3.5                 141. <NA>      NA     NA NA   
 8 Austria  2011            3.6                 145  <NA>      NA     NA NA   
 9 Austria  2012            3.6                 147  <NA>      NA     NA NA   
10 Austria  2013            3.7                 153. <NA>      NA     NA NA   
# ℹ 644 more rows
# ℹ 1 more variable: income <fct>

Combining `gdp` and `ict`: Second approach

We want to keep all the rows of the ict data frame.

An equivalent way to achieve this is to use the right_join() function.

Combining `gdp` and `ict`: Second approach

We want to keep all the rows of the ict data frame.

gdp |>
  dplyr::right_join(ict)

Joining with `by = join_by(geo, year)`

# A tibble: 654 × 9
   geo          year flag  output growth state income ict_percentage
   <chr>       <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria      2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium      2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria     2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland  2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus       2023 p      29080    1   TRUE  middle            5.4
 6 Czechia      2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany      2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark      2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Estonia      2023 <NA>   15250   -5.4 TRUE  middle            6.7
10 Greece       2023 p      19460    2.6 TRUE  middle            2.4
# ℹ 644 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining `gdp` and `ict`: Second approach

We want to keep all the rows of the ict data frame.

Observe, however, that the order of the rows and columns resulting from a left_join() and a right_join() are different.

Both left_join() and right_join() maintain the order of the rows and columns of the x argument.

Combining `gdp` and `ict`: Third approach

We want to keep only the matching rows of the two data frames.

This approach is the most common type of join operation.
The left and right join operations introduce NA values in the resulting data frame if there are non-matching rows.
Statistical methods often do not work with NA values.
Thus, practically, we often want to keep only the matching rows.

Combining `gdp` and `ict`: Third approach

We want to keep only the matching rows of the two data frames.

We can keep only the matching rows using the inner_join() function.

Combining `gdp` and `ict`: Third approach

We want to keep only the matching rows of the two data frames.

gdp |>
  dplyr::inner_join(ict)

Joining with `by = join_by(geo, year)`

# A tibble: 33 × 9
   geo          year flag  output growth state income ict_percentage
   <chr>       <int> <fct>  <dbl>  <dbl> <lgl> <fct>           <dbl>
 1 Austria      2023 <NA>   37860   -1.8 TRUE  high              5.3
 2 Belgium      2023 p      37310    0.4 TRUE  middle            5.4
 3 Bulgaria     2023 <NA>    7900    2.2 TRUE  low               4.3
 4 Switzerland  2023 p      63870   -0.8 TRUE  high              5.7
 5 Cyprus       2023 p      29080    1   TRUE  middle            5.4
 6 Czechia      2023 <NA>   18480   -1.2 TRUE  middle            4.3
 7 Germany      2023 p      36290   -1.1 TRUE  middle            4.9
 8 Denmark      2023 <NA>   52510    1.8 TRUE  high              5.9
 9 Estonia      2023 <NA>   15250   -5.4 TRUE  middle            6.7
10 Greece       2023 p      19460    2.6 TRUE  middle            2.4
# ℹ 23 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>

Combining `gdp` and `ict`: Third approach

We want to keep only the matching rows of the two data frames.

Similar to the left_join() and right_join() functions, the inner_join() function maintains the order of the rows and columns of the x argument.

Combining `gdp` and `ict`: Third approach

We want to keep only the matching rows of the two data frames.

ict |>
  dplyr::inner_join(gdp)

Joining with `by = join_by(geo, year)`

# A tibble: 33 × 9
   geo        year ict_percentage ict_thousand_persons flag  output growth state
   <chr>     <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria    2023            5.3                237.  <NA>   37860   -1.8 TRUE 
 2 Belgium    2023            5.4                273.  p      37310    0.4 TRUE 
 3 Bulgaria   2023            4.3                126.  <NA>    7900    2.2 TRUE 
 4 Switzerl…  2023            5.7                273   p      63870   -0.8 TRUE 
 5 Cyprus     2023            5.4                 24.7 p      29080    1   TRUE 
 6 Czechia    2023            4.3                218.  <NA>   18480   -1.2 TRUE 
 7 Germany    2023            4.9               2108.  p      36290   -1.1 TRUE 
 8 Denmark    2023            5.9                177.  <NA>   52510    1.8 TRUE 
 9 Estonia    2023            6.7                 46.5 <NA>   15250   -5.4 TRUE 
10 Greece     2023            2.4                100.  p      19460    2.6 TRUE 
# ℹ 23 more rows
# ℹ 1 more variable: income <fct>

Combining `gdp` and `ict`: Fourth approach

We want to keep all the rows of both data frames.

We can keep all the rows of both data frames using the full_join() function.

Combining `gdp` and `ict`: Fourth approach

We want to keep all the rows of both data frames.

ict |>
  dplyr::full_join(gdp)

Joining with `by = join_by(geo, year)`

# A tibble: 657 × 9
   geo      year ict_percentage ict_thousand_persons flag  output growth state
   <chr>   <int>          <dbl>                <dbl> <fct>  <dbl>  <dbl> <lgl>
 1 Austria  2004            2.9                 106. <NA>      NA     NA NA   
 2 Austria  2005            3.1                 116  <NA>      NA     NA NA   
 3 Austria  2006            3.3                 126  <NA>      NA     NA NA   
 4 Austria  2007            3.3                 128. <NA>      NA     NA NA   
 5 Austria  2008            3.3                 133. <NA>      NA     NA NA   
 6 Austria  2009            3.5                 139. <NA>      NA     NA NA   
 7 Austria  2010            3.5                 141. <NA>      NA     NA NA   
 8 Austria  2011            3.6                 145  <NA>      NA     NA NA   
 9 Austria  2012            3.6                 147  <NA>      NA     NA NA   
10 Austria  2013            3.7                 153. <NA>      NA     NA NA   
# ℹ 647 more rows
# ℹ 1 more variable: income <fct>

1 Prerequisites

Prerequisites

1.1 Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

Programming digression: R namespaces

2 Creating boolean variables

2.1 Storing an intermediate result

Storing an intermediate result

2.2 Adding the state boolean variable

Adding the state boolean variable

Adding the state boolean variable

Adding the state boolean variable

Adding the state boolean variable

Adding the state boolean variable

Adding the state boolean variable

2.3 Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

Programming digression: Combining logical conditions

3 Discretizing continuous variables

3.1 Adding the income categorical variable: Attempt 1 (non-reusable)

Adding the income categorical variable: Attempt 1 (non-reusable)

Adding the income categorical variable: Attempt 1 (non-reusable)

Adding the income categorical variable: Attempt 1 (non-reusable)

Adding the income categorical variable: Attempt 1 (non-reusable)

Adding the income categorical variable: Attempt 1 (non-reusable)

3.2 Using quantile()

Using quantile()

Using quantile()

Using quantile()

Using quantile()

Using quantile()

Using quantile()

3.3 Using cut()

Using cut()

Using cut()

Using cut()

Using cut()

Using cut()

Using cut()

Using cut()

3.4 Adding the income categorical variable: Attempt 2 (reusable)

Adding the income categorical variable: Attempt 2 (reusable)

Adding the income categorical variable: Attempt 2 (reusable)

4 A summary of the GDP data frame transformations

4.1 The gdp data frame

The gdp data frame

The gdp data frame

The gdp data frame

The gdp data frame

5 The ict data frame

5.1 A view at the ict data

5.2 The ict times series

5.3 Counting observations

Counting observations

Counting observations

Counting observations

Counting observations

6 Group-Wise operations

6.1 Manually replicating count()

6.2 Using group_by()

Using group_by()

6.3 Transformations per group

Transformations per group

6.4 Using summarize()

Using summarize()

Using summarize()

1.1 Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

Programming digression: `R` namespaces

2.2 Adding the `state` boolean variable

Adding the `state` boolean variable

Adding the `state` boolean variable

Adding the `state` boolean variable

Adding the `state` boolean variable

Adding the `state` boolean variable

Adding the `state` boolean variable

3.1 Adding the `income` categorical variable: Attempt 1 (non-reusable)

Adding the `income` categorical variable: Attempt 1 (non-reusable)

Adding the `income` categorical variable: Attempt 1 (non-reusable)

Adding the `income` categorical variable: Attempt 1 (non-reusable)

Adding the `income` categorical variable: Attempt 1 (non-reusable)

Adding the `income` categorical variable: Attempt 1 (non-reusable)

3.2 Using `quantile()`

Using `quantile()`

Using `quantile()`

Using `quantile()`

Using `quantile()`

Using `quantile()`

Using `quantile()`

3.3 Using `cut()`

Using `cut()`

Using `cut()`

Using `cut()`

Using `cut()`

Using `cut()`

Using `cut()`

Using `cut()`

3.4 Adding the `income` categorical variable: Attempt 2 (reusable)

Adding the `income` categorical variable: Attempt 2 (reusable)

Adding the `income` categorical variable: Attempt 2 (reusable)

4.1 The `gdp` data frame

The `gdp` data frame

The `gdp` data frame

The `gdp` data frame

The `gdp` data frame

5 The `ict` data frame

5.1 A view at the `ict` data

5.2 The `ict` times series

6.1 Manually replicating `count()`

6.2 Using `group_by()`

Using `group_by()`

6.4 Using `summarize()`

Using `summarize()`

Using `summarize()`

Using `summarize()`

Using `summarize()`

Using `summarize()`

Using `summarize()`

6.5 Using `mutate()` with groups

Using `mutate()` with groups

Using `mutate()` with groups

Using `mutate()` with groups

6.6 Grouped `mutate()` vs. `summarize()`

8.2 Programming digression: `all` and `any`

Programming digression: `all` and `any`

Programming digression: `all` and `any`

8.3 Accessing a column via `$`

Accessing a column via `$`

8.4 Accessing a column via `[[`

Accessing a column via `[[`

8.5 Accessing columns via `[`

Accessing columns via `[`

Accessing columns via `[`

Accessing columns via `[`

Accessing columns via `[`

9.2 Combining the `gdp` and `ict` data frames

Combining the `gdp` and `ict` data frames

Combining the `gdp` and `ict` data frames