A follow-up look at data transformation topics: boolean columns, discretizing continuous variables, counting observations, groupwise operations, summarizing, slicing operations, accessing columns, and combining data with joins.
May 29th, 2025
dplyr in more detail.

R namespaces

The code importing the sdg data slightly differs from the code we used in the data import overview topic.
We have prefixed the readr package's functions!
read_csv \(\rightarrow\) readr::read_csv
col_factor \(\rightarrow\) readr::col_factor
col_integer \(\rightarrow\) readr::col_integer
col_datetime \(\rightarrow\) readr::col_datetime
This way, we do not need to load the readr package!
Each package in R has its own namespace.
When we load a package with library(), all the functions and objects in the package's namespace become available globally.
For example, after calling library(dplyr), we can use select(), filter(), etc., without specifying the package.
Why should we prefer readr::read_csv instead of read_csv?
On the one hand, read_csv is shorter and easier to type.
On the other hand, readr::read_csv directly informs the reader where the read_csv function comes from.
Moreover, R reports masked objects (e.g., filter and lag from the stats package) when we load the dplyr package for the first time in an R session.
If, instead of loading dplyr, we use its functions with the dplyr:: prefix, we avoid masking functions from other namespaces.

We have met mutate() when working with the eu_ict dataset.
We have used mutate() to modify the values of the unit column.
We can also use mutate() to create new columns.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

# A tibble: 36 × 5
geo year flag output growth
<chr> <int> <chr> <dbl> <dbl>
1 Austria 2023 <NA> 37860 -1.8
2 Belgium 2023 p 37310 0.4
3 Bulgaria 2023 <NA> 7900 2.2
4 Switzerland 2023 p 63870 -0.8
5 Cyprus 2023 p 29080 1
6 Czechia 2023 <NA> 18480 -1.2
7 Germany 2023 p 36290 -1.1
8 Denmark 2023 <NA> 52510 1.8
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2
# ℹ 26 more rows
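To see on a small scale what the recoding and pivoting steps do, and how the `::` prefix lets us call functions without `library()`, here is a minimal sketch. The `toy` data frame and its values are hypothetical and only mimic the long format of the sdg data.

```r
# Hypothetical long-format data: one row per country and unit.
toy <- data.frame(
  unit = c("percentage change", "euro per capita",
           "percentage change", "euro per capita"),
  geo = c("A", "A", "B", "B"),
  OBS_VALUE = c(1.5, 30000, -0.5, 20000)
)

wide <- toy |>
  # Recode the unit labels into two short names.
  dplyr::mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  # Spread the two units into two separate columns.
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE)

wide
```

Each country now occupies a single row, with one column per unit.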
Creating a state boolean variable

The geo column contains both states and coalitions of states.
We can use pull() from dplyr to examine the values of geo.
pull() works well with the pipe operator and, hence, stacks well together with other dplyr transformations, such as distinct() and arrange().

 [1] "Austria"
[2] "Belgium"
[3] "Bulgaria"
[4] "Croatia"
[5] "Cyprus"
[6] "Czechia"
[7] "Denmark"
[8] "Estonia"
[9] "Euro area - 19 countries (2015-2022)"
[10] "Euro area – 20 countries (from 2023)"
[11] "European Union - 27 countries (from 2020)"
[12] "Finland"
[13] "France"
[14] "Germany"
[15] "Greece"
[16] "Hungary"
[17] "Iceland"
[18] "Ireland"
[19] "Italy"
[20] "Latvia"
[21] "Lithuania"
[22] "Luxembourg"
[23] "Malta"
[24] "Montenegro"
[25] "Netherlands"
[26] "Norway"
[27] "Poland"
[28] "Portugal"
[29] "Romania"
[30] "Serbia"
[31] "Slovakia"
[32] "Slovenia"
[33] "Spain"
[34] "Sweden"
[35] "Switzerland"
[36] "Türkiye"
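A listing like the one above can be produced by stacking distinct(), arrange(), and pull(). The following is a sketch on a hypothetical `toy` data frame, not the exact code behind the output.

```r
# Hypothetical data with a repeated country.
toy <- data.frame(geo = c("Cyprus", "Austria", "Cyprus", "Belgium"))

values <- toy |>
  dplyr::distinct(geo) |>   # keep each value once
  dplyr::arrange(geo) |>    # sort alphabetically
  dplyr::pull(geo)          # extract the column as a vector

values
```

The result is a plain character vector rather than a data frame.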
We want to create a state binary variable with the value TRUE if the observation corresponds to a state, and the value FALSE otherwise.
Since the names of all coalitions contain the string "Euro", we can easily create the new boolean variable using grepl.

# A tibble: 36 × 6
geo year flag output growth state
<chr> <int> <chr> <dbl> <dbl> <lgl>
1 Austria 2023 <NA> 37860 -1.8 TRUE
2 Belgium 2023 p 37310 0.4 TRUE
3 Bulgaria 2023 <NA> 7900 2.2 TRUE
4 Switzerland 2023 p 63870 -0.8 TRUE
5 Cyprus 2023 p 29080 1 TRUE
6 Czechia 2023 <NA> 18480 -1.2 TRUE
7 Germany 2023 p 36290 -1.1 TRUE
8 Denmark 2023 <NA> 52510 1.8 TRUE
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3 FALSE
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2 FALSE
# ℹ 26 more rows
By default, mutate() adds new columns at the end of the data frame.
To place state right after geo, we can pass .after to mutate().

# A tibble: 36 × 6
geo state year flag output growth
<chr> <lgl> <int> <chr> <dbl> <dbl>
1 Austria TRUE 2023 <NA> 37860 -1.8
2 Belgium TRUE 2023 p 37310 0.4
3 Bulgaria TRUE 2023 <NA> 7900 2.2
4 Switzerland TRUE 2023 p 63870 -0.8
5 Cyprus TRUE 2023 p 29080 1
6 Czechia TRUE 2023 <NA> 18480 -1.2
7 Germany TRUE 2023 p 36290 -1.1
8 Denmark TRUE 2023 <NA> 52510 1.8
9 Euro area - 19 countries (2015-2022) FALSE 2023 <NA> 32340 -0.3
10 Euro area – 20 countries (from 2023) FALSE 2023 <NA> 32150 -0.2
# ℹ 26 more rows
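The placement of the new column can be sketched as follows. The `toy` data frame is hypothetical; the call mirrors the state definition used throughout this topic.

```r
# Hypothetical data with two states and one coalition.
toy <- data.frame(
  geo = c("Austria", "Euro area - 19 countries (2015-2022)", "Belgium"),
  output = c(37860, 32340, 37310)
)

with_state <- toy |>
  # .after = geo places the new column right after geo.
  dplyr::mutate(state = !grepl("Euro", geo), .after = geo)

names(with_state)
```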
The ! before grepl()

Notice the ! before grepl().
Logical conditions in R (as in most, if not all, programming languages) can be combined or negated.
The ! operator negates the logical value of the condition.
grepl("Euro", geo) returns TRUE if the string "Euro" is found in the geo column and FALSE otherwise.
We want state to be TRUE when geo is a state, i.e., when "Euro" is not found in the geo column.
Hence, we negate grepl("Euro", geo).

Suppose we want a variable that is TRUE when geo is a state and output is above 50,000.
We check that geo is a state with !grepl("Euro", geo).
We check that output is above 50,000 with output > 50000.
We want the new variable to be TRUE when geo is a state and output is above 50,000.
Combining the two conditions with the logical AND operator gives us the desired condition.
In R, the logical AND operator is &.

# A tibble: 36 × 6
geo year flag output growth rich_state
<chr> <int> <chr> <dbl> <dbl> <lgl>
1 Austria 2023 <NA> 37860 -1.8 FALSE
2 Belgium 2023 p 37310 0.4 FALSE
3 Bulgaria 2023 <NA> 7900 2.2 FALSE
4 Switzerland 2023 p 63870 -0.8 TRUE
5 Cyprus 2023 p 29080 1 FALSE
6 Czechia 2023 <NA> 18480 -1.2 FALSE
7 Germany 2023 p 36290 -1.1 FALSE
8 Denmark 2023 <NA> 52510 1.8 TRUE
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3 FALSE
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2 FALSE
# ℹ 26 more rows
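Negation and the AND operator can be tried out on plain vectors in base R. The vectors below are hypothetical and only reuse a few values from the table above.

```r
# Hypothetical vectors with two states and one coalition.
geo <- c("Switzerland", "Austria", "Euro area - 19 countries (2015-2022)")
output <- c(63870, 37860, 32340)

has_euro <- grepl("Euro", geo)   # TRUE where "Euro" is found
is_state <- !has_euro            # negation flips each value

# & combines the two conditions element by element.
rich_state <- is_state & output > 50000
rich_state
```

Only the first element satisfies both conditions.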
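Conditions can also be combined with OR operators.
In R, logical OR is denoted with |.
Suppose we want a variable that is TRUE when geo is either Germany or Cyprus.
We check that geo is Germany with geo == "Germany".
We check that geo is Cyprus with geo == "Cyprus".
The combined condition is geo == "Germany" | geo == "Cyprus".

# A tibble: 36 × 6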
geo year flag output growth ger_or_cy
<chr> <int> <chr> <dbl> <dbl> <lgl>
1 Austria 2023 <NA> 37860 -1.8 FALSE
2 Belgium 2023 p 37310 0.4 FALSE
3 Bulgaria 2023 <NA> 7900 2.2 FALSE
4 Switzerland 2023 p 63870 -0.8 FALSE
5 Cyprus 2023 p 29080 1 TRUE
6 Czechia 2023 <NA> 18480 -1.2 FALSE
7 Germany 2023 p 36290 -1.1 TRUE
8 Denmark 2023 <NA> 52510 1.8 FALSE
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3 FALSE
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2 FALSE
# ℹ 26 more rows
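The OR condition can be checked on a small hypothetical vector. As a side note, the same result can be obtained with %in%, which scales better when there are many candidate values.

```r
# Hypothetical vector of country names.
geo <- c("Germany", "Cyprus", "Austria")

# | is TRUE when at least one of the two comparisons is TRUE.
ger_or_cy <- geo == "Germany" | geo == "Cyprus"

# Equivalent membership test.
ger_or_cy_alt <- geo %in% c("Germany", "Cyprus")

ger_or_cy
```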
We used an income variable in the visualization overview topic to color and shape various geometric objects.
The income variable was a categorical variable with three levels: low, middle, and high.

Creating the income categorical variable: Attempt 1 (non-reusable)

We want to create a categorical variable income with three levels: low, middle, and high.
We classify the observations based on two output thresholds.
Observations at or below the lower threshold have their output classified as low.
Observations above the upper threshold have their output classified as high.
We can use nested ifelse() statements to create the income variable.
We use 15200 and 37400 as the lower and upper output thresholds, respectively.

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = ifelse(
      output > 37400,
      "high",
      ifelse(output > 15200, "middle", "low")
    )
  )

# A tibble: 36 × 7
geo year flag output growth state income
<chr> <int> <chr> <dbl> <dbl> <lgl> <chr>
1 Austria 2023 <NA> 37860 -1.8 TRUE high
2 Belgium 2023 p 37310 0.4 TRUE middle
3 Bulgaria 2023 <NA> 7900 2.2 TRUE low
4 Switzerland 2023 p 63870 -0.8 TRUE high
5 Cyprus 2023 p 29080 1 TRUE middle
6 Czechia 2023 <NA> 18480 -1.2 TRUE middle
7 Germany 2023 p 36290 -1.1 TRUE middle
8 Denmark 2023 <NA> 52510 1.8 TRUE high
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3 FALSE middle
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2 FALSE middle
# ℹ 26 more rows
We have created the income variable.
However, the thresholds are hard-coded, so the code cannot be reused if the data change.
A reusable alternative uses the cut() and quantile() functions.
An observation is classified as low if its output is below the 25th percentile.
An observation is classified as high if its output is above the 75th percentile.

quantile()

We can learn how to use the quantile() function by accessing its documentation.

Usage:
quantile(x, ...)
## Default S3 method:
quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
names = TRUE, type = 7, digits = 7, ...)
Arguments:
x: numeric vector whose sample quantiles are wanted, or an
object of a class for which a method has been defined (see
also ‘details’). ‘NA’ and ‘NaN’ values are not allowed in
numeric vectors unless ‘na.rm’ is ‘TRUE’.
probs: numeric vector of probabilities with values in [0,1].
(Values up to ‘2e-14’ outside that range are accepted and
moved to the nearby endpoint.)
The x argument in our case is the output variable.
We can use the probs argument to specify the percentiles we want to use.
The default value of the probs argument is to set the quartile probabilities \([0, 0.25, 0.5, 0.75, 1]\) (why?).
We do not need the median, so we set probs = c(0, 0.25, 0.75, 1).
The resulting quantiles are the output thresholds we are after.
Next, we need to determine in which of the three induced bins each value of output belongs.

cut()

This can be done with the cut() function.
The cut() function expects a numeric vector and a set of breaks.

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]
[4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04] (6.5e+03,1.52e+04]
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04]
[25] (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] (6.5e+03,1.52e+04] (1.52e+04,3.74e+04]
[31] (6.5e+03,1.52e+04] <NA> (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (6.5e+03,1.52e+04]
Levels: (6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
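The code that produced the output above is not shown, so here is a minimal sketch of the same two steps on a hypothetical vector `x`. It assumes the default settings of cut().

```r
# Hypothetical values standing in for the output column.
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)

# Thresholds: minimum, 25th percentile, 75th percentile, maximum.
breaks <- quantile(x, probs = c(0, 0.25, 0.75, 1))

# Assign each value to one of the three bins induced by the breaks.
bins <- cut(x, breaks = breaks)

# Count the values per bin, including missing assignments.
table(bins, useNA = "ifany")
```

As in the output above, one value is not assigned to any bin.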
Notice that one observation is assigned an NA value.

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE
)

 [1] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]
[4] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[7] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[10] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[13] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04]
[16] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04] [6.5e+03,1.52e+04]
[19] (3.74e+04,8.33e+04] (3.74e+04,8.33e+04] (1.52e+04,3.74e+04]
[22] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04]
[25] [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
[28] (3.74e+04,8.33e+04] [6.5e+03,1.52e+04] (1.52e+04,3.74e+04]
[31] [6.5e+03,1.52e+04] [6.5e+03,1.52e+04] (3.74e+04,8.33e+04]
[34] (1.52e+04,3.74e+04] (1.52e+04,3.74e+04] [6.5e+03,1.52e+04]
Levels: [6.5e+03,1.52e+04] (1.52e+04,3.74e+04] (3.74e+04,8.33e+04]
We can instruct cut() to include the lower bound in the first bin by setting include.lowest = TRUE.
This resolves the NA issue (why?).

cut(
  x = sdg_temp$output,
  breaks = quantile(sdg_temp$output, probs = c(0, 0.25, 0.75, 1)),
  include.lowest = TRUE,
  labels = c("low", "middle", "high")
)

 [1] high middle low high middle middle middle high middle middle
[11] middle middle middle middle middle middle low low high high
[21] middle middle high low low middle high high low middle
[31] low low high middle middle low
Levels: low middle high
We can also instruct cut() to label the bins as low, middle, and high by setting the labels argument.

Creating the income categorical variable: Attempt 2 (reusable)

By combining the cut() function with the quantile() function, we can create the income variable.
The code keeps working even if the output variable changes.
More generally, cut() and quantile() can be used to discretize continuous variables into categorical ones based on empirical percentiles.

sdg_temp |>
  mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

# A tibble: 36 × 7
geo year flag output growth state income
<chr> <int> <chr> <dbl> <dbl> <lgl> <fct>
1 Austria 2023 <NA> 37860 -1.8 TRUE high
2 Belgium 2023 p 37310 0.4 TRUE middle
3 Bulgaria 2023 <NA> 7900 2.2 TRUE low
4 Switzerland 2023 p 63870 -0.8 TRUE high
5 Cyprus 2023 p 29080 1 TRUE middle
6 Czechia 2023 <NA> 18480 -1.2 TRUE middle
7 Germany 2023 p 36290 -1.1 TRUE middle
8 Denmark 2023 <NA> 52510 1.8 TRUE high
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3 FALSE middle
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2 FALSE middle
# ℹ 26 more rows
We have gradually transformed the estat_sdg_08_10_en data to create the gdp data frame.
Let us gather the code that creates the gdp data frame, from import to final structure, in a single place.

The gdp data frame

gdp <- readr::read_csv(
file = "data/estat_sdg_08_10_en.csv",
col_types = list(
unit = readr::col_factor(),
TIME_PERIOD = readr::col_integer(),
`LAST UPDATE` = readr::col_datetime(format = "%d/%m/%y %H:%M:%S")
)
) |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
mutate(
state = !grepl("Euro", geo),
income = cut(
x = output,
breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
include.lowest = TRUE,
labels = c("low", "middle", "high")
)
  )

The column LAST UPDATE is not used in the final or any intermediate structure.
Specifying its type in read_csv adds only noise when reading the code.

gdp <- readr::read_csv(
file = "data/estat_sdg_08_10_en.csv",
col_types = list(
unit = readr::col_factor(),
TIME_PERIOD = readr::col_integer()
)
) |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
mutate(
state = !grepl("Euro", geo),
income = cut(
x = output,
breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
include.lowest = TRUE,
labels = c("low", "middle", "high")
)
  )

The unit column type is defined in read_csv, but, because it is used as the pivoting variable in pivot_wider, it does not appear in the final structure.
In contrast, flag (originally OBS_FLAG), which is used in the final structure, is never assigned a type.

gdp <- readr::read_csv(
file = "data/estat_sdg_08_10_en.csv",
col_types = list(
OBS_FLAG = readr::col_factor(),
TIME_PERIOD = readr::col_integer()
)
) |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
mutate(
state = !grepl("Euro", geo),
income = cut(
x = output,
breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
include.lowest = TRUE,
labels = c("low", "middle", "high")
)
  )

Finally, we can use dplyr:: to call the dplyr functions directly, informing the reader where the functions come from and avoiding loading the entire dplyr namespace and masking other functions.

gdp <- readr::read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    OBS_FLAG = readr::col_factor(),
    TIME_PERIOD = readr::col_integer()
  )
) |>
  dplyr::select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  dplyr::rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  dplyr::mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  dplyr::filter(year == 2023) |>
  tidyr::pivot_wider(names_from = unit, values_from = OBS_VALUE) |>
  dplyr::mutate(
    state = !grepl("Euro", geo),
    income = cut(
      x = output,
      breaks = quantile(output, probs = c(0, 0.25, 0.75, 1)),
      include.lowest = TRUE,
      labels = c("low", "middle", "high")
    )
  )

The ict data frame

Using the file estat_isoc_sks_itspt_en.csv, we can apply similar transformations to create the ict data frame (How?).

The ict time series

In the gdp data frame, we have filtered the data to keep only the year 2023.
In the ict data frame, we have kept the time dimension of the data.
A natural question is how many observations we have for each country.
We can use the count() function to answer the question.
As with most dplyr functions, count() takes the data frame as its first argument.
count() gives the number of observations for each unique combination of values in the specified variables.
Here, we count by the geo variable.
The counts are stored in a new column n of integer type.
The result has one row for each value of geo.
If we call count() without specifying any variables, we get the total number of observations in the data frame.
We can also use count() with more than one variable.

# A tibble: 70 × 3
geo after_2020 n
<fct> <lgl> <int>
1 Austria FALSE 17
2 Austria TRUE 3
3 Bosnia and Herzegovina TRUE 3
4 Belgium FALSE 17
5 Belgium TRUE 3
6 Bulgaria FALSE 17
7 Bulgaria TRUE 3
8 Switzerland FALSE 10
9 Switzerland TRUE 3
10 Cyprus FALSE 17
# ℹ 60 more rows
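The three uses of count() described above can be sketched on a small data frame. The `toy` data and the `after_2020` definition (year > 2020) are assumptions made for illustration; the source does not show how `after_2020` was created.

```r
# Hypothetical panel with two countries.
toy <- data.frame(
  geo = c("A", "A", "A", "B", "B"),
  year = c(2019, 2021, 2022, 2021, 2022)
)

# Observations per country.
per_geo <- dplyr::count(toy, geo)

# Total number of observations (no variables specified).
total <- dplyr::count(toy)

# Observations per combination of two variables.
by_two <- toy |>
  dplyr::mutate(after_2020 = year > 2020) |>
  dplyr::count(geo, after_2020)

by_two
```

Combinations that do not occur in the data (here, country B before 2021) do not appear in the result.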
count() is a group-wise operation.
Internally, count() groups the data frame by the specified variables.
We can reproduce the behavior of count() to get a first idea of how grouping works.
First, we need to tell dplyr that we want to group the data by country.

group_by()

Grouping is done with the group_by() function.
As with most dplyr functions, group_by() takes the data frame as its first argument.

# A tibble: 654 × 4
# Groups: geo [37]
geo year ict_percentage ict_thousand_persons
<fct> <int> <dbl> <dbl>
1 Austria 2004 2.9 106.
2 Austria 2005 3.1 116
3 Austria 2006 3.3 126
4 Austria 2007 3.3 128.
5 Austria 2008 3.3 133.
6 Austria 2009 3.5 139.
7 Austria 2010 3.5 141.
8 Austria 2011 3.6 145
9 Austria 2012 3.6 147
10 Austria 2013 3.7 153.
# ℹ 644 more rows
The data frame is now grouped by geo.
When we ask dplyr to count observations on a grouped data frame, it counts the number of observations in each group.

summarize()

In dplyr, summarizing operations are performed via summarize().
The syntax of summarize() is similar to that of mutate().
Unlike mutate(), summarize() returns a new data frame containing one row per group, with only the grouping variables and the newly created summary variables.
We use dplyr's function n() as the summarizing statistic in this example.
n() counts the number of observations in each group.
We assign the result of n() to a new variable nobs.
After calling summarize(), the resulting data frame is not grouped.
The default behavior of summarize() is to ungroup the most nested grouping variable.
Since we only grouped by geo, removing grouping based on geo results in an ungrouped data frame.
To keep the grouping, we can pass .groups = "keep" to summarize().

# A tibble: 37 × 2
# Groups: geo [37]
geo nobs
<fct> <int>
1 Austria 20
2 Bosnia and Herzegovina 3
3 Belgium 20
4 Bulgaria 20
5 Switzerland 13
6 Cyprus 20
7 Czechia 20
8 Germany 20
9 Denmark 20
10 Estonia 20
# ℹ 27 more rows
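The effect of the .groups argument can be checked directly. This is a sketch on a hypothetical `toy` data frame; `dplyr::is_grouped_df()` reports whether a data frame carries grouping information.

```r
# Hypothetical data with two groups.
toy <- data.frame(
  geo = c("A", "A", "A", "B", "B"),
  value = c(1, 2, 3, 10, 20)
)

# Default: the most nested (here, the only) grouping variable is dropped.
counts <- toy |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())

# .groups = "keep" preserves the grouping of the input.
kept <- toy |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n(), .groups = "keep")

dplyr::is_grouped_df(counts)
dplyr::is_grouped_df(kept)
```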
Notice that summarize() returns only the grouping and newly created variables.
Suppose that, instead, we want to keep all rows and columns of the original data and add the number of observations per country as a new column.

mutate() with groups

We can combine group_by(), mutate(), and n() to achieve the desired result.
On a grouped data frame, mutate() applies its transformations within each group.

# A tibble: 654 × 5
# Groups: geo [37]
geo year ict_percentage ict_thousand_persons nobs
<fct> <int> <dbl> <dbl> <int>
1 Bosnia and Herzegovina 2021 1.5 17.6 3
2 Bosnia and Herzegovina 2022 1.7 19.3 3
3 Bosnia and Herzegovina 2023 2 24.3 3
4 United Kingdom 2011 4.8 1392. 9
5 United Kingdom 2012 5 1461. 9
6 United Kingdom 2013 4.9 1477. 9
7 United Kingdom 2014 5 1517. 9
8 United Kingdom 2015 5.2 1624. 9
9 United Kingdom 2016 5.3 1674 9
10 United Kingdom 2017 5.2 1657. 9
# ℹ 644 more rows
mutate() keeps all the rows of the original data frame.
It also, unlike summarize(), maintains the grouping.

# A tibble: 654 × 5
geo year ict_percentage ict_thousand_persons nobs
<fct> <int> <dbl> <dbl> <int>
1 Austria 2004 2.9 106. 20
2 Austria 2005 3.1 116 20
3 Austria 2006 3.3 126 20
4 Austria 2007 3.3 128. 20
5 Austria 2008 3.3 133. 20
6 Austria 2009 3.5 139. 20
7 Austria 2010 3.5 141. 20
8 Austria 2011 3.6 145 20
9 Austria 2012 3.6 147 20
10 Austria 2013 3.7 153. 20
# ℹ 644 more rows
The grouping can be removed with ungroup().

mutate() vs. summarize()

| Action | mutate() with grouping | summarize() |
|---|---|---|
| Transformation | Per group | Per group |
| Assignment | One value per group | One value per group |
| Result rows | All rows | One row per group |
| Result columns | All columns | Grouping and new columns |
| Automated ungrouping | No | Yes (most nested grouping variable) |
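The rows of the comparison can be verified side by side on a small example. The `toy` data frame is hypothetical; both pipelines compute the same group counts.

```r
# Hypothetical data with two groups.
toy <- data.frame(
  geo = c("A", "A", "A", "B", "B"),
  value = c(1, 2, 3, 10, 20)
)

# mutate(): every row is kept and the group count is repeated per row.
with_counts <- toy |>
  dplyr::group_by(geo) |>
  dplyr::mutate(nobs = dplyr::n()) |>
  dplyr::ungroup()   # grouping must be removed explicitly

# summarize(): each group collapses into a single row.
collapsed <- toy |>
  dplyr::group_by(geo) |>
  dplyr::summarize(nobs = dplyr::n())

nrow(with_counts)
nrow(collapsed)
```

The first result still contains the value column, whereas the second contains only geo and nobs.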
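So far, we have counted observations either by using summarize() and collapsing each group into a single row,
or by using mutate() and keeping all rows.
We can summarize with other statistics too, e.g., the average of ict_percentage per country using the mean() function.

# A tibble: 37 × 2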
geo avg
<fct> <dbl>
1 Austria 3.90
2 Bosnia and Herzegovina 1.73
3 Belgium 4.44
4 Bulgaria 2.82
5 Switzerland 5.08
6 Cyprus 2.94
7 Czechia 3.99
8 Germany 3.88
9 Denmark 4.84
10 Estonia 4.51
# ℹ 27 more rows
Summary statistics can be calculated with base R functions like mean().
Several statistics can be calculated at once with other base R functions like median(), sd(), min(), and max().

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    nobs = dplyr::n(),
    min = min(ict_percentage),
    mean = mean(ict_percentage),
    median = median(ict_percentage),
    sd = sd(ict_percentage),
    max = max(ict_percentage)
  )

# A tibble: 37 × 7
geo nobs min mean median sd max
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austria 20 2.9 3.90 3.65 0.653 5.3
2 Bosnia and Herzegovina 3 1.5 1.73 1.7 0.252 2
3 Belgium 20 3.4 4.44 4.2 0.669 5.6
4 Bulgaria 20 2.2 2.82 2.5 0.587 4.3
5 Switzerland 13 4.3 5.08 5 0.438 5.7
6 Cyprus 20 2.2 2.94 2.7 0.810 5.4
7 Czechia 20 3.4 3.99 3.95 0.439 4.6
8 Germany 20 3.1 3.88 3.7 0.550 5
9 Denmark 20 4.1 4.84 4.8 0.481 5.9
10 Estonia 20 2.5 4.51 4.1 1.29 6.7
# ℹ 27 more rows
Summaries can also be custom expressions that combine such functions.

ict |>
  dplyr::group_by(geo) |>
  dplyr::summarize(
    custom1 = sd(ict_percentage) / median(ict_percentage),
    custom2 = 2 * mean(ict_percentage) - 1
  )

# A tibble: 37 × 3
geo custom1 custom2
<fct> <dbl> <dbl>
1 Austria 0.179 6.81
2 Bosnia and Herzegovina 0.148 2.47
3 Belgium 0.159 7.89
4 Bulgaria 0.235 4.65
5 Switzerland 0.0876 9.15
6 Cyprus 0.300 4.87
7 Czechia 0.111 6.98
8 Germany 0.149 6.76
9 Denmark 0.100 8.68
10 Estonia 0.314 8.03
# ℹ 27 more rows

Slicing operations

On a grouped data frame, slicing operations act within each group.
The next two outputs keep the first three and the last three observations of each country, respectively.

# A tibble: 111 × 4
# Groups: geo [37]
geo year ict_percentage ict_thousand_persons
<fct> <int> <dbl> <dbl>
1 Austria 2004 2.9 106.
2 Austria 2005 3.1 116
3 Austria 2006 3.3 126
4 Bosnia and Herzegovina 2021 1.5 17.6
5 Bosnia and Herzegovina 2022 1.7 19.3
6 Bosnia and Herzegovina 2023 2 24.3
7 Belgium 2004 3.4 143.
8 Belgium 2005 3.5 147.
9 Belgium 2006 3.7 156.
10 Bulgaria 2004 2.2 63.8
# ℹ 101 more rows
# A tibble: 111 × 4
# Groups: geo [37]
geo year ict_percentage ict_thousand_persons
<fct> <int> <dbl> <dbl>
1 Austria 2021 4.5 192.
2 Austria 2022 5 221.
3 Austria 2023 5.3 237.
4 Bosnia and Herzegovina 2021 1.5 17.6
5 Bosnia and Herzegovina 2022 1.7 19.3
6 Bosnia and Herzegovina 2023 2 24.3
7 Belgium 2021 5.6 272.
8 Belgium 2022 5.6 278.
9 Belgium 2023 5.4 273.
10 Bulgaria 2021 3.5 108
# ℹ 101 more rows
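Group-wise slicing can be sketched as follows. The `toy` data frame is hypothetical, and the calls are not necessarily the exact ones behind the outputs above.

```r
# Hypothetical panel: two countries, three years each.
toy <- data.frame(
  geo = c("A", "A", "A", "B", "B", "B"),
  year = c(2021, 2022, 2023, 2021, 2022, 2023),
  value = c(1, 3, 2, 6, 4, 5)
)

grouped <- dplyr::group_by(toy, geo)

# First two rows of each group.
first_two <- dplyr::slice_head(grouped, n = 2)

# Row with the largest value within each group.
largest <- dplyr::slice_max(grouped, value, n = 1)

largest
```

slice_tail() and slice_min() work analogously for the last rows and the smallest values.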
Besides slice_head() and slice_tail(), there are other slicing operations available in dplyr.
slice_sample(n = 1): Randomly selects \(n\) rows from each group.
slice_min(n = 1): Returns the rows with the \(n\) smallest values of a variable in a group.
slice_max(n = 1): Returns the rows with the \(n\) largest values of a variable in a group.

Accessing columns

Sometimes we need the values of a column outside a dplyr pipeline.
There are several ways to access them:
The pull() function from the dplyr package.
The $ operator.
The [[ operator.
The [ operator (for multiple columns).

pull()

We have already used the pull() function from the dplyr package.
pulled_col is a vector containing the values of the geo column of the gdp data frame.
Its length equals the number of rows of the gdp data frame.
We can get the length with the length() function from base R.
Verifying that pulled_col is identical to the geo column requires a bit more work.

all and any

In R, using one of the logical comparison operators, such as ==, to compare two vectors returns a logical vector.
The all() function from base R accepts a logical vector and returns TRUE if all elements are TRUE.
The any() function from base R accepts a logical vector and returns TRUE if at least one element is TRUE.

$

$ is a special R operator acting on lists and data frames to extract or replace parts.
We can use $ to directly access the values of a column as a vector.
The pulled_col extracted with pull() is identical to the geo column of gdp.

[[

We can also access a column as a vector with the [[ indexing operator.

[

We have used dplyr's select() to select multiple columns.
We can also use the [ operator to access multiple columns of a data frame.
The result contains the same columns as the one obtained with select().
The [ operator can also be used to access a single column, but it returns a data frame instead of a vector.

We have obtained unique values with dplyr's distinct().
An alternative is unique() from base R.
Membership can be checked with the %in% operation.
Using %in% with two vectors returns a new boolean vector.
The result has the same length as the vector on the left-hand side of %in%.
More precisely, %in% with a vector on the left-hand side checks elementwise if the values of the left vector can be found in the values of the right vector.
We can combine %in% with all to examine if a set is a subset of another.
To find the elements of one vector that are missing from another, R has a function setdiff() that does exactly this.
setdiff() returns the values of the first argument that do not appear in the second argument.
We can use the intersect() function to find the common elements of two vectors.
Unlike setdiff(), the order of the arguments does not matter because the intersection operation is symmetric.
The combined elements of two vectors are given by union().

Combining data with joins

We have two data frames:
gdp: Contains growth rates and output data for EU and (some) non-EU countries.
ict: Contains ICT employment data for EU and (some) non-EU countries.

The gdp and ict data frames

Let us compare the columns of the gdp and ict data frames.
We can get the column names with the names() function from base R.
The columns geo and year are common in both data frames.
The column names geo and year are common, but their underlying values are not identical.
For the geo columns, the differences can be found by asking:
Which values appear only in the gdp data frame?
Which values appear only in the ict data frame?

Joining gdp and ict: First approach

We start from the gdp data frame.
We add two columns, ict_percentage and ict_thousand_persons, with the values from the ict data frame.
What happens with the rows of gdp that do not have a corresponding row in ict?
If the geo and year values of a row in gdp are equal to the geo and year values of a row in ict, then copy the values.
If the geo and year values of a row in gdp cannot be found in ict, then assign NA.
This can be done with the left_join() function from the dplyr package.
left_join() takes two data frames, x and y.
It keeps all the rows of x and adds the columns of y that do not exist in x.

Joining with `by = join_by(geo, year)`
# A tibble: 36 × 9
geo year flag output growth state income ict_percentage
<chr> <int> <fct> <dbl> <dbl> <lgl> <fct> <dbl>
1 Austria 2023 <NA> 37860 -1.8 TRUE high 5.3
2 Belgium 2023 p 37310 0.4 TRUE middle 5.4
3 Bulgaria 2023 <NA> 7900 2.2 TRUE low 4.3
4 Switzerland 2023 p 63870 -0.8 TRUE high 5.7
5 Cyprus 2023 p 29080 1 TRUE middle 5.4
6 Czechia 2023 <NA> 18480 -1.2 TRUE middle 4.3
7 Germany 2023 p 36290 -1.1 TRUE middle 4.9
8 Denmark 2023 <NA> 52510 1.8 TRUE high 5.9
9 Euro area - 19 countri… 2023 <NA> 32340 -0.3 FALSE middle NA
10 Euro area – 20 countri… 2023 <NA> 32150 -0.2 FALSE middle NA
# ℹ 26 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>
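
The first approach can be sketched on a small scale. The data frames gdp_toy and ict_toy below are hypothetical stand-ins for gdp and ict, with only a few columns and partly made-up values; they are not the data used above.

```r
library(dplyr)

# Hypothetical miniature versions of gdp and ict (illustration only).
gdp_toy <- tibble(
  geo    = c("Austria", "Belgium", "Montenegro"),
  year   = c(2023L, 2023L, 2023L),
  growth = c(-1.8, 0.4, 3.7)
)
ict_toy <- tibble(
  geo            = c("Austria", "Austria", "Belgium"),
  year           = c(2022L, 2023L, 2023L),
  ict_percentage = c(5.1, 5.3, 5.4)
)

# Keep every row of gdp_toy and copy ict_percentage where geo and year match.
# Rows of gdp_toy without a match in ict_toy get NA in the new column.
joined <- left_join(gdp_toy, ict_toy, by = join_by(geo, year))
joined
```

Passing by explicitly, as done here, avoids the "Joining with ..." message because the key columns are no longer guessed.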
gdp and ict: First approach verification
Let us verify that the left_join() function works as expected.
The geo and year columns contain exactly the values of the gdp data frame.
The ict_percentage and ict_thousand_persons columns are NA for the rows that do not have a corresponding row in the ict data frame.
This requires comparing the combinations of geo and year in the gdp and ict data frames.
Base R's setdiff() works with vectors, not data frames.
The combinations can be built with the paste() function.
paste() accepts two or more vectors, converts them to strings, and concatenates them element-wise.
# A tibble: 3 × 9
geo year flag output growth state income ict_percentage
<chr> <int> <fct> <dbl> <dbl> <lgl> <fct> <dbl>
1 Euro area - 19 countrie… 2023 <NA> 32340 -0.3 FALSE middle NA
2 Euro area – 20 countrie… 2023 <NA> 32150 -0.2 FALSE middle NA
3 Montenegro 2023 p 6900 3.7 TRUE low NA
# ℹ 1 more variable: ict_thousand_persons <dbl>
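
The verification steps above might look roughly as follows. This is a sketch on hypothetical toy data (gdp_toy, ict_toy, and the helper names gdp_keys, ict_keys, unmatched, and na_keys are all assumptions for illustration). The base:: prefix is used because dplyr masks setdiff() and setequal().

```r
library(dplyr)

# Hypothetical miniature versions of gdp and ict (illustration only).
gdp_toy <- tibble(
  geo    = c("Austria", "Belgium", "Montenegro"),
  year   = c(2023L, 2023L, 2023L),
  growth = c(-1.8, 0.4, 3.7)
)
ict_toy <- tibble(
  geo            = c("Austria", "Austria", "Belgium"),
  year           = c(2022L, 2023L, 2023L),
  ict_percentage = c(5.1, 5.3, 5.4)
)

# Build one string per (geo, year) combination, e.g. "Austria 2023".
gdp_keys <- paste(gdp_toy$geo, gdp_toy$year)
ict_keys <- paste(ict_toy$geo, ict_toy$year)

# Combinations that appear in gdp_toy but not in ict_toy.
unmatched <- base::setdiff(gdp_keys, ict_keys)

# Rows of the joined result where the ICT column is NA.
joined <- left_join(gdp_toy, ict_toy, by = join_by(geo, year))
na_keys <- joined |>
  filter(is.na(ict_percentage)) |>
  mutate(key = paste(geo, year)) |>
  pull(key)

# The NA rows should be exactly the unmatched combinations,
# and the key columns should be those of gdp_toy.
base::setequal(na_keys, unmatched)
all(joined$geo == gdp_toy$geo)
all(joined$year == gdp_toy$year)
```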
gdp and ict: Second approach
Start from the ict data frame.
Add the columns flag, output, growth, state, and income with the values from the gdp data frame.
This amounts to swapping gdp and ict in our previous approach.
Joining with `by = join_by(geo, year)`
# A tibble: 654 × 9
geo year ict_percentage ict_thousand_persons flag output growth state
<chr> <int> <dbl> <dbl> <fct> <dbl> <dbl> <lgl>
1 Austria 2004 2.9 106. <NA> NA NA NA
2 Austria 2005 3.1 116 <NA> NA NA NA
3 Austria 2006 3.3 126 <NA> NA NA NA
4 Austria 2007 3.3 128. <NA> NA NA NA
5 Austria 2008 3.3 133. <NA> NA NA NA
6 Austria 2009 3.5 139. <NA> NA NA NA
7 Austria 2010 3.5 141. <NA> NA NA NA
8 Austria 2011 3.6 145 <NA> NA NA NA
9 Austria 2012 3.6 147 <NA> NA NA NA
10 Austria 2013 3.7 153. <NA> NA NA NA
# ℹ 644 more rows
# ℹ 1 more variable: income <fct>
gdp and ict: Second approach
Start from the ict data frame.
Alternatively, we can use the right_join() function.
Joining with `by = join_by(geo, year)`
# A tibble: 654 × 9
geo year flag output growth state income ict_percentage
<chr> <int> <fct> <dbl> <dbl> <lgl> <fct> <dbl>
1 Austria 2023 <NA> 37860 -1.8 TRUE high 5.3
2 Belgium 2023 p 37310 0.4 TRUE middle 5.4
3 Bulgaria 2023 <NA> 7900 2.2 TRUE low 4.3
4 Switzerland 2023 p 63870 -0.8 TRUE high 5.7
5 Cyprus 2023 p 29080 1 TRUE middle 5.4
6 Czechia 2023 <NA> 18480 -1.2 TRUE middle 4.3
7 Germany 2023 p 36290 -1.1 TRUE middle 4.9
8 Denmark 2023 <NA> 52510 1.8 TRUE high 5.9
9 Estonia 2023 <NA> 15250 -5.4 TRUE middle 6.7
10 Greece 2023 p 19460 2.6 TRUE middle 2.4
# ℹ 644 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>
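
The two ways of writing the second approach can be compared on hypothetical toy data (gdp_toy and ict_toy are made-up stand-ins, not the data above). The sketch checks that both calls return the same set of rows and differ only in how rows and columns are ordered.

```r
library(dplyr)

# Hypothetical miniature versions of gdp and ict (illustration only).
gdp_toy <- tibble(
  geo    = c("Austria", "Belgium", "Montenegro"),
  year   = c(2023L, 2023L, 2023L),
  growth = c(-1.8, 0.4, 3.7)
)
ict_toy <- tibble(
  geo            = c("Austria", "Austria", "Belgium"),
  year           = c(2022L, 2023L, 2023L),
  ict_percentage = c(5.1, 5.3, 5.4)
)

# Both calls keep every row of ict_toy.
left_res  <- left_join(ict_toy, gdp_toy, by = join_by(geo, year))
right_res <- right_join(gdp_toy, ict_toy, by = join_by(geo, year))

# The columns of the x argument come first in each result.
names(left_res)
names(right_res)

# After aligning column and row order, the two results coincide.
same_rows <- isTRUE(all.equal(
  arrange(left_res, geo, year),
  arrange(select(right_res, all_of(names(left_res))), geo, year)
))
same_rows
```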
gdp and ict: Second approach
Start from the ict data frame.
The outputs of a left_join() and a right_join() are different.
Both left_join() and right_join() maintain the order of the rows and columns of the x argument.

gdp and ict: Third approach
The previous approaches produce NA values in the resulting data frame if there are non-matching rows.
Sometimes we want to avoid these NA values.
This can be done with the inner_join() function.
Joining with `by = join_by(geo, year)`
# A tibble: 33 × 9
geo year flag output growth state income ict_percentage
<chr> <int> <fct> <dbl> <dbl> <lgl> <fct> <dbl>
1 Austria 2023 <NA> 37860 -1.8 TRUE high 5.3
2 Belgium 2023 p 37310 0.4 TRUE middle 5.4
3 Bulgaria 2023 <NA> 7900 2.2 TRUE low 4.3
4 Switzerland 2023 p 63870 -0.8 TRUE high 5.7
5 Cyprus 2023 p 29080 1 TRUE middle 5.4
6 Czechia 2023 <NA> 18480 -1.2 TRUE middle 4.3
7 Germany 2023 p 36290 -1.1 TRUE middle 4.9
8 Denmark 2023 <NA> 52510 1.8 TRUE high 5.9
9 Estonia 2023 <NA> 15250 -5.4 TRUE middle 6.7
10 Greece 2023 p 19460 2.6 TRUE middle 2.4
# ℹ 23 more rows
# ℹ 1 more variable: ict_thousand_persons <dbl>
gdp and ict: Third approach
Like the left_join() and right_join() functions, the inner_join() function maintains the order of the rows and columns of the x argument.
Joining with `by = join_by(geo, year)`
# A tibble: 33 × 9
geo year ict_percentage ict_thousand_persons flag output growth state
<chr> <int> <dbl> <dbl> <fct> <dbl> <dbl> <lgl>
1 Austria 2023 5.3 237. <NA> 37860 -1.8 TRUE
2 Belgium 2023 5.4 273. p 37310 0.4 TRUE
3 Bulgaria 2023 4.3 126. <NA> 7900 2.2 TRUE
4 Switzerl… 2023 5.7 273 p 63870 -0.8 TRUE
5 Cyprus 2023 5.4 24.7 p 29080 1 TRUE
6 Czechia 2023 4.3 218. <NA> 18480 -1.2 TRUE
7 Germany 2023 4.9 2108. p 36290 -1.1 TRUE
8 Denmark 2023 5.9 177. <NA> 52510 1.8 TRUE
9 Estonia 2023 6.7 46.5 <NA> 15250 -5.4 TRUE
10 Greece 2023 2.4 100. p 19460 2.6 TRUE
# ℹ 23 more rows
# ℹ 1 more variable: income <fct>
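
A small sketch of the third approach, again on hypothetical toy data (gdp_toy and ict_toy are assumptions for illustration). It shows that inner_join() keeps only the matching rows, so the join introduces no NA values, and that swapping the arguments only changes the column order.

```r
library(dplyr)

# Hypothetical miniature versions of gdp and ict (illustration only).
gdp_toy <- tibble(
  geo    = c("Austria", "Belgium", "Montenegro"),
  year   = c(2023L, 2023L, 2023L),
  growth = c(-1.8, 0.4, 3.7)
)
ict_toy <- tibble(
  geo            = c("Austria", "Austria", "Belgium"),
  year           = c(2022L, 2023L, 2023L),
  ict_percentage = c(5.1, 5.3, 5.4)
)

# Only (geo, year) combinations present in both data frames survive.
inner_gi <- inner_join(gdp_toy, ict_toy, by = join_by(geo, year))
inner_ig <- inner_join(ict_toy, gdp_toy, by = join_by(geo, year))

nrow(inner_gi)    # number of matching rows
anyNA(inner_gi)   # no NA values are introduced by the join
names(inner_gi)   # columns of gdp_toy first
names(inner_ig)   # columns of ict_toy first
```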
gdp and ict: Fourth approach
To keep all rows of both data frames, we can use the full_join() function.
Joining with `by = join_by(geo, year)`
# A tibble: 657 × 9
geo year ict_percentage ict_thousand_persons flag output growth state
<chr> <int> <dbl> <dbl> <fct> <dbl> <dbl> <lgl>
1 Austria 2004 2.9 106. <NA> NA NA NA
2 Austria 2005 3.1 116 <NA> NA NA NA
3 Austria 2006 3.3 126 <NA> NA NA NA
4 Austria 2007 3.3 128. <NA> NA NA NA
5 Austria 2008 3.3 133. <NA> NA NA NA
6 Austria 2009 3.5 139. <NA> NA NA NA
7 Austria 2010 3.5 141. <NA> NA NA NA
8 Austria 2011 3.6 145 <NA> NA NA NA
9 Austria 2012 3.6 147 <NA> NA NA NA
10 Austria 2013 3.7 153. <NA> NA NA NA
# ℹ 647 more rows
# ℹ 1 more variable: income <fct>
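
The fourth approach can be sketched the same way on hypothetical toy data (gdp_toy and ict_toy are made up for illustration). When the (geo, year) combinations are unique within each data frame, as they are here, the number of rows of the full join equals the rows of both inputs minus the matching rows.

```r
library(dplyr)

# Hypothetical miniature versions of gdp and ict (illustration only).
gdp_toy <- tibble(
  geo    = c("Austria", "Belgium", "Montenegro"),
  year   = c(2023L, 2023L, 2023L),
  growth = c(-1.8, 0.4, 3.7)
)
ict_toy <- tibble(
  geo            = c("Austria", "Austria", "Belgium"),
  year           = c(2022L, 2023L, 2023L),
  ict_percentage = c(5.1, 5.3, 5.4)
)

# Keep every row of both data frames; unmatched cells are filled with NA.
full_res  <- full_join(ict_toy, gdp_toy, by = join_by(geo, year))
inner_res <- inner_join(ict_toy, gdp_toy, by = join_by(geo, year))

# Row count check, valid because the keys are unique in each input.
nrow(full_res) == nrow(ict_toy) + nrow(gdp_toy) - nrow(inner_res)
full_res
```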