Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Overview of data transformation activities. Data frames. Distinct values. Selecting, renaming, and mutating variables. Filtering and arranging data. Data pivoting.
February 13th, 2025
eu_ict dataset to create various visualizations.eu_ict dataset from scratch.R.dplyr and tidyr packages to perform data transformations.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
dplyr package provides three main types of data transformations:
tidyr package provides two main types of data transformations:
eu_ict dataset combines data from three Eurostat datasets:
readr package to read the data from the original sources.Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1308 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): DATAFLOW, LAST UPDATE, freq, unit, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 496671 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): DATAFLOW, LAST UPDATE, freq, unit, sex, age, citizen, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sdg data frame.# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
sdg is stored as a tibble.tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
tibble’s from standard output# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
unit, geo, TIME_PERIOD, OBS_VALUE, and OBS_FLAG contain information of interest.DATAFLOW, LAST UPDATE, freq, and na_item contain metadata and have no variation within the dataset.DATAFLOW, LAST UPDATE, freq, and na_item contain metadata and have no variation within the dataset.dplyr function distinct(), and, at the same time, perform our first data transformation operation.distinct() returns the rows of the data frame with unique combinations of the specified columns.distinct() is the data frame.DATAFLOW and LAST UPDATE values in the sdg dataset.distinct()’s resultdistinct() contains only the specified columns.distinct() to keep all columns by using the .keep_all argument.distinct() returns the first row of each unique combination of the specified columns.distinct()’s result# A tibble: 1 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_10… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
# ℹ 1 more variable: OBS_FLAG <chr>
distinct() to keep all columns by using the .keep_all argument.distinct() returns the first row of each unique combination of the specified columns.distinct()’s argumentsdistinct() call, the LAST UPDATE column is enclosed in backticks.LAST UPDATE column name contains a space.R cannot contain spaces.R.distinct()’s argumentsdistinct() call, the LAST UPDATE column is enclosed in backticks.dplyr calls or other R operations.distinct() using the usual parenthesized notation.dplyr functions is by using the pipe operator |>.DATAFLOW, LAST UPDATE, freq, and na_item are not very relevant for our analysis.select() function to keep only the variables we are interested in.select() function to keep only the variables we are interested in.# A tibble: 1,845 × 5
unit geo TIME_PERIOD OBS_VALUE OBS_FLAG
<chr> <chr> <dbl> <dbl> <chr>
1 Chain linked volumes (2010), euro per c… Alba… 2000 1700 <NA>
2 Chain linked volumes (2010), euro per c… Alba… 2001 1850 <NA>
3 Chain linked volumes (2010), euro per c… Alba… 2002 1940 <NA>
4 Chain linked volumes (2010), euro per c… Alba… 2003 2060 <NA>
5 Chain linked volumes (2010), euro per c… Alba… 2004 2180 <NA>
6 Chain linked volumes (2010), euro per c… Alba… 2005 2310 <NA>
7 Chain linked volumes (2010), euro per c… Alba… 2006 2460 <NA>
8 Chain linked volumes (2010), euro per c… Alba… 2007 2630 <NA>
9 Chain linked volumes (2010), euro per c… Alba… 2008 2850 d
10 Chain linked volumes (2010), euro per c… Alba… 2009 2960 <NA>
# ℹ 1,835 more rows
select() function.R, we concatenate objects using the c() function.select() function.# A tibble: 1,845 × 5
unit geo TIME_PERIOD OBS_VALUE OBS_FLAG
<chr> <chr> <dbl> <dbl> <chr>
1 Chain linked volumes (2010), euro per c… Alba… 2000 1700 <NA>
2 Chain linked volumes (2010), euro per c… Alba… 2001 1850 <NA>
3 Chain linked volumes (2010), euro per c… Alba… 2002 1940 <NA>
4 Chain linked volumes (2010), euro per c… Alba… 2003 2060 <NA>
5 Chain linked volumes (2010), euro per c… Alba… 2004 2180 <NA>
6 Chain linked volumes (2010), euro per c… Alba… 2005 2310 <NA>
7 Chain linked volumes (2010), euro per c… Alba… 2006 2460 <NA>
8 Chain linked volumes (2010), euro per c… Alba… 2007 2630 <NA>
9 Chain linked volumes (2010), euro per c… Alba… 2008 2850 d
10 Chain linked volumes (2010), euro per c… Alba… 2009 2960 <NA>
# ℹ 1,835 more rows
select() function.c() is superfluous.# A tibble: 1,845 × 5
unit geo TIME_PERIOD OBS_VALUE OBS_FLAG
<chr> <chr> <dbl> <dbl> <chr>
1 Chain linked volumes (2010), euro per c… Alba… 2000 1700 <NA>
2 Chain linked volumes (2010), euro per c… Alba… 2001 1850 <NA>
3 Chain linked volumes (2010), euro per c… Alba… 2002 1940 <NA>
4 Chain linked volumes (2010), euro per c… Alba… 2003 2060 <NA>
5 Chain linked volumes (2010), euro per c… Alba… 2004 2180 <NA>
6 Chain linked volumes (2010), euro per c… Alba… 2005 2310 <NA>
7 Chain linked volumes (2010), euro per c… Alba… 2006 2460 <NA>
8 Chain linked volumes (2010), euro per c… Alba… 2007 2630 <NA>
9 Chain linked volumes (2010), euro per c… Alba… 2008 2850 d
10 Chain linked volumes (2010), euro per c… Alba… 2009 2960 <NA>
# ℹ 1,835 more rows
select()’s resultsselect() function operates on the columns of a data frame.nrow() function, which returns the number of rows in a data frame.TIME_PERIOD is among the variables we selected.year?rename()rename() function of package dplyr can be used to rename variables.=.=.rename()rename() function of package dplyr can be used to rename variables.# A tibble: 1,845 × 5
unit geo year OBS_VALUE OBS_FLAG
<chr> <chr> <dbl> <dbl> <chr>
1 Chain linked volumes (2010), euro per capita Albania 2000 1700 <NA>
2 Chain linked volumes (2010), euro per capita Albania 2001 1850 <NA>
3 Chain linked volumes (2010), euro per capita Albania 2002 1940 <NA>
4 Chain linked volumes (2010), euro per capita Albania 2003 2060 <NA>
5 Chain linked volumes (2010), euro per capita Albania 2004 2180 <NA>
6 Chain linked volumes (2010), euro per capita Albania 2005 2310 <NA>
7 Chain linked volumes (2010), euro per capita Albania 2006 2460 <NA>
8 Chain linked volumes (2010), euro per capita Albania 2007 2630 <NA>
9 Chain linked volumes (2010), euro per capita Albania 2008 2850 d
10 Chain linked volumes (2010), euro per capita Albania 2009 2960 <NA>
# ℹ 1,835 more rows
rename()rename() to rename multiple variables at once.TIME_PERIOD and OBS_FLAG to year and flag, respectively.rename()rename() to rename multiple variables at once.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG)# A tibble: 1,845 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 Chain linked volumes (2010), euro per capita Albania 2000 1700 <NA>
2 Chain linked volumes (2010), euro per capita Albania 2001 1850 <NA>
3 Chain linked volumes (2010), euro per capita Albania 2002 1940 <NA>
4 Chain linked volumes (2010), euro per capita Albania 2003 2060 <NA>
5 Chain linked volumes (2010), euro per capita Albania 2004 2180 <NA>
6 Chain linked volumes (2010), euro per capita Albania 2005 2310 <NA>
7 Chain linked volumes (2010), euro per capita Albania 2006 2460 <NA>
8 Chain linked volumes (2010), euro per capita Albania 2007 2630 <NA>
9 Chain linked volumes (2010), euro per capita Albania 2008 2850 d
10 Chain linked volumes (2010), euro per capita Albania 2009 2960 <NA>
# ℹ 1,835 more rows
unit variable in the resulting data frame, we observe it takes two distinct values.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
distinct(unit)# A tibble: 2 × 1
unit
<chr>
1 Chain linked volumes (2010), euro per capita
2 Chain linked volumes, percentage change on previous period, per capita
mutate()The dplyr’s mutate() function allows us to modify existing variables or create new variables based on the existing ones.
We can use mutate() by passing one or more keyword arguments that specify the desired transformations.
mutate() transformationunit such that:
mutate() transformationunit such that for each value \(x\)
mutate() transformationunit such that for each value \(x\)
mutate() transformationR!grepl()R function grepl() to check if a value contains the word “percentage”.pattern that specifies the word we are looking for, andx that specifies the value we are examining.TRUE if the pattern is found, and FALSE otherwise.grepl()[1] FALSE
[1] TRUE
grepl()grepl(), we can rewrite our pseudocode in syntactically valid R code:if-else control structure, which cannot be directly used in mutate().ifelse()R function ifelse(), which provides a functional form of the if-else control structure.TRUE.FALSE.ifelse()ifelse()grepl() and ifelse combined, we can rewrite our pseudocode in a single-line syntactically valid R code:mutate() transformationmutate() to simplify the values of the unit variable.mutate() function receives a keyword argument that specifies the transformation we want to perform.rename(), the new name or the modified variable name is at the left-hand side of =.= and can use one or more existing variables.mutate() transformationsdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output"))# A tibble: 1,845 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 output Albania 2000 1700 <NA>
2 output Albania 2001 1850 <NA>
3 output Albania 2002 1940 <NA>
4 output Albania 2003 2060 <NA>
5 output Albania 2004 2180 <NA>
6 output Albania 2005 2310 <NA>
7 output Albania 2006 2460 <NA>
8 output Albania 2007 2630 <NA>
9 output Albania 2008 2850 d
10 output Albania 2009 2960 <NA>
# ℹ 1,835 more rows
range() to find the range of years in the dataset.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
distinct(year) |>
range()[1] 2000 2023
filter()filter() function of the dplyr package can be used to filter rows.TRUE, the row is kept.RR can be specified using the following operators:
== (equal to)!= (not equal to)> (greater than)>= (greater than or equal to)< (less than)<= (less than or equal to)year variable is equal to the maximum year in the dataset.year equal to 2023.year == 2023.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 output Austria 2023 37860 <NA>
2 output Belgium 2023 37310 p
3 output Bulgaria 2023 7900 <NA>
4 output Switzerland 2023 63870 p
5 output Cyprus 2023 29080 p
6 output Czechia 2023 18480 <NA>
7 output Germany 2023 36290 p
8 output Denmark 2023 52510 <NA>
9 output Euro area - 19 countries (2015-2022) 2023 32340 <NA>
10 output Euro area – 20 countries (from 2023) 2023 32150 <NA>
# ℹ 62 more rows
arrange() function from the dplyr package.arrange()arrange() function expects one or more column names as arguments.arrange()sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 output Austria 2023 37860 <NA>
2 growth Austria 2023 -1.8 <NA>
3 output Belgium 2023 37310 p
4 growth Belgium 2023 0.4 p
5 output Bulgaria 2023 7900 <NA>
6 growth Bulgaria 2023 2.2 <NA>
7 output Croatia 2023 15020 p
8 growth Croatia 2023 2.7 p
9 output Cyprus 2023 29080 p
10 growth Cyprus 2023 1 p
# ℹ 62 more rows
arrange()sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo, unit)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 growth Austria 2023 -1.8 <NA>
2 output Austria 2023 37860 <NA>
3 growth Belgium 2023 0.4 p
4 output Belgium 2023 37310 p
5 growth Bulgaria 2023 2.2 <NA>
6 output Bulgaria 2023 7900 <NA>
7 growth Croatia 2023 2.7 p
8 output Croatia 2023 15020 p
9 growth Cyprus 2023 1 p
10 output Cyprus 2023 29080 p
# ℹ 62 more rows
R is vectorized (a vector-oriented language), it is more efficient to use a column-oriented data frame structure.print(n = 5) to display only the first five (instead of ten) rows and save some space.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo, unit) |>
print(n = 5)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 growth Austria 2023 -1.8 <NA>
2 output Austria 2023 37860 <NA>
3 growth Belgium 2023 0.4 p
4 output Belgium 2023 37310 p
5 growth Bulgaria 2023 2.2 <NA>
# ℹ 67 more rows
sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo, unit) |>
print(n = 5)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 growth Austria 2023 -1.8 <NA>
2 output Austria 2023 37860 <NA>
3 growth Belgium 2023 0.4 p
4 output Belgium 2023 37310 p
5 growth Bulgaria 2023 2.2 <NA>
# ℹ 67 more rows
OBS_VALE.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo, unit) |>
print(n = 5)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 growth Austria 2023 -1.8 <NA>
2 output Austria 2023 37860 <NA>
3 growth Belgium 2023 0.4 p
4 output Belgium 2023 37310 p
5 growth Bulgaria 2023 2.2 <NA>
# ℹ 67 more rows
OBS_VALE.sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo, unit) |>
print(n = 5)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 growth Austria 2023 -1.8 <NA>
2 output Austria 2023 37860 <NA>
3 growth Belgium 2023 0.4 p
4 output Belgium 2023 37310 p
5 growth Bulgaria 2023 2.2 <NA>
# ℹ 67 more rows
sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
arrange(geo, unit) |>
print(n = 5)# A tibble: 72 × 5
unit geo year OBS_VALUE flag
<chr> <chr> <dbl> <dbl> <chr>
1 growth Austria 2023 -1.8 <NA>
2 output Austria 2023 37860 <NA>
3 growth Belgium 2023 0.4 p
4 output Belgium 2023 37310 p
5 growth Bulgaria 2023 2.2 <NA>
# ℹ 67 more rows
R, we can easily pivot data using the tidyr package.
pivot_wider() function to pivot data from long to wide format.pivot_longer() function to pivot data from wide to long format.pivot_wider().pivot_wider() requires that we specify two arguments:
names_from),values_from).sdg |>
select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
filter(year == 2023) |>
pivot_wider(names_from = unit, values_from = OBS_VALUE)# A tibble: 36 × 5
geo year flag output growth
<chr> <dbl> <chr> <dbl> <dbl>
1 Austria 2023 <NA> 37860 -1.8
2 Belgium 2023 p 37310 0.4
3 Bulgaria 2023 <NA> 7900 2.2
4 Switzerland 2023 p 63870 -0.8
5 Cyprus 2023 p 29080 1
6 Czechia 2023 <NA> 18480 -1.2
7 Germany 2023 p 36290 -1.1
8 Denmark 2023 <NA> 52510 1.8
9 Euro area - 19 countries (2015-2022) 2023 <NA> 32340 -0.3
10 Euro area – 20 countries (from 2023) 2023 <NA> 32150 -0.2
# ℹ 26 more rows