An introduction to data transformation

Overview of data transformation activities. Data frames. Distinct values. Selecting, renaming, and mutating variables. Filtering and arranging data. Data pivoting.

Published

April 19th, 2026

1 Behind the jump start

In the visualization overview topic, we used the eu_ict dataset to create various visualizations.

In this section, we will construct a part of the eu_ict dataset from scratch.
During this process, we will introduce the basic concepts of data transformation in R.

2 Transforming data

Data transformation is the process of converting data from one format or structure into another.
We will use the dplyr and tidyr packages to perform data transformations.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)

2.1 Transformation types

The dplyr package provides three main types of data transformations:
1. Row-wise transformations apply a function to each row of one or more columns.
2. Column-wise transformations apply a function to one or more columns of a data frame.
3. Group-wise transformations apply a function to groups (subsets) of rows.

Transformation types

The tidyr package provides two main types of data transformations:
1. Widening transforms data from long (many rows, few columns) to wide (few rows, many columns) format.
2. Lengthening transforms data from wide to long format.

3 Original data sources

The eu_ict dataset combines data from three Eurostat datasets:
1. Real GDP per capita (Eurostat 2025c)
2. Employed ICT specialists by sex (Eurostat 2025a)
3. Employment by sex, age, and citizenship (Eurostat 2025b)

Original data sources

We will use the readr package to read the data from the original sources.
We leave the details of data import operations for a later stage.

library(readr)
sdg <- read_csv("data/estat_sdg_08_10_en.csv")

Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

isoc <- read_csv("data/estat_isoc_sks_itspt_en.csv")

Rows: 1308 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): DATAFLOW, LAST UPDATE, freq, unit, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

lfsa <- read_csv("data/estat_lfsa_egan_en.csv")

Rows: 496671 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): DATAFLOW, LAST UPDATE, freq, unit, sex, age, citizen, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4 A first look at the GDP data

We inspect the sdg data frame.

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

4.1 Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

The output informs us that sdg is stored as a tibble.

Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

It has 1,845 rows and 9 columns.

Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

The first eight columns are DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, TIME_PERIOD, OBS_VALUE.

Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

Their corresponding types are character, character, character, character, character, character, numeric, numeric.

Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

The output displays the first 10 rows of the data frame.

Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

The rows are enumerated starting from 1.

Reading `tibble`’s from standard output

sdg

# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

Finally, we are informed that the data frame has 1 more variable, OBS_FLAG, with type character.

4.2 The GDP dataset variables

We can learn more about the dataset’s variables by visiting the Eurostat website, using the DOI of (Eurostat 2025c).

The variables unit, geo, TIME_PERIOD, OBS_VALUE, and OBS_FLAG contain information of interest.
The variables DATAFLOW, LAST UPDATE, freq, and na_item contain metadata and have no variation within the dataset.

5 Finding distinct values

The variables DATAFLOW, LAST UPDATE, freq, and na_item contain metadata and have no variation within the dataset.
We can verify this using the dplyr function distinct(), and, at the same time, perform our first data transformation operation.

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

Finding distinct values

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

The function distinct() returns the rows of the data frame with unique combinations of the specified columns.

The first argument of the distinct() is the data frame.
The remaining arguments are the columns for which we want to find distinct values.

Finding distinct values

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

We can verify that there is only one unique combination of DATAFLOW and LAST UPDATE values in the sdg dataset.

5.1 A closer look at `distinct()`’s result

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

The result of distinct() contains only the specified columns.

We can instruct distinct() to keep all columns by using the .keep_all argument.
Then, distinct() returns the first row of each unique combination of the specified columns.

A closer look at `distinct()`’s result

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item, .keep_all = TRUE)

# A tibble: 1 × 9
  DATAFLOW         `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
  <chr>            <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
1 ESTAT:SDG_08_10… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
# ℹ 1 more variable: OBS_FLAG <chr>

We can instruct distinct() to keep all columns by using the .keep_all argument.
Then, distinct() returns the first row of each unique combination of the specified columns.

5.2 A closer look at `distinct()`’s arguments

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

In the distinct() call, the LAST UPDATE column is enclosed in backticks.

This is because the LAST UPDATE column name contains a space.
Variable names in R cannot contain spaces.
While such names are commonly found in datasets in the wild, they are non-syntactic names in R.

A closer look at `distinct()`’s arguments

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

In the distinct() call, the LAST UPDATE column is enclosed in backticks.

We enclose non-syntactic names in backticks to use them in dplyr calls or other R operations.

5.3 Programming digression: the pipe operator

distinct(sdg, DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

We have called distinct() using the usual parenthesized notation.

I.e., for any function \(f\) and arguments \(x_{1}\), \(x_{2}\), \(\ldots\), \(x_{n}\), we write \(f(x_{1}, x_{2}, \ldots, x_{n})\).

Programming digression: the pipe operator

sdg |> distinct(DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

However, the most common way to call dplyr functions is by using the pipe operator |>.

I.e., for any function \(f\) and arguments \(x_{1}\), \(x_{2}\), \(\ldots\), \(x_{n}\), we write \(x_{1} |> f(x_{2}, \ldots, x_{n})\).

Programming digression: the pipe operator

sdg |> distinct(DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

The pipe operator takes the output of the left-hand side operation and passes it as the first argument of the right-hand side call.

Programming digression: the pipe operator

sdg |>
  distinct(DATAFLOW, `LAST UPDATE`, freq, na_item)

# A tibble: 1 × 4
  DATAFLOW             `LAST UPDATE`     freq   na_item                         
  <chr>                <chr>             <chr>  <chr>                           
1 ESTAT:SDG_08_10(1.0) 20/12/24 11:00:00 Annual Gross domestic product at marke…

The pipe operator takes the output of the left-hand side operation and passes it as the first argument of the right-hand side call.
It is customary to insert a newline after the pipe operator to improve readability.

Programming digression: the pipe operator

The pipe operator shines when we chain multiple operations.

function1(
  function2(
    function3(
      function4(
        data,
        arg4_1,
        arg4_2
      ),
      arg3_1
    ),
    arg2_1,
    arg2_2,
    arg2_3
  ),
  arg1_1
)

data |>
  function4(arg4_1, arg4_2) |>
  function3(arg3_1) |>
  function2(arg2_1, arg2_2, arg2_3) |>
  function1(arg1_1)

6 Selecting variables of interest

We have established that variables DATAFLOW, LAST UPDATE, freq, and na_item are not very relevant for our analysis.

We want to focus on the remaining variables.
We can use the select() function to keep only the variables we are interested in.

Selecting variables of interest

We can use the select() function to keep only the variables we are interested in.

There are multiple ways to achieve the desired result.

6.1 Explicit selection

First, we can explicitly list the variables we want to keep.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG)

# A tibble: 1,845 × 5
   unit                                     geo   TIME_PERIOD OBS_VALUE OBS_FLAG
   <chr>                                    <chr>       <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per c… Alba…        2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per c… Alba…        2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per c… Alba…        2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per c… Alba…        2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per c… Alba…        2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per c… Alba…        2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per c… Alba…        2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per c… Alba…        2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per c… Alba…        2008      2850 d       
10 Chain linked volumes (2010), euro per c… Alba…        2009      2960 <NA>    
# ℹ 1,835 more rows

6.2 Vector selection

Second, we can concatenate the variables we wish to keep in a vector and pass it to the select() function.

In R, we concatenate objects using the c() function.

For example:

c(1, 2, 3, 4)

[1] 1 2 3 4

c(1.2, 10 / 3, pi)

[1] 1.200000 3.333333 3.141593

c("a", "b", "c")

[1] "a" "b" "c"

Vector selection

Second, we can concatenate the variables we wish to keep in a vector and pass it to the select() function.

sdg |>
  select(c(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG))

# A tibble: 1,845 × 5
   unit                                     geo   TIME_PERIOD OBS_VALUE OBS_FLAG
   <chr>                                    <chr>       <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per c… Alba…        2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per c… Alba…        2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per c… Alba…        2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per c… Alba…        2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per c… Alba…        2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per c… Alba…        2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per c… Alba…        2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per c… Alba…        2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per c… Alba…        2008      2850 d       
10 Chain linked volumes (2010), euro per c… Alba…        2009      2960 <NA>    
# ℹ 1,835 more rows

Vector selection

Second, we can concatenate the variables we wish to keep in a vector and pass it to the select() function.

sdg |>
  select(c(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG))

This approach is not very useful when we are listing variables we want to keep because using c() is superfluous.

However, it is useful when we intend to exclude variables.

6.3 Selection via exclusion

Third, we can exclude the variables we would like to drop.

sdg |>
  select(!c(DATAFLOW, `LAST UPDATE`, freq, na_item))

# A tibble: 1,845 × 5
   unit                                     geo   TIME_PERIOD OBS_VALUE OBS_FLAG
   <chr>                                    <chr>       <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per c… Alba…        2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per c… Alba…        2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per c… Alba…        2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per c… Alba…        2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per c… Alba…        2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per c… Alba…        2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per c… Alba…        2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per c… Alba…        2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per c… Alba…        2008      2850 d       
10 Chain linked volumes (2010), euro per c… Alba…        2009      2960 <NA>    
# ℹ 1,835 more rows

6.4 A closer look at `select()`’s results

The select() function operates on the columns of a data frame.
It returns a new data frame with only the columns specified in the function call.
It does not change the rows of the data frame.
We can verify this using the nrow() function, which returns the number of rows in a data frame.

nrow(sdg)

[1] 1845

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  nrow()

[1] 1845

7 Renaming variables

The variable TIME_PERIOD is among the variables we selected.
The values of this variable indicate the year in which the GDP values of an observation were recorded.
How can we rename this variable to year?

7.1 Using `rename()`

The rename() function of package dplyr can be used to rename variables.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD)

We use a keyword argument to specify the operation we want to perform.
The new name of the variable is on the left-hand side of =.
The old name of the variable is on the right-hand side of =.

Using `rename()`

The rename() function of package dplyr can be used to rename variables.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD)

# A tibble: 1,845 × 5
   unit                                         geo      year OBS_VALUE OBS_FLAG
   <chr>                                        <chr>   <dbl>     <dbl> <chr>   
 1 Chain linked volumes (2010), euro per capita Albania  2000      1700 <NA>    
 2 Chain linked volumes (2010), euro per capita Albania  2001      1850 <NA>    
 3 Chain linked volumes (2010), euro per capita Albania  2002      1940 <NA>    
 4 Chain linked volumes (2010), euro per capita Albania  2003      2060 <NA>    
 5 Chain linked volumes (2010), euro per capita Albania  2004      2180 <NA>    
 6 Chain linked volumes (2010), euro per capita Albania  2005      2310 <NA>    
 7 Chain linked volumes (2010), euro per capita Albania  2006      2460 <NA>    
 8 Chain linked volumes (2010), euro per capita Albania  2007      2630 <NA>    
 9 Chain linked volumes (2010), euro per capita Albania  2008      2850 d       
10 Chain linked volumes (2010), euro per capita Albania  2009      2960 <NA>    
# ℹ 1,835 more rows

Using `rename()`

We can use rename() to rename multiple variables at once.

For example, we can simultaneously rename TIME_PERIOD and OBS_FLAG to year and flag, respectively.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG)

Using `rename()`

We can use rename() to rename multiple variables at once.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG)

# A tibble: 1,845 × 5
   unit                                         geo      year OBS_VALUE flag 
   <chr>                                        <chr>   <dbl>     <dbl> <chr>
 1 Chain linked volumes (2010), euro per capita Albania  2000      1700 <NA> 
 2 Chain linked volumes (2010), euro per capita Albania  2001      1850 <NA> 
 3 Chain linked volumes (2010), euro per capita Albania  2002      1940 <NA> 
 4 Chain linked volumes (2010), euro per capita Albania  2003      2060 <NA> 
 5 Chain linked volumes (2010), euro per capita Albania  2004      2180 <NA> 
 6 Chain linked volumes (2010), euro per capita Albania  2005      2310 <NA> 
 7 Chain linked volumes (2010), euro per capita Albania  2006      2460 <NA> 
 8 Chain linked volumes (2010), euro per capita Albania  2007      2630 <NA> 
 9 Chain linked volumes (2010), euro per capita Albania  2008      2850 d    
10 Chain linked volumes (2010), euro per capita Albania  2009      2960 <NA> 
# ℹ 1,835 more rows

8 Mutating variables

If we examine the unit variable in the resulting data frame, we observe it takes two distinct values.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  distinct(unit)

# A tibble: 2 × 1
  unit                                                                  
  <chr>                                                                 
1 Chain linked volumes (2010), euro per capita                          
2 Chain linked volumes, percentage change on previous period, per capita

These values are very informative, but they might be a bit verbose for our analysis.

8.1 Using `mutate()`

The dplyr’s mutate() function allows us to modify existing variables or create new variables based on the existing ones.
We can use mutate() by passing one or more keyword arguments that specify the desired transformations.

8.2 Specifying a `mutate()` transformation

In our case, we want to use mutate to map the values of the unit such that:
- “Chain linked volumes, percentage change on previous period, per capita” is mapped to “growth”, and
- “Chain linked volumes (2010), euro per capita” is mapped to “output”.

Specifying a `mutate()` transformation

Equivalently, we want to use mutate to map the values of the unit such that for each value \(x\)
- \(f(x)\) = “growth” if x = “Chain linked volumes, percentage change on previous period, per capita”, and
- \(f(x)\) = “output” if x = “Chain linked volumes (2010), euro per capita”.

Specifying a `mutate()` transformation

Equivalently, we want to use mutate to map the values of the unit such that for each value \(x\)
- \(f(x)\) = “growth” if x contains the word “percentage”, and
- \(f(x)\) = “output” otherwise.

Specifying a `mutate()` transformation

In pseudocode, we can procedurally describe our transformation as follows:

if x contains "percentage":
  return "growth"
else
  return "output"

Conveniently, with some preparation, we can write this transformation as a single-liner in R!

8.3 Programming digression: using `grepl()`

We can use the (obscurely named?) base R function grepl() to check if a value contains the word “percentage”.
This is a powerful function that can be used in many ways, but for now:
- The function receives an argument pattern that specifies the word we are looking for, and
- an argument x that specifies the value we are examining.
- It returns TRUE if the pattern is found, and FALSE otherwise.

Programming digression: using `grepl()`

For example:

grepl(pattern = "percentage", x = "does not have it")

[1] FALSE

grepl(pattern = "percentage", x = "has percentage")

[1] TRUE

If one remembers the correct order of the arguments, their keywords can be omitted, i.e.,

grepl("percentage", "does not have it")

[1] FALSE

grepl("percentage", "has percentage")

[1] TRUE

Programming digression: using `grepl()`

With grepl(), we can rewrite our pseudocode in syntactically valid R code:

if (grepl("percentage", x)) {
  "growth"
} else {
  "output"
}

However, the last code snippet contains an if-else control structure, which cannot be directly used in mutate().

8.4 Programming digression: using `ifelse()`

We can use the base R function ifelse(), which provides a functional form of the if-else control structure.
The function receives three arguments:
- The first argument is a logical value.
- The second argument is the value to return if the logical value is TRUE.
- The third argument is the value to return if the logical value is FALSE.

Programming digression: using `ifelse()`

For example:

ifelse(
  grepl("percentage", "does not have it"),
  "Found it!",
  "Nope"
)

[1] "Nope"

ifelse(
  grepl("percentage", "has percentage"),
  "Found it!",
  "Nope"
)

[1] "Found it!"

Programming digression: using `ifelse()`

With grepl() and ifelse combined, we can rewrite our pseudocode in a single-line syntactically valid R code:

ifelse(grepl("percentage", x), "growth", "output")

8.5 A first `mutate()` transformation

We are now ready to use mutate() to simplify the values of the unit variable.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output"))

The mutate() function receives a keyword argument that specifies the transformation we want to perform.
Similarly to rename(), the new name or the modified variable name is at the left-hand side of =.
The transformation is specified at the right-hand side of = and can use one or more existing variables.

A first `mutate()` transformation

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output"))

# A tibble: 1,845 × 5
   unit   geo      year OBS_VALUE flag 
   <chr>  <chr>   <dbl>     <dbl> <chr>
 1 output Albania  2000      1700 <NA> 
 2 output Albania  2001      1850 <NA> 
 3 output Albania  2002      1940 <NA> 
 4 output Albania  2003      2060 <NA> 
 5 output Albania  2004      2180 <NA> 
 6 output Albania  2005      2310 <NA> 
 7 output Albania  2006      2460 <NA> 
 8 output Albania  2007      2630 <NA> 
 9 output Albania  2008      2850 d    
10 output Albania  2009      2960 <NA> 
# ℹ 1,835 more rows

9 Filtering data

A filtering operation is a transformation that removes rows from a dataset based on a logical condition.

Filtering data

The so-far transformed data frame contains observations for multiple years.
We can use the base R function range() to find the range of years in the dataset.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  distinct(year) |>
  range()

[1] 2000 2023

If we want to focus only on the latest year for analysis, how can we remove the observations of the other years?

9.1 Using `filter()`

The filter() function of the dplyr package can be used to filter rows.
The function takes a logical condition as an argument.
The condition is evaluated for each row in the dataset.
- If the condition is TRUE, the row is kept.
- Otherwise, the row is filtered out.

9.2 Programming digression: numerical logical conditions in `R`

Basic numerical logical conditions in R can be specified using the following operators:
- == (equal to)
- != (not equal to)
- > (greater than)
- >= (greater than or equal to)
- < (less than)
- <= (less than or equal to)

9.3 Specifying a filtering condition

In our case, we want to keep only the rows where the year variable is equal to the maximum year in the dataset.
Namely, we aim to keep only the observations with year equal to 2023.
Or syntactically, we aim to keep rows that satisfy the condition year == 2023.

9.4 A first filtering operation

Thus, we can filter out observations with years apart from 2023 as follows:

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023)

# A tibble: 72 × 5
   unit   geo                                    year OBS_VALUE flag 
   <chr>  <chr>                                 <dbl>     <dbl> <chr>
 1 output Austria                                2023     37860 <NA> 
 2 output Belgium                                2023     37310 p    
 3 output Bulgaria                               2023      7900 <NA> 
 4 output Switzerland                            2023     63870 p    
 5 output Cyprus                                 2023     29080 p    
 6 output Czechia                                2023     18480 <NA> 
 7 output Germany                                2023     36290 p    
 8 output Denmark                                2023     52510 <NA> 
 9 output Euro area - 19 countries  (2015-2022)  2023     32340 <NA> 
10 output Euro area – 20 countries (from 2023)   2023     32150 <NA> 
# ℹ 62 more rows

10 Arranging data

Arranging transformations are row-wise operations that change the order of rows in a dataset based on the values of one or more columns.

10.1 An arranging example

Suppose we want to arrange the data of the so-far transformed dataset so that observations appear in ascending (lexicographic) order with respect to the country names.

We can use the arrange() function from the dplyr package.

10.2 Using `arrange()`

The arrange() function expects one or more column names as arguments.
First, it sorts the dataset by the first provided column.
If there are ties, it sorts by the second column, and so on.
By default, observations are sorted in ascending order.

Using `arrange()`

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo)

# A tibble: 72 × 5
   unit   geo       year OBS_VALUE flag 
   <chr>  <chr>    <dbl>     <dbl> <chr>
 1 output Austria   2023   37860   <NA> 
 2 growth Austria   2023      -1.8 <NA> 
 3 output Belgium   2023   37310   p    
 4 growth Belgium   2023       0.4 p    
 5 output Bulgaria  2023    7900   <NA> 
 6 growth Bulgaria  2023       2.2 <NA> 
 7 output Croatia   2023   15020   p    
 8 growth Croatia   2023       2.7 p    
 9 output Cyprus    2023   29080   p    
10 growth Cyprus    2023       1   p    
# ℹ 62 more rows

Using `arrange()`

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit)

# A tibble: 72 × 5
   unit   geo       year OBS_VALUE flag 
   <chr>  <chr>    <dbl>     <dbl> <chr>
 1 growth Austria   2023      -1.8 <NA> 
 2 output Austria   2023   37860   <NA> 
 3 growth Belgium   2023       0.4 p    
 4 output Belgium   2023   37310   p    
 5 growth Bulgaria  2023       2.2 <NA> 
 6 output Bulgaria  2023    7900   <NA> 
 7 growth Croatia   2023       2.7 p    
 8 output Croatia   2023   15020   p    
 9 growth Cyprus    2023       1   p    
10 output Cyprus    2023   29080   p    
# ℹ 62 more rows

11 Data pivoting

Information is often not uniquely represented by a single data frame structure.

Two or more data structures can represent the same information.
While from a conceptual standpoint, this poses no problem as the information is the same, from a practical standpoint, the representation can be very impactful.

11.1 Data frame structures

What is the most suitable data frame structure?

To answer this question, it is helpful to contemplate the following two questions:
1. Which data frame structure is most convenient for the analysis?
2. Which data frame structure is most appropriate for the computational environment?

Data frame structures

Which data frame structure is most convenient for accessing information relevant to the analysis?

The answer to the first question is case-specific.
As a general rule of thumb, the most convenient data frame structure is the one that allows for the most straightforward access to the information relevant to the analysis.

Data frame structures

Which data frame structure is most appropriate for the computational environment?

The answer to the second question depends on the tools we use.
Because R is vectorized (a vector-oriented language), it is more efficient to use a column-oriented data frame structure.
I.e., it is usually a good idea to have the information we use in computations frequently as columns (variables).

11.2 Wide and long formats

Examining the so-far transformed data frame once more, we observe that each country appears in two rows.

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)

Here, we use print(n = 5) to display only the first five (instead of ten) rows and save some space.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)

# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)

# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows

One row holds growth data in OBS_VALUE.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)

# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows

The other holds output data in OBS_VALUE.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)

# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows

The data is in the so-called long format.

Wide and long formats

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  arrange(geo, unit) |>
  print(n = 5)

# A tibble: 72 × 5
  unit   geo       year OBS_VALUE flag 
  <chr>  <chr>    <dbl>     <dbl> <chr>
1 growth Austria   2023      -1.8 <NA> 
2 output Austria   2023   37860   <NA> 
3 growth Belgium   2023       0.4 p    
4 output Belgium   2023   37310   p    
5 growth Bulgaria  2023       2.2 <NA> 
# ℹ 67 more rows

We want to transform the data into a wide format so that each variable appears in a separate column.

11.3 Pivoting, widening, and lengthening

The process of transforming data from long to wide format or vice versa is called data pivoting.
Transforming data from long to wide format is called widening.
Transforming data from wide to long format is called lengthening.

Pivoting, widening, and lengthening

In R, we can easily pivot data using the tidyr package.
- We can use the pivot_wider() function to pivot data from long to wide format.
- We can use the pivot_longer() function to pivot data from wide to long format.

11.4 A widening example

In our case, we want to widen the data, so we will use pivot_wider().

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  pivot_wider(names_from = unit, values_from = OBS_VALUE)

Using pivot_wider() requires that we specify two arguments:
1. Existing variable(s) that hold column names (names_from),
2. Existing variable(s) that hold values (values_from).

A widening example

sdg |>
  select(unit, geo, TIME_PERIOD, OBS_VALUE, OBS_FLAG) |>
  rename(year = TIME_PERIOD, flag = OBS_FLAG) |>
  mutate(unit = ifelse(grepl("percentage", unit), "growth", "output")) |>
  filter(year == 2023) |>
  pivot_wider(names_from = unit, values_from = OBS_VALUE)

# A tibble: 36 × 5
   geo                                    year flag  output growth
   <chr>                                 <dbl> <chr>  <dbl>  <dbl>
 1 Austria                                2023 <NA>   37860   -1.8
 2 Belgium                                2023 p      37310    0.4
 3 Bulgaria                               2023 <NA>    7900    2.2
 4 Switzerland                            2023 p      63870   -0.8
 5 Cyprus                                 2023 p      29080    1  
 6 Czechia                                2023 <NA>   18480   -1.2
 7 Germany                                2023 p      36290   -1.1
 8 Denmark                                2023 <NA>   52510    1.8
 9 Euro area - 19 countries  (2015-2022)  2023 <NA>   32340   -0.3
10 Euro area – 20 countries (from 2023)   2023 <NA>   32150   -0.2
# ℹ 26 more rows

References

Eurostat. 2025a. “Employed ICT Specialists by Sex.” https://doi.org/10.2908/isoc_sks_itsps.

———. 2025b. “Employment by Sex, Age and Citizenship.” https://doi.org/10.2908/isoc_sks_itsps.

———. 2025c. “Real GDP Per Capita.” https://doi.org/10.2908/sdg_08_10.

1 Behind the jump start

2 Transforming data

2.1 Transformation types

Transformation types

3 Original data sources

Original data sources

4 A first look at the GDP data

4.1 Reading tibble’s from standard output

Reading tibble’s from standard output

Reading tibble’s from standard output

Reading tibble’s from standard output

Reading tibble’s from standard output

Reading tibble’s from standard output

Reading tibble’s from standard output

4.2 The GDP dataset variables

5 Finding distinct values

Finding distinct values

Finding distinct values

5.1 A closer look at distinct()’s result

A closer look at distinct()’s result

5.2 A closer look at distinct()’s arguments

A closer look at distinct()’s arguments

5.3 Programming digression: the pipe operator

Programming digression: the pipe operator

Programming digression: the pipe operator

Programming digression: the pipe operator

Programming digression: the pipe operator

6 Selecting variables of interest

Selecting variables of interest

6.1 Explicit selection

6.2 Vector selection

Vector selection

Vector selection

6.3 Selection via exclusion

6.4 A closer look at select()’s results

7 Renaming variables

7.1 Using rename()

Using rename()

Using rename()

Using rename()

8 Mutating variables

8.1 Using mutate()

8.2 Specifying a mutate() transformation

Specifying a mutate() transformation

Specifying a mutate() transformation

Specifying a mutate() transformation

8.3 Programming digression: using grepl()

Programming digression: using grepl()

Programming digression: using grepl()

8.4 Programming digression: using ifelse()

Programming digression: using ifelse()

Programming digression: using ifelse()

8.5 A first mutate() transformation

A first mutate() transformation

9 Filtering data

Filtering data

9.1 Using filter()

9.2 Programming digression: numerical logical conditions in R

9.3 Specifying a filtering condition

9.4 A first filtering operation

10 Arranging data

10.1 An arranging example

10.2 Using arrange()

Using arrange()

Using arrange()

11 Data pivoting

11.1 Data frame structures

Data frame structures

Data frame structures

11.2 Wide and long formats

Wide and long formats

Wide and long formats

Wide and long formats

Wide and long formats

Wide and long formats

11.3 Pivoting, widening, and lengthening

Pivoting, widening, and lengthening

11.4 A widening example

A widening example

References

4.1 Reading `tibble`’s from standard output

Reading `tibble`’s from standard output

Reading `tibble`’s from standard output

Reading `tibble`’s from standard output

Reading `tibble`’s from standard output

Reading `tibble`’s from standard output

Reading `tibble`’s from standard output

5.1 A closer look at `distinct()`’s result

A closer look at `distinct()`’s result

5.2 A closer look at `distinct()`’s arguments

A closer look at `distinct()`’s arguments

6.4 A closer look at `select()`’s results

7.1 Using `rename()`

Using `rename()`

Using `rename()`

Using `rename()`

8.1 Using `mutate()`

8.2 Specifying a `mutate()` transformation

Specifying a `mutate()` transformation

Specifying a `mutate()` transformation

Specifying a `mutate()` transformation

8.3 Programming digression: using `grepl()`

Programming digression: using `grepl()`

Programming digression: using `grepl()`

8.4 Programming digression: using `ifelse()`

Programming digression: using `ifelse()`

Programming digression: using `ifelse()`

8.5 A first `mutate()` transformation

A first `mutate()` transformation

9.1 Using `filter()`

9.2 Programming digression: numerical logical conditions in `R`

10.2 Using `arrange()`

Using `arrange()`

Using `arrange()`