An introduction to importing data

Comma-separated vales format. Importing data from the disk or the web. Relative and absolute paths. An overview of other data file formats.

Published

May 1st, 2025

1 Back at the sources

  • We used the readr package to load data from CSV files.
  • In this topic, we will take a bit closer look at the readr package and its functions.

1.1 A look at the CSV files

  • A very common way for sharing statistical data is via the Comma Separated Values (CSV) file format.
  • CSV data has a rectangular shape.
  • More often than not, they come with a header row specifying the column names.
  • Values are separated by a comma.
  • Alphanumeric values are usually, but not always, enclosed in double quotes.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • We observe that the estat_sdg_08_10_en.csv file has a header row, from which the initial column names of the transformation overview topic’s data frame were inferred.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • Values in each row are separated by commas.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • The values of some columns are not enclosed in double quotes.

A look at the CSV files

  • We can examine the first few lines of the estat_sdg_08_10_en.csv as an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,
  • While values in other columns are enclosed in double quotes.
  • Observe that the values of these columns contain commas in their text.
  • We enclose such values in double quotes to avoid confusion with the column separator of the CSV file.

2 Importing CSV files

  • We can easily import CSV-formatted data into R using the read_csv() function from the readr package.
library(readr)

2.1 Importing data with read_csv()

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(file = "data/estat_sdg_08_10_en.csv")
  • In this case, we indicate to read_csv() that the file is stored locally on our computer, in the data directory.
  • We can specify the file location as an absolute path, as a relative path, or as a web address.
  • In this case, we used a relative path.

2.2 Relative paths

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(file = "data/estat_sdg_08_10_en.csv")
  • Relative paths are resolved based on the current working directory.
  • For instance, if the current working directory is C:/Users/user, then the relative path we supplied points to the file:

C:/Users/user/data/estat_sdg_08_10_en.csv

2.3 Absolute paths

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(file = "/home/user/data/estat_sdg_08_10_en.csv")
  • Absolute paths are resolved based on the root directory of the file system.
  • If the file argument we supply starts with a /, then read_csv interprets it as an absolute path.
  • The file argument points to the exact location of the file in the file system.

2.4 Web addresses

  • Calling read_csv() requires specifying the location of the CSV file with the file argument.
read_csv(
  file = "https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/sdg_08_10/1.0"
)
  • Web addresses refer to files stored on remote servers.
  • If the file argument we supply starts with a protocol (e.g., http:// or https://), then read_csv interprets it as a web address.
  • The file is downloaded from the web and imported into R.

2.5 Reading CSV data

  • Calling read_csv() returns a tibble (data frame) with the contents of the CSV file.
  • We can assign the result to a variable to store the data frame in memory.
  • Assigning objects to variable names in R is done with the assignment operator: <-.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")

Reading CSV data

  • Calling read_csv() returns a tibble (data frame) with the contents of the CSV file.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
  • We can read the last expression from right to left as “the result of reading the CSV file is assigned to the variable sdg.”

Reading CSV data

  • By assigning the data to the sdg variable, we do not need to read the CSV file again to access the data.
sdg
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
read_csv(file = "data/estat_sdg_08_10_en.csv")
# A tibble: 1,845 × 9
   DATAFLOW        `LAST UPDATE` freq  unit  na_item geo   TIME_PERIOD OBS_VALUE
   <chr>           <chr>         <chr> <chr> <chr>   <chr>       <dbl>     <dbl>
 1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2000      1700
 2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2001      1850
 3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2002      1940
 4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2003      2060
 5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2004      2180
 6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2005      2310
 7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2006      2460
 8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2007      2630
 9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2008      2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba…        2009      2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>

2.6 A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • We are informed that 1845 rows and 9 columns were imported.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • The delimiter used to separate values is a comma.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Seven columns were parsed as character vectors (chr).
  • Two columns were parsed as double precision numbers (dbl). We can roughly consider them as real numbers.

A look at read_csv() messages

  • Calling read_csv() also prints some information.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv")
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • We can get more information about the data types using spec().
  • We can suppress this message with a couple of options.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(file = "data/estat_sdg_08_10_en.csv", show_col_types = FALSE)
  • One way is to pass show_col_types = FALSE to read_csv().
  • Another way is to manually specify (some of the) column types.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • Column types are specified as a list in the col_types argument.
  • Each item in the list specifies a column name and a corresponding type.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • The unit column is parsed as a factor (categorical variable).

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • The TIME_PERIOD column is parsed as an integer variable.

A look at read_csv() messages

  • We can suppress this message with a couple of options.
sdg <- read_csv(
  file = "data/estat_sdg_08_10_en.csv",
  col_types = list(
    unit = col_factor(),
    TIME_PERIOD = col_integer(),
    `LAST UPDATE` = col_datetime(format = "%d/%m/%y %H:%M:%S")
  )
)
  • The LAST UPDATE column is parsed as a date-time variable.

3 Importing files with other formats

  • The readr package provides a family of functions to import data from various file formats.
  • Their calling conventions are similar to read_csv().

3.1 Comma-separated data formats

Function Data Format Description
read_csv() CSV Reads comma-separated values
read_csv2() CSV Reads semicolon-separated (;) values
read_tsv() TSV Reads tab-separated values
read_delim() Delimited Separated Values Reads values separated by a custom delimiter

3.2 Other data formats

Function Data Format Description
read_fwf() Fixed Width Format Reads fixed-width formatted values
read_table() Table Special case of fixed-width format
read_log() Log files Reads log files