An introduction to importing data
Comma-separated vales format. Importing data from the disk or the web. Relative and absolute paths. An overview of other data file formats.
1 Back at the sources
- In the transformation overview topic, we have imported the original Eurostat data without dwelling too much on the details.
- We used the
readrpackage to load data from CSV files. - In this topic, we will take a bit closer look at the
readrpackage and its functions.
1.1 A look at the CSV files
- A very common way for sharing statistical data is via the Comma Separated Values (CSV) file format.
- CSV data has a rectangular shape.
- More often than not, they come with a header row specifying the column names.
- Values are separated by a comma.
- Alphanumeric values are usually, but not always, enclosed in double quotes.
A look at the CSV files
- We can examine the first few lines of the
estat_sdg_08_10_en.csvas an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,A look at the CSV files
- We can examine the first few lines of the
estat_sdg_08_10_en.csvas an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,- We observe that the
estat_sdg_08_10_en.csvfile has a header row, from which the initial column names of the transformation overview topic’s data frame were inferred.
A look at the CSV files
- We can examine the first few lines of the
estat_sdg_08_10_en.csvas an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,- Values in each row are separated by commas.
A look at the CSV files
- We can examine the first few lines of the
estat_sdg_08_10_en.csvas an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,- The values of some columns are not enclosed in double quotes.
A look at the CSV files
- We can examine the first few lines of the
estat_sdg_08_10_en.csvas an example.
DATAFLOW,LAST UPDATE,freq,unit,na_item,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2000,1700,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2001,1850,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2002,1940,
ESTAT:SDG_08_10(1.0),20/12/24 11:00:00,Annual,"Chain linked volumes (2010), euro per capita",Gross domestic product at market prices,Albania,2003,2060,- While values in other columns are enclosed in double quotes.
- Observe that the values of these columns contain commas in their text.
- We enclose such values in double quotes to avoid confusion with the column separator of the CSV file.
2 Importing CSV files
- We can easily import CSV-formatted data into
Rusing theread_csv()function from thereadrpackage.
2.1 Importing data with read_csv()
- Calling
read_csv()requires specifying the location of the CSV file with thefileargument.
- In this case, we indicate to
read_csv()that the file is stored locally on our computer, in thedatadirectory. - We can specify the file location as an absolute path, as a relative path, or as a web address.
- In this case, we used a relative path.
2.2 Relative paths
- Calling
read_csv()requires specifying the location of the CSV file with thefileargument.
- Relative paths are resolved based on the current working directory.
- For instance, if the current working directory is
C:/Users/user, then the relative path we supplied points to the file:
C:/Users/user/data/estat_sdg_08_10_en.csv
2.3 Absolute paths
- Calling
read_csv()requires specifying the location of the CSV file with thefileargument.
- Absolute paths are resolved based on the root directory of the file system.
- If the
fileargument we supply starts with a/, thenread_csvinterprets it as an absolute path. - The
fileargument points to the exact location of the file in the file system.
2.4 Web addresses
- Calling
read_csv()requires specifying the location of the CSV file with thefileargument.
- Web addresses refer to files stored on remote servers.
- If the
fileargument we supply starts with a protocol (e.g.,http://orhttps://), thenread_csvinterprets it as a web address. - The file is downloaded from the web and imported into
R.
2.5 Reading CSV data
- Calling
read_csv()returns a tibble (data frame) with the contents of the CSV file.
- We can assign the result to a variable to store the data frame in memory.
- Assigning objects to variable names in
Ris done with the assignment operator:<-.
Reading CSV data
- Calling
read_csv()returns a tibble (data frame) with the contents of the CSV file.
- We can read the last expression from right to left as “the result of reading the CSV file is assigned to the variable
sdg.”
Reading CSV data
- By assigning the data to the
sdgvariable, we do not need to read the CSV file again to access the data.
# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
# A tibble: 1,845 × 9
DATAFLOW `LAST UPDATE` freq unit na_item geo TIME_PERIOD OBS_VALUE
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2000 1700
2 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2001 1850
3 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2002 1940
4 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2003 2060
5 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2004 2180
6 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2005 2310
7 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2006 2460
8 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2007 2630
9 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2008 2850
10 ESTAT:SDG_08_1… 20/12/24 11:… Annu… Chai… Gross … Alba… 2009 2960
# ℹ 1,835 more rows
# ℹ 1 more variable: OBS_FLAG <chr>
2.6 A look at read_csv() messages
- Calling
read_csv()also prints some information.
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A look at read_csv() messages
- Calling
read_csv()also prints some information.
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- We are informed that 1845 rows and 9 columns were imported.
A look at read_csv() messages
- Calling
read_csv()also prints some information.
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- The delimiter used to separate values is a comma.
A look at read_csv() messages
- Calling
read_csv()also prints some information.
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Seven columns were parsed as character vectors (
chr). - Two columns were parsed as double precision numbers (
dbl). We can roughly consider them as real numbers.
A look at read_csv() messages
- Calling
read_csv()also prints some information.
Rows: 1845 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): DATAFLOW, LAST UPDATE, freq, unit, na_item, geo, OBS_FLAG
dbl (2): TIME_PERIOD, OBS_VALUE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- We can get more information about the data types using
spec(). - We can suppress this message with a couple of options.
A look at read_csv() messages
- We can suppress this message with a couple of options.
- One way is to pass
show_col_types = FALSEtoread_csv().
- Another way is to manually specify (some of the) column types.
A look at read_csv() messages
- We can suppress this message with a couple of options.
- Column types are specified as a list in the
col_typesargument. - Each item in the list specifies a column name and a corresponding type.
A look at read_csv() messages
- We can suppress this message with a couple of options.
- The
unitcolumn is parsed as a factor (categorical variable).
A look at read_csv() messages
- We can suppress this message with a couple of options.
- The
TIME_PERIODcolumn is parsed as an integer variable.
A look at read_csv() messages
- We can suppress this message with a couple of options.
- The
LAST UPDATEcolumn is parsed as a date-time variable.
3 Importing files with other formats
- The
readrpackage provides a family of functions to import data from various file formats. - Their calling conventions are similar to
read_csv().
3.1 Comma-separated data formats
| Function | Data Format | Description |
|---|---|---|
read_csv() |
CSV | Reads comma-separated values |
read_csv2() |
CSV | Reads semicolon-separated (;) values |
read_tsv() |
TSV | Reads tab-separated values |
read_delim() |
Delimited Separated Values | Reads values separated by a custom delimiter |
3.2 Other data formats
| Function | Data Format | Description |
|---|---|---|
read_fwf() |
Fixed Width Format | Reads fixed-width formatted values |
read_table() |
Table | Special case of fixed-width format |
read_log() |
Log files | Reads log files |