Tech news 5: Working with data files too large for memory
Today, I’d like to discuss data we might need that are larger than memory, or that fit in memory but make everything freeze…
It could be a very large data set such as BioTIME, GIS data, or model or simulation results that turned out accidentally huge but whose contents you still want to have a look at.
Just a peek inside
First, you could have a look inside a CSV by reading only parts of it: loading only some columns and/or only some rows.
library(data.table)

# Reading only some columns
fread(file = "data/big_file.csv",
      select = c("site", "year", "temperature"))
# select has its opposite argument, drop

# Reading only the column names and types (an empty table)
fread(file = "data/big_file.csv", nrows = 0)

# Reading the first 100 rows
fread(file = "data/big_file.csv", nrows = 100)

# Reading rows 10001 to 10100
fread(file = "data/big_file.csv", skip = 10000, nrows = 100)
Using factors instead of character can save quite a lot of memory space too:
library(data.table)

fread(file = "data/communities_raw.csv",
      stringsAsFactors = FALSE) |>
  object.size() / 10^6
# 414.4 Mbytes

fread(file = "data/communities_raw.csv",
      stringsAsFactors = TRUE) |>
  object.size() / 10^6
# 271.1 Mbytes
The function is called fread because it reads fast: it uses several cores if available, it is very good at guessing column types, and it shows a progress bar on large files.
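As a small aside (not from the original examples, just a sketch): you can check how many threads data.table will use, and fread()’s nThread argument lets you override it for a single call.
library(data.table)
getDTthreads()                              # how many threads data.table will use
# fread("data/big_file.csv", nThread = 2)   # or set it explicitly for one call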
Smaller than memory but dplyr is slow?
Maybe the dplyr data-wrangling step you run before your analyses takes a few minutes or even a few hours, and you wouldn’t mind speeding it up without having to rewrite everything… The tidyverse developers thought so too, and they created dtplyr to help everyone with that. Add library(dtplyr) at the beginning of your script, wrap your data with lazy_dt(your_data), and bam: all your dplyr verbs are translated into data.table calls in the background, and you don’t have to change anything else in your script (see the sketch below). data.table may be faster thanks to two advantages: 1) fast, optimised implementations of functions such as mean() and many others, and 2) the ability to operate by reference, i.e. without your column being copied to a different place in memory and a new spot being reserved to write the result of your operation, because it is all done in the same place in memory.
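Here is a minimal sketch of that workflow; the data frame and column names are made up for illustration.
library(dplyr)
library(dtplyr)

# Hypothetical example data, just to show the workflow
communities <- data.frame(
  site      = rep(c("A", "B"), each = 3),
  year      = rep(2001:2003, times = 2),
  abundance = c(5, 8, 2, 4, 7, 1)
)

# Wrap the data frame once; from here on dplyr verbs are translated to data.table
communities_dt <- lazy_dt(communities)

communities_dt |>
  group_by(site) |>
  summarise(total = sum(abundance)) |>
  as_tibble()   # as_tibble() (or collect()) runs the translated data.table code
You can also call show_query() on the lazy object to see the data.table code that dtplyr generated for you.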
Larger than memory using Arrow
Arrow is a cross-language, multi-platform suite of tools, written in C++, for working with in-memory and larger-than-memory data. You can use it to access large files and run your usual data-wrangling operations on them, even using dplyr verbs.
First read your data with one of the arrow functions:
- read_delim_arrow(): read a delimited text file
- read_csv_arrow(): read a comma-separated values (CSV) file
- read_tsv_arrow(): read a tab-separated values (TSV) file
- read_parquet(): read a file in Parquet format
- read_feather(): read a file in Arrow/Feather format
Arrow can read the whole data set, or it can read some information about it without loading the data into memory: column names, types, sizes, things like that.
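For example, with a larger-than-memory data set stored as Parquet files (the folder name here is made up), you can open it lazily and inspect its structure without reading the values:
library(arrow)

# Hypothetical folder of Parquet files; open_dataset() also reads single files,
# or CSVs with format = "csv", without pulling the values into memory
dset <- open_dataset("data/communities_parquet/")

dset          # prints the schema: column names and types, no data loaded
names(dset)   # just the column names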
Now say you want to run operations on these data that R alone couldn’t, because they don’t fit in memory: arrow is going to read your operations, translate them, and execute them all at once when needed:
library(dplyr)

dset <- dset %>%
  group_by(subset) %>%
  summarize(mean_x = mean(x), min_y = min(y)) %>%
  filter(mean_x > 0) %>%
  arrange(subset)
# No operations were executed yet

dset %>% collect() # operations are executed and results given
Only once dplyr::collect() is called are the operations run, outside of R, by arrow. This means the workflow can be much longer and have (much) more intermediate steps, but data are loaded into memory only when R actually needs them, for example for plotting or running a statistical analysis.
Spatial Data
Here we will be looking at sf, stars (for rasters) and dbplyr (as in databases…). This is a little more advanced and specialised, so I won’t go into much detail, but here are a few things I liked. Cropping a spatial object even before loading it into R, using the wkt_filter argument of sf::st_read():
library(sf)

file <- "data/nc.gpkg"
c(xmin = -82, ymin = 36, xmax = -80, ymax = 37) |>
  st_bbox() |> st_as_sfc() |> st_as_text() -> bb

st_read(file, wkt_filter = bb) |> nrow()
# 17 # out of 100
Even easier if you can write SQL queries directly:
<- "select BIR74,SID74,geom from 'nc.gpkg' where BIR74 > 1500"
q read_sf(file, query = q) |> nrow()
61 # out of 100
Using stars, you can read a raster file without loading it into memory. This is quite similar to arrow in the previous section: a 100+ Mbytes file results in a 12 Mbytes object in memory in R.
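As a small illustration (the file name is made up), reading a raster as a proxy object keeps the pixel values on disk:
library(stars)

# Hypothetical large GeoTIFF; proxy = TRUE reads only the metadata
# (dimensions, bands, CRS), not the pixel values
r <- read_stars("data/big_raster.tif", proxy = TRUE)

r                # prints dimensions and band information
object.size(r)   # small, the values stayed on disk
plot(r)          # values are read (and downsampled) only when actually needed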
Other interesting tools
It seems that packages dedicated to running statistical models (lm, glm, etc.) directly on data sets too big for memory were a thing a few years ago, but I can’t find recent packages targeting this problem… biglm hasn’t been updated since 2020.
Great resources
dtplyr
Arrow and dplyr
- https://arrow.apache.org/docs/r/articles/arrow.html#analyzing-arrow-data-with-dplyr
- https://arrow.apache.org/cookbook/r/index.html
- https://arrow-user2022.netlify.app/hello-arrow
- https://hbs-rcs.github.io/large_data_in_R/#solution-example # examples
- https://jthomasmock.github.io/arrow-dplyr/#/ # presentation
- https://posit-conf-2023.github.io/arrow/materials/5_arrow_single_file.html#/single-file-api # presentation
- https://www.r-bloggers.com/2021/09/understanding-the-parquet-file-format/ #parquet data format
Spatial data
Happy to talk about it, share experiences, help you implement something (don’t you also think dtplyr
sounds like a great and easy tool?!) and hear your comments!
Best wishes,
Alban