Tech news 5: Working with data files too large for memory
Today, I’d like to discuss data we might need that are larger than memory, or that fit in memory but make everything freeze…
It could be a very large data set such as BioTIME, GIS data, or model or simulation results that turned out accidentally huge but whose contents you still want to have a look at.
Just a peek inside
First, you could have a look inside a CSV by reading only parts of it: loading only some columns and/or only some rows.
library(data.table)

# Reading only some columns
fread(file = "data/big_file.csv",
      select = c("site", "year", "temperature"))
# select has its opposite argument, drop

# Reading only the column names and types (an empty table)
fread(file = "data/big_file.csv", nrows = 0)

# Reading the first 100 rows
fread(file = "data/big_file.csv", nrows = 100)

# Reading rows 10001 to 10100
fread(file = "data/big_file.csv", skip = 10000, nrows = 100)
Using factors instead of character can save quite a lot of memory space too:
library(data.table)

fread(file = "data/communities_raw.csv",
      stringsAsFactors = FALSE) |>
  object.size() / 10^6
# 414.4 Mbytes

fread(file = "data/communities_raw.csv",
      stringsAsFactors = TRUE) |>
  object.size() / 10^6
# 271.1 Mbytes
The function is called fread because it reads fast: it uses several cores if available, it is very good at guessing column types, and it shows a progress bar on large files.
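As a small aside (not from the original examples, just a sketch): you can check how many threads data.table will use, and fread()’s nThread argument lets you override it for a single call.
library(data.table)
getDTthreads()                              # how many threads data.table will use
# fread("data/big_file.csv", nThread = 2)   # or set it explicitly for one call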
Smaller than memory but dplyr is slow?
Maybe the dplyr data-wrangling step you run before your analyses takes a few minutes or even a few hours, and you wouldn’t mind speeding it up without having to rewrite everything… The tidyverse developers thought so too, and they created dtplyr to help everyone with that. Add library(dtplyr) at the beginning of your script, wrap your data with lazy_dt(your_data), and bam: all your dplyr verbs are translated into data.table calls in the background, and you don’t have to change anything else in your script (see the sketch below). data.table may be faster thanks to two advantages: 1) fast, optimised implementations of functions such as mean() and many others, and 2) the ability to operate by reference, i.e. without your column being copied to a different place in memory and a new spot being reserved to write the result of your operation, because it is all done in the same place in memory.
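Here is a minimal sketch of that workflow; the data frame and column names are made up for illustration.
library(dplyr)
library(dtplyr)

# Hypothetical example data, just to show the workflow
communities <- data.frame(
  site      = rep(c("A", "B"), each = 3),
  year      = rep(2001:2003, times = 2),
  abundance = c(5, 8, 2, 4, 7, 1)
)

# Wrap the data frame once; from here on dplyr verbs are translated to data.table
communities_dt <- lazy_dt(communities)

communities_dt |>
  group_by(site) |>
  summarise(total = sum(abundance)) |>
  as_tibble()   # as_tibble() (or collect()) runs the translated data.table code
You can also call show_query() on the lazy object to see the data.table code that dtplyr generated for you.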
Larger than memory using Arrow
Arrow is a cross-language, multi-platform suite of tools, written in C++, for working with in-memory and larger-than-memory data. You can use it to access large files and run your usual data-wrangling operations on them, even using dplyr verbs.
First read your data with one of the arrow functions:
- read_delim_arrow(): read a delimited text file
- read_csv_arrow(): read a comma-separated values (CSV) file
- read_tsv_arrow(): read a tab-separated values (TSV) file
- read_parquet(): read a file in Parquet format
- read_feather(): read a file in Arrow/Feather format
Arrow can read the whole data set, or it can read some information about it without loading the data into memory: column names, types, sizes, things like that.
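For example, with a larger-than-memory data set stored as Parquet files (the folder name here is made up), you can open it lazily and inspect its structure without reading the values:
library(arrow)

# Hypothetical folder of Parquet files; open_dataset() also reads single files,
# or CSVs with format = "csv", without pulling the values into memory
dset <- open_dataset("data/communities_parquet/")

dset          # prints the schema: column names and types, no data loaded
names(dset)   # just the column names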
Now say you want to run operations on these data that R alone couldn’t, because they don’t fit in memory: arrow is going to read your operations, translate them, and execute them all at once when needed:
library(dplyr)

dset <- dset %>%
  group_by(subset) %>%
  summarize(mean_x = mean(x), min_y = min(y)) %>%
  filter(mean_x > 0) %>%
  arrange(subset)
# No operations were executed yet

dset %>% collect() # operations are executed and results given
Only once dplyr::collect() is called are the operations run, outside of R, by arrow. This means the workflow can be much longer and have (much) more intermediate steps, but data are loaded into memory only when R actually needs them, for example for plotting or running a statistical analysis.
Spatial Data
Here we will be looking at sf, stars (for rasters) and dbplyr (as in databases…). This is a little more advanced and specialised, so I won’t go into much detail, but here are a few things I liked. Cropping a spatial object even before loading it into R, using the wkt_filter argument of sf::st_read():
library(sf)

file <- "data/nc.gpkg"
c(xmin = -82, ymin = 36, xmax = -80, ymax = 37) |>
  st_bbox() |> st_as_sfc() |> st_as_text() -> bb

st_read(file, wkt_filter = bb) |> nrow()
# 17 # out of 100
Even easier if you can write SQL queries directly:
<- "select BIR74,SID74,geom from 'nc.gpkg' where BIR74 > 1500"
q read_sf(file, query = q) |> nrow()
61 # out of 100
Using stars, you can read a raster file without loading it into memory. This is quite similar to arrow in the previous section: a 100+ Mbytes file results in a 12 Mbytes object in memory in R.
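As a small illustration (the file name is made up), reading a raster as a proxy object keeps the pixel values on disk:
library(stars)

# Hypothetical large GeoTIFF; proxy = TRUE reads only the metadata
# (dimensions, bands, CRS), not the pixel values
r <- read_stars("data/big_raster.tif", proxy = TRUE)

r                # prints dimensions and band information
object.size(r)   # small, the values stayed on disk
plot(r)          # values are read (and downsampled) only when actually needed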
Other interesting tools
It seems that packages dedicated to running statistical models (lm, glm, etc.) directly on data sets too big for memory were a thing a few years ago, but I can’t find recent packages targeting this problem… biglm hasn’t been updated since 2020.
Great resources
dtplyr
Arrow and dplyr
- https://arrow.apache.org/docs/r/articles/arrow.html#analyzing-arrow-data-with-dplyr
- https://arrow.apache.org/cookbook/r/index.html
- https://arrow-user2022.netlify.app/hello-arrow
- https://hbs-rcs.github.io/large_data_in_R/#solution-example # examples
- https://jthomasmock.github.io/arrow-dplyr/#/ # presentation
- https://posit-conf-2023.github.io/arrow/materials/5_arrow_single_file.html#/single-file-api # presentation
- https://www.r-bloggers.com/2021/09/understanding-the-parquet-file-format/ #parquet data format
Spatial data
Happy to talk about it, share experiences, help you implement something (don’t you also think dtplyr
sounds like a great and easy tool?!) and hear your comments!
Best wishes,
Alban