Tech news 9: Testing data

Categories: experience, R, package, technews

Published: March 15, 2024

Why test data?

Looking inside the data you receive and produce is an absolute necessity, but if you could have a second pair of eyes able to scan millions of rows in a fraction of a second, as often as needed, why not? Moreover, building your tests, i.e. data checks, requires you to think ahead and define your expectations for the data, which helps you control the workflow and ensure data quality at each step.

How?

In software development, unit tests check that a function always behaves as expected. Unit tests can be extended to check data too, e.g.:

  • Are column names equal to “year” and “site” AND are all “year” values positive integers > 1998 AND is the “site” column of type factor?
  • Are the X trait values for species 1 in the range 10:20 AND in the range 12:30 for species 2?
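As a minimal base-R sketch of such checks (assuming a hypothetical data frame d with columns year, site, species and a trait column x):

# hypothetical data frame d with columns year, site, species and trait x
stopifnot(
  identical(names(d)[1:2], c("year", "site")),                 # expected column names
  is.integer(d$year), all(d$year > 1998),                      # positive integers > 1998
  is.factor(d$site),                                           # site is a factor
  all(d$x[d$species == 1] >= 10 & d$x[d$species == 1] <= 20),  # range for species 1
  all(d$x[d$species == 2] >= 12 & d$x[d$species == 2] <= 30)   # range for species 2
)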

By scripting as many tests as needed, we effortlessly make sure that input data quality is optimal and that it stays correct throughout data processing. It is also more reproducible and time-saving than visual inspection.

assertr for inline/in-workflow testing

An example of input data quality control from the assertr documentation:

Let’s say, for example, that R’s built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding. Pretend we wanted to find the average miles per gallon for each number of engine cylinders. We might want to first confirm:

  • that it has the columns “mpg”, “vs”, and “am”
  • that the dataset contains more than 10 observations
  • that the column for ‘miles per gallon’ (mpg) is a positive number
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
  • that the “am” and “vs” columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • that each row contains at most 2 NAs
  • that each row is unique jointly between the “mpg”, “am”, and “wt” columns
  • that each row’s mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

This could be written (in order) using assertr like this:

library(dplyr)
library(assertr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))

If any of these assertions were violated, an error would have been raised and the pipeline would have been terminated before the calculation happened.
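For instance, a deliberately failing check (a sketch, not taken from the assertr documentation) shows this behaviour:

# mtcars has 32 rows, so this verification fails: assertr raises an error
# and the summarise() step is never reached
mtcars %>%
  verify(nrow(.) > 100) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))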

In this workflow, assertr was used for inline testing: it is immediately apparent, transparent and clear. But maybe you would rather have a suite of tests in a separate script that you would source(), or that you would call with the dedicated testthat::test_file(), whose output is enriched with clear and helpful error messages.

Test suites called by testthat

testthat was developed for software unit testing and is the reference for R package development, but extensions make it a highly efficient data testing tool too. The usual testthat test suite is structured around expectations:

  • the rpois function is expected to error if argument n is NA
  • the rpois function is expected to return a vector of positive integers without NAs
library(testthat)

test_that("rpois behaves as expected", {
  expect_error(rpois(n = NA, 10))

  expect_false(anyNA(rpois(n = 5, 10)))
  expect_vector(rpois(n = 5, 10), size = 5)
  expect_gte(min(rpois(n = 5, 10)), 0)
  expect_type(rpois(n = 5, 10), "integer")
})

Now, if we could have more data-oriented tests, easily applied to data frames, it would feel more natural than multiplying testthat expectations. For example, the last four testthat expectations above could be rewritten as a single checkmate expectation:

checkmate::expect_integer(rpois(n = 5, 10), lower = 0, len = 5, any.missing = FALSE)

Another useful checkmate function, expect_names(), checks column names; note that it takes the names to check as its first argument (dat below stands for a hypothetical data frame). Expecting all column names, in any order:

expect_names(
  colnames(dat),
  permutation.of = c("region", "site", "plot", "year", "month"),
  what = "colnames"
)

Or expecting all names AND in order:

expect_names(
  colnames(dat),
  identical.to = c("region", "site", "plot", "year", "month"),
  what = "colnames"
)

Or allowing only a subset of a given set of names:

expect_names(
  colnames(dat),
  subset.of = c("region", "site", "plot", "year", "month"),
  what = "colnames"
)

Checkmate was developed with a focus on helpful error messages and efficiency:

Virtually every standard type of user error when passing arguments into a function can be caught with a simple, readable line which produces an informative error message in case of failure. A substantial part of the package was written in C to minimize any worries about execution time overhead. Furthermore, the package provides over 30 expectations to extend the popular testthat package for unit tests.
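To illustrate (a sketch with a hypothetical function, not taken from the checkmate documentation), the same assertion style can guard function arguments at run time:

library(checkmate)

# hypothetical function with guarded arguments
mean_mpg_per_cyl <- function(df) {
  assert_data_frame(df, min.rows = 1)
  assert_names(names(df), must.include = c("mpg", "cyl"))
  assert_numeric(df$mpg, lower = 0, any.missing = FALSE)
  tapply(df$mpg, df$cyl, mean)
}

mean_mpg_per_cyl(mtcars)       # passes all checks
mean_mpg_per_cyl(data.frame()) # fails early with an informative message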

Now, let’s use testdat, another extension to testthat, to build a testing suite for data. First, the testdat authors give an example workflow in which the user creates data objects and checks them visually and interactively:

library(dplyr)

x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1,   2000,   "NSW",  1,
  2,   3123,   "VIC",  NA,
  3,   2123,   "NSW",  3,
  4,   12345,  "VIC",  3
)

# check id is unique
x %>% filter(duplicated(id))

# check values
x %>% filter(!pcode %in% 2000:3999)
x %>% count(state)
x %>% count(nsw_only)

# check base for nsw_only variable
x %>% filter(state != "NSW") %>% count(nsw_only)

x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))

x %>% count(market)

And then using testdat:

library(testdat)
library(dplyr)

x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1,   2000,   "NSW",  1,
  2,   3123,   "VIC",  NA,
  3,   2123,   "NSW",  3,
  4,   12345,  "VIC",  3
)

with_testdata(x, {
  test_that("id is unique", {
    expect_unique(id)
  })
  
  test_that("variable values are correct", {
    expect_values(pcode, 2000:2999, 3000:3999)
    expect_values(state, c("NSW", "VIC"))
    expect_values(nsw_only, 1:3) # by default expect_values allows NAs
  })
  
  test_that("filters applied correctly", {
    expect_base(nsw_only, state == "NSW")
  })
})
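Because testdat expectations default to a registered test data set, the suite above can also live in its own file and be run against freshly loaded data. A sketch (set_testdata() and testthat::test_file() are real functions; the file names are hypothetical):

library(testdat)
library(testthat)

x <- read.csv("delivery.csv")     # hypothetical input file
set_testdata(x)                   # register x as the default test data
test_file("test-data-quality.R")  # file containing the expectations above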

Attention is then needed only if a test fails. These tests can be stored in a separate script and called with the data they apply to, as sketched above. Finally, in heavy data workflows where input data, received from data providers or automatic loggers, follows a limited number of formats, full data-check suites can be run automatically on reception, with rich and parametrisable reporting: this is what pointblank was developed for.

Checking the data your data providers send/your automatic inputs

You can also use pointblank in your script as shown here:

library(pointblank)
dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE,
    actions = warn_on_fail()
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b),
    actions = warn_on_fail()
  ) %>%
  col_is_numeric(
    c(a, b),
    actions = warn_on_fail()
  )
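Note that with actions = warn_on_fail(), a failing step only emits a warning and the rest of the pipeline still runs; if no actions are given when validating a data table directly like this, pointblank stops at the first failing step.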

But a better use for it is creating so-called agents that can automatically and reproducibly test data and provide rich visual reports, like this:

# define the agent’s failure thresholds (illustrative fractions of failing test units)
al <- action_levels(
  warn_at = 0.1,
  stop_at = 0.2,
  notify_at = 0.3
)

agent <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_agent(
    label = "A very *simple* example.",
    actions = al
  ) %>%
  col_vals_between(
    vars(a), 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(vars(a, b)) %>%
  interrogate()

print(agent)
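Printing the agent displays the validation report. The agent object can also be queried programmatically, e.g. (a sketch):

all_passed(agent)            # TRUE only if every validation step passed
x <- get_agent_x_list(agent) # step-by-step results in list form
x$warn                       # which steps crossed the warn threshold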

And the pointblank package currently supports PostgreSQL, MySQL, MariaDB, Microsoft SQL Server, Google BigQuery, DuckDB, SQLite, and Spark DataFrames!
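As a sketch of such remote use (assuming an existing DBI connection con; the table and column names are hypothetical), an agent can interrogate a database table without pulling it into memory:

# assumes an existing DBI connection `con`
remote_tbl <- dplyr::tbl(con, "measurements")

create_agent(tbl = remote_tbl, label = "Remote data checks") %>%
  col_is_numeric(vars(value)) %>%
  interrogate()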

Resources

https://ropensci.r-universe.dev/assertr#

https://testthat.r-lib.org/

https://socialresearchcentre.r-universe.dev/testdat#

https://socialresearchcentre.github.io/testdat/articles/testdat.html

https://rstudio.r-universe.dev/pointblank#