TestGenerator helps you test pharmacoepidemiological study code against a small, explicit OMOP CDM test population. The typical workflow is:

1. Define a micro population of patients in an Excel or CSV input file.
2. Convert the input file to a JSON test case with readPatients().
3. Build a test CDM containing that population with patientsCDM().
4. Run your study code against the test CDM and assert the expected results with testthat.

This vignette uses the ICU sample population included with the package.
An Excel input file should contain one sheet per OMOP CDM table. For
example, the sheet names can include person,
observation_period, visit_occurrence,
condition_occurrence, drug_exposure, and
measurement.
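Before converting a workbook, it can be useful to confirm that its sheet names match the CDM table names. A minimal sketch, assuming the readxl package is installed and that file_path points at your workbook:

```r
library(readxl)

# List the sheet names in the workbook; each sheet should be
# named after an OMOP CDM table (person, observation_period, ...).
sheets <- excel_sheets(file_path)
print(sheets)
```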
```r
library(TestGenerator)

file_path <- system.file(
  "extdata",
  "icu_sample_population.xlsx",
  package = "TestGenerator"
)

output_path <- file.path(tempdir(), "testgenerator-example")
dir.create(output_path, showWarnings = FALSE, recursive = TRUE)

readPatients(
  filePath = file_path,
  testName = "icu_sample",
  outputPath = output_path,
  cdmVersion = "5.4"
)
```

This writes icu_sample.json to output_path.
Keeping these JSON files in tests/testthat/testCases makes
them easy to reuse from package tests. When
outputPath = NULL, TestGenerator writes to that default
test case folder.
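For example, when working inside a package, a call like the following sketch (the same arguments as above, but with outputPath left as NULL) would write the JSON test case directly into tests/testthat/testCases:

```r
# Sketch: with outputPath = NULL, the JSON test case is written
# to the default folder tests/testthat/testCases.
readPatients(
  filePath = file_path,
  testName = "icu_sample",
  outputPath = NULL,
  cdmVersion = "5.4"
)
```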
Use patientsCDM() to create a CDM reference containing
the small patient population and a complete vocabulary. By default, the
CDM is created in DuckDB.
```r
cdm <- patientsCDM(
  pathJson = output_path,
  testName = "icu_sample",
  cdmVersion = "5.4"
)

cdm[["person"]]
```

If pathJson = NULL, TestGenerator looks for JSON files in tests/testthat/testCases.
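Inside a package test this default keeps the call short; a sketch:

```r
# Sketch: pathJson = NULL picks up JSON test cases from
# tests/testthat/testCases automatically.
cdm <- patientsCDM(
  pathJson = NULL,
  testName = "icu_sample",
  cdmVersion = "5.4"
)
```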
Once the test CDM is available, run the same study code you use on a
real CDM. The package includes example cohort definitions under
inst/extdata/test_cohorts.
```r
library(CDMConnector)
library(dplyr)
library(testthat)

test_cohorts <- system.file(
  "extdata",
  "test_cohorts",
  package = "TestGenerator"
)

cohort_set <- readCohortSet(test_cohorts)

cdm <- generateCohortSet(
  cdm = cdm,
  cohortSet = cohort_set,
  name = "test_cohorts"
)

cohort_attrition <- attrition(cdm[["test_cohorts"]])

excluded_records <- cohort_attrition |>
  pull(excluded_records) |>
  sum()

expect_equal(excluded_records, 0)
```

In a package test, place this code in tests/testthat/test-*.R and assert the specific counts, dates, durations, or intersections that your study should produce for the micro population.
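A sketch of what such a test file might look like; the file name and the expected count are placeholders for your own study's expectations:

```r
# tests/testthat/test-attrition.R (hypothetical file name)
test_that("no records are excluded for the ICU micro population", {
  cdm <- TestGenerator::patientsCDM(
    pathJson = NULL,  # read JSON cases from tests/testthat/testCases
    testName = "icu_sample",
    cdmVersion = "5.4"
  )
  cohort_set <- CDMConnector::readCohortSet(test_cohorts)
  cdm <- CDMConnector::generateCohortSet(
    cdm = cdm,
    cohortSet = cohort_set,
    name = "test_cohorts"
  )

  excluded <- attrition(cdm[["test_cohorts"]]) |>
    dplyr::pull(excluded_records) |>
    sum()
  expect_equal(excluded, 0)  # placeholder expectation
})
```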
graphCohort() can help inspect whether cohort
intersections and timing look as expected for a single subject.
```r
diazepam <- cdm[["test_cohorts"]] |>
  filter(cohort_definition_id == 1) |>
  collect()

hospitalisation <- cdm[["test_cohorts"]] |>
  filter(cohort_definition_id == 2) |>
  collect()

icu_visit <- cdm[["test_cohorts"]] |>
  filter(cohort_definition_id == 3) |>
  collect()

graphCohort(
  subject_id = 4,
  cohorts = list(
    diazepam = diazepam,
    hospitalisation = hospitalisation,
    icu_visit = icu_visit
  )
)
```

If you want to design a new test population from scratch, create an Excel workbook with the required CDM table columns.
```r
generateTestTables(
  tableNames = c(
    "person",
    "observation_period",
    "visit_occurrence",
    "condition_occurrence",
    "drug_exposure",
    "measurement"
  ),
  cdmVersion = "5.4",
  outputFolder = output_path,
  filename = "my_test_population"
)
```

Fill in the workbook rows for the small set of patients needed by your test, then pass the completed workbook to readPatients().
For CSV inputs, place one file per CDM table in a folder. File names
should match the table names, for example person.csv and
observation_period.csv.
```r
csv_path <- system.file(
  "extdata",
  "mimic_sample",
  package = "TestGenerator"
)

readPatients.csv(
  filePath = csv_path,
  testName = "mimic_sample",
  outputPath = output_path,
  cdmVersion = "5.4"
)
```

For source datasets with very large integer identifiers, set reduceLargeIds = TRUE.
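For instance, a MIMIC-style extract with long identifiers could be converted as in this sketch, which reuses the paths above; the test name is a placeholder:

```r
# Sketch: reduceLargeIds = TRUE handles source data whose
# integer identifiers are too large for the test CDM.
readPatients.csv(
  filePath = csv_path,
  testName = "mimic_sample_reduced",  # placeholder name
  outputPath = output_path,
  cdmVersion = "5.4",
  reduceLargeIds = TRUE
)
```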
DuckDB is the default and is usually enough for unit tests. When you
need to test SQL translation on another backend,
patientsCDM() can also create a test CDM in Spark, SQL
Server, or PostgreSQL.
```r
cdm <- patientsCDM(
  pathJson = output_path,
  testName = "icu_sample",
  cdmVersion = "5.4",
  dbms = "postgresql"
)

# Drop the remote test schema and disconnect when finished.
cleanupTestCdm(cdm)
```

Remote database connections require the relevant environment variables to be configured before calling patientsCDM().
| Backend | Required environment variables |
|---|---|
| Spark | DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_HTTPPATH |
| SQL Server | DARWIN_SQLSERVER_SERVER, DARWIN_SQLSERVER_DBNAME, DARWIN_SQLSERVER_PORT, DARWIN_SQLSERVER_USER, DARWIN_SQLSERVER_PASSWORD |
| PostgreSQL | DARWIN_POSTGRESQL_SERVER, DARWIN_POSTGRESQL_DBNAME, DARWIN_POSTGRESQL_PORT, DARWIN_POSTGRESQL_USER, DARWIN_POSTGRESQL_PASSWORD |
Spark also reads DATABRICKS_USER and
DATABRICKS_WORKSPACE when they are set. If they are not
set, TestGenerator uses token as the Databricks user and
hive_metastore as the workspace/catalog. SQL Server reads
SQL_SERVER_DRIVER when it is set; otherwise it uses
ODBC Driver 18 for SQL Server.
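The variables can be set in .Renviron or, for a quick interactive session, with Sys.setenv(). A sketch for PostgreSQL; every value below is a placeholder for your own connection details:

```r
# Placeholder credentials for a PostgreSQL test connection.
# Replace each value with your own server details.
Sys.setenv(
  DARWIN_POSTGRESQL_SERVER   = "localhost",
  DARWIN_POSTGRESQL_DBNAME   = "cdm_test",
  DARWIN_POSTGRESQL_PORT     = "5432",
  DARWIN_POSTGRESQL_USER     = "test_user",
  DARWIN_POSTGRESQL_PASSWORD = "test_password"
)
```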