---
title: "DSIR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{DSIR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>", 
  eval     = TRUE
)
```

```{r setup}
library(DSIR)
library(dplyr)
library(ggplot2)
```

<img src="../man/figures/logo.jpg" align="right" height="120" alt="DSIR logo" />

DSIR is a small R package for global health data work. It bundles 
WHO Member State metadata, lightweight clients for the GHO and UN SDG 
APIs, and reusable WHO-style `ggplot2` and `flextable` themes. DSIR 
is designed for health professionals, WHO staff, and global health 
researchers who repeat the same routine data tasks day after day.

This vignette walks through the typical workflow: looking up countries, 
fetching data from GHO and SDG, cleaning the raw response, and 
producing publication-style charts and tables.

## WHO Member State metadata

The `who_countries` tibble lists all 194 WHO Member States with their 
ISO3, ISO2, UN M49 codes, official names, short names, and WHO region. 
For Western Pacific countries, an extra column `is_pic` identifies the 
14 Pacific Island Countries.

```{r}
who_countries
```

For convenience, DSIR offers pre-defined vectors of ISO3 codes for 
each WHO region.

```{r}
wpro_cty
length(wpro_cty)   # 28 Member States in WPR (since May 2025)
```

The `is_pic` flag is useful because Pacific Island Countries are often 
analysed as a group, given their distinct demographic and geographic 
profiles.

```{r}
who_countries |>
  filter(is_pic) |>
  select(iso3, name_short)
```

When you have a vector of ISO3 codes and need to know which WHO 
region each belongs to, `iso3_to_region()` provides the lookup. 
It is vectorised and returns `NA` for codes that do not match a 
WHO Member State.

```{r}
iso3_to_region(c("PHL", "FRA", "ZAF", "USA", "XYZ"))
# "WPR" "EUR" "AFR" "AMR" NA
```

This is convenient when joining external datasets (which often arrive
keyed only by ISO3) to the WHO regional structure.
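
A minimal base-R sketch of such a join, using a small hand-made lookup table instead of `who_countries`. The `toy_lookup` and `external` objects are hypothetical stand-ins and not part of DSIR:

```{r}
# Hypothetical lookup table: ISO3 code -> WHO region code
toy_lookup <- data.frame(
  iso3   = c("PHL", "FRA", "ZAF"),
  region = c("WPR", "EUR", "AFR")
)

# Hypothetical external dataset keyed only by ISO3
external <- data.frame(
  iso3  = c("FRA", "PHL", "XYZ"),
  cases = c(120, 85, 10)
)

# match() is vectorised and yields NA for codes missing from the lookup
external$region <- toy_lookup$region[match(external$iso3, toy_lookup$iso3)]
external
```

With DSIR loaded, `iso3_to_region(external$iso3)` plays the role of the `match()` line.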

The companion helper `iso3_to_m49()` converts ISO3 codes to UN M49
numeric codes — useful because the WHO GHO API is keyed by ISO3
(`"PHL"`) while the UN SDG API is keyed by M49 (`"608"`). The M49
values are returned as three-character zero-padded strings, exactly
as stored in `who_countries$m49_code`.

```{r}
iso3_to_m49(c("PHL", "FRA", "JPN"))
# "608" "250" "392"

# Case-insensitive; non-Member areas return NA
iso3_to_m49(c("phl", "PRI"))
# "608" NA
```
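
Keeping M49 codes as zero-padded strings matters because several codes begin with zeros, which are lost if the codes are read as numbers. A small base-R illustration follows; the `m49_numeric` vector is made up for the example:

```{r}
# M49 codes as they might arrive from a spreadsheet that stored them as numbers
m49_numeric <- c(4, 608, 50)   # 4 = Afghanistan, 608 = Philippines, 50 = Bangladesh

# Restore the three-character, zero-padded form
m49_padded <- sprintf("%03d", as.integer(m49_numeric))
m49_padded
nchar(m49_padded)
```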

In practice you can usually skip the explicit conversion: `sdg_data()`
and `sdg_coverage()` accept ISO3 codes for their `area` argument and
do the lookup internally (see the SDG section below).

## Checking availability before fetching

GHO has thousands of indicators, but a given indicator may not cover
the countries or years you need. Before issuing a full download with
`gho_data()`, three lightweight helpers let you ask the server what is
available without downloading the full set of observations.

`gho_has_data()` is a quick yes / no for a given indicator and filter —
useful when screening a list of candidate indicators.

```{r}
# Does WHO have life-expectancy data for France?
gho_has_data("WHOSIS_000001", area = "FRA")
# TRUE

# Bulk-screen several indicators at once
inds <- c("WHOSIS_000001", "NCDMORT3070", "MDG_0000000026")
vapply(inds, gho_has_data, logical(1), area = "PHL")
```

It returns `TRUE`, `FALSE`, or `NA` (for request failures, including a
non-existent indicator code — GHO returns HTTP 404 in that case).

`gho_count()` returns the number of rows the same filter would produce,
which is useful for sizing a download.

```{r}
gho_count("WHOSIS_000001", area = wpro_cty)
```

`gho_coverage()` summarises year coverage and observation counts per
country. The payload is small because only `SpatialDim` and `TimeDim`
are requested from the server.

```{r}
gho_coverage("WHOSIS_000001", area = c("FRA", "DEU", "JPN"))
#>   location year_min year_max n_obs
#> 1 DEU          2000     2021    66
#> 2 FRA          2000     2021    66
#> 3 JPN          2000     2021    66
```
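
To make the shape of that summary concrete, here is a base-R sketch that computes the same three statistics from a toy table of location/year pairs. The `toy_obs` data are invented, and this is not DSIR's implementation:

```{r}
# Toy observations: one row per (location, year) record
toy_obs <- data.frame(
  location = c("FRA", "FRA", "FRA", "JPN", "JPN"),
  year     = c(2000L, 2010L, 2021L, 2005L, 2021L)
)

# Group the years by location, then summarise each group
by_loc <- split(toy_obs$year, toy_obs$location)
toy_cov <- data.frame(
  location  = names(by_loc),
  year_min  = vapply(by_loc, min, integer(1)),
  year_max  = vapply(by_loc, max, integer(1)),
  n_obs     = unname(lengths(by_loc)),
  row.names = NULL
)
toy_cov
```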

## Fetching indicator data from GHO

To fetch indicators from GHO, the typical workflow is three steps:
search for the indicator code, fetch the data, then clean the response.
The `area` argument accepts a long ISO3 vector, so a whole region can
be pulled in one call.

### Step 1: Search for an indicator

```{r}
gho_indicators("UHC") |> head()
```

Pick an `IndicatorCode` from the result — this is the value you pass 
to `gho_data()` in the next step.

### Step 2: Fetch the data

```{r}
uhc <- gho_data(
  indicator    = "UHC_INDEX_REPORTED",
  spatial_type = "country",
  area         = wpro_cty,
  year_from    = 2015
)

uhc |> glimpse()
```

Here all 28 WPR Member States are fetched in a single call, since
`area` accepts long ISO3 vectors.

### Step 3: Clean the raw response

`gho_clean()` produces the **unified DSIR cleaned-indicator schema** —
the same 15-column shape as `sdg_clean()`. Columns include `source`
(`"gho"`), `id`, `indicator`, `location`, `iso3`, `location_name`
(empty for GHO), `year`, `value`, `value_num`, `low`, `high`, `series`
(empty for GHO), and the three optional GHO dimensions
`dim1`–`dim3`. Columns absent from the raw response are filled with
typed `NA`.

```{r}
uhc_clean <- gho_clean(uhc)
uhc_clean
```
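
The idea of filling absent columns with typed `NA` can be pictured with a short base-R sketch. The schema below is a hypothetical five-column subset, the values are toy numbers, and the code is an illustration and not the body of `gho_clean()`:

```{r}
# Hypothetical target schema: column name -> NA of the intended type
toy_schema <- list(
  iso3      = NA_character_,
  year      = NA_integer_,
  value_num = NA_real_,
  low       = NA_real_,
  series    = NA_character_
)

# Toy raw response that lacks `low` and `series`
toy_raw <- data.frame(
  iso3      = c("AAA", "BBB"),
  year      = c(2020L, 2021L),
  value_num = c(1, 2)
)

# Add each missing column as an NA vector of the schema's type,
# then put the columns in schema order
for (col in setdiff(names(toy_schema), names(toy_raw))) {
  toy_raw[[col]] <- rep(toy_schema[[col]], nrow(toy_raw))
}
toy_filled <- toy_raw[names(toy_schema)]
vapply(toy_filled, class, character(1))
```

Using typed `NA` values means every column keeps a predictable class even when it is entirely missing.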

## Aggregating indicators with geomean()

Some health indicators are constructed as the geometric mean of 
component values rather than the arithmetic mean. The UHC Service 
Coverage Index, for example, aggregates 14 tracer indicators using 
nested geometric means. DSIR provides `geomean()` for this:

```{r}
# Unweighted geometric mean
geomean(c(0.6, 0.8, 0.95))
#> 0.7697002

# With optional weights — useful when tracers have different 
# methodological importance
geomean(c(0.6, 0.8, 0.95), w = c(2, 1, 1))
```

`geomean()` handles missing values, zeros, and negative values 
sensibly — see `?geomean` for details. It is a small helper, but 
it removes a common source of bugs when re-implementing index 
calculations from indicator components.
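
For reference, the unweighted and weighted geometric means can be written in base R via logarithms. This is only a sketch of the textbook formula for strictly positive inputs. It is not necessarily how `geomean()` is implemented, and it skips the missing-value, zero, and negative-value handling described in `?geomean`. The helper name `gm_sketch()` is hypothetical:

```{r}
# Geometric mean as exp of the (weighted) mean of logs; positive inputs only
gm_sketch <- function(x, w = rep(1, length(x))) {
  stopifnot(all(x > 0), length(w) == length(x))
  exp(sum(w * log(x)) / sum(w))
}

tracers <- c(0.6, 0.8, 0.95)
gm_sketch(tracers)                  # equals (0.6 * 0.8 * 0.95)^(1/3)
gm_sketch(tracers, w = c(2, 1, 1))  # first tracer counted twice
```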

## Plotting with theme_dsi() and theme_dsi_facet()

DSIR provides two paired `ggplot2` themes tuned for WHO-style charts — 
clean panels, modest grids, and a consistent accent colour. Use them as 
drop-in replacements for `theme_minimal()` and `theme_bw()` respectively 
whenever a chart is heading into a WHO deliverable.

The rule of thumb is simple: **single-panel plots use `theme_dsi()`, 
faceted plots use `theme_dsi_facet()`**. The two share typography, title 
treatment, and legend handling, but differ in how they frame the data — 
the facet variant adds panel borders, light strip backgrounds, and 
breathing room between panels, all of which would look heavy on a 
single-panel chart.

### Single panel: `theme_dsi()`

`theme_dsi()` keeps the chart chrome minimal — a half-frame axis, light grid lines, and the WHO-blue accent on the axis line. By default the grid runs in both directions; pass `grid = "y"` for the minimalist horizontal-only look.

```{r, fig.width = 7, fig.height = 4}
uhc_clean |>
  filter(iso3 %in% c("AUS", "CHN", "PHL", "FJI")) |>
  left_join(who_countries, by = "iso3") |>
  ggplot(aes(x = year, y = value_num, group = iso3, color = name_short)) +
  geom_line(linewidth = .8) +
  geom_point(size = 1.8) +
  theme_dsi() +
  labs(
    title    = "UHC Service Coverage Index, selected WPR Member States",
    subtitle = "2015 onwards",
    x = NULL, y = "SCI", color = NULL
  )
```

For bar charts, pair `theme_dsi()` with `scale_y_dsi_col()` (or `scale_x_dsi_col()` when `value` is mapped to `x`) — these are thin wrappers around `scale_*_continuous()` that remove the lower axis expansion, so columns sit flush with the axis instead of floating above it.

```{r, fig.width = 7, fig.height = 4}
uhc_clean |>
  filter(year == max(year)) |>
  left_join(who_countries, by = "iso3") |>
  arrange(desc(value_num)) |>
  head(10) |>
  ggplot(aes(reorder(name_short, value_num), value_num)) +
  geom_col(fill = "#0093D5") +
  coord_flip() +
  scale_y_dsi_col() +
  theme_dsi(grid = "x") +
  labs(
    title    = "UHC Service Coverage Index, top 10 WPR Member States",
    subtitle = "Latest available year",
    x = NULL, y = "SCI"
  )
```

### Faceted: `theme_dsi_facet()`

When the same chart is split across many small panels, the half-frame 
look becomes visually noisy — the accent-blue axis line repeats across 
every facet. `theme_dsi_facet()` switches to a full panel border, adds a 
light grey strip background to clearly mark each facet's label, and 
introduces panel spacing so adjacent panels don't run together.

```{r, fig.width = 8, fig.height = 5}
uhc_clean |>
  left_join(who_countries, by = "iso3") |>
  filter(is_pic) |>
  ggplot(aes(x = year, y = value_num)) +
  geom_line(color = "#0093D5", linewidth = 0.8) +
  geom_point(color = "#0093D5", size = 1.5) +
  facet_wrap(~ name_short, ncol = 4) +
  theme_dsi_facet() +
  labs(
    title    = "UHC Service Coverage Index, Pacific Island Countries",
    subtitle = "Each panel shows one country's trajectory",
    x = NULL, y = "SCI"
  )
```

The `strip_fill` argument lets you change the strip background colour 
for emphasis — for example, a light-blue tone derived from the WHO 
accent for a deliverable where the strips themselves carry meaning:

```{r, fig.width = 8, fig.height = 5}
uhc_clean |>
  left_join(who_countries, by = "iso3") |>
  filter(is_pic) |>
  ggplot(aes(x = year, y = value_num)) +
  geom_line(color = "#0093D5", linewidth = 0.8) +
  facet_wrap(~ name_short, ncol = 4) +
  theme_dsi_facet(strip_fill = "#E5F4FB") +
  labs(title = "UHC SCI, PIC — with custom strip colour",
       x = NULL, y = "SCI")
```

## Tables with dsi_flextable_defaults()

`dsi_flextable_defaults()` sets WHO-style defaults for `flextable` 
globally — booktabs theme, bold headers, modest padding. Call it once 
near the top of your report and every subsequent `flextable()` picks 
up the formatting.

```{r}
library(flextable)
dsi_flextable_defaults(font_family = "Georgia")

uhc_clean |>
  filter(year == max(year)) |>
  left_join(who_countries, by = "iso3") |>
  select(name_short, value_num) |>
  arrange(desc(value_num)) |>
  flextable() |>
  set_table_properties(layout = "autofit", width = .6) |>
  set_caption("UHC SCI in WPR, latest year")
```

## Working with SDG indicators

`sdg_data()` and `sdg_clean()` follow the same fetch-then-tidy
pattern as their GHO counterparts. The main differences are that
indicator codes use the dotted SDG format (e.g. `"3.4.1"`) and that
`value`, `low`, and `high` are kept as character — the SDG API
returns non-numeric entries (`"<0.1"`, aggregate notes) for some
rows, so coerce with `as.numeric()` only when you are ready to drop
them.
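
A short base-R sketch of that coercion step, using invented character values of the kind described above (not real API output):

```{r}
# Toy character values, two of which are not plain numbers
toy_values <- c("12.5", "<0.1", "7", "N")

# as.numeric() turns non-numeric strings into NA (with a warning)
toy_num <- suppressWarnings(as.numeric(toy_values))

# Inspect what would be lost before dropping it
toy_values[is.na(toy_num)]
toy_num[!is.na(toy_num)]
```

Checking the discarded strings first makes it easier to decide whether entries such as `"<0.1"` should be dropped or handled separately.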

`sdg_indicators()` accepts an optional `search` argument with the
same behaviour as `gho_indicators()` — multiple keywords are AND-ed
together and matched case-insensitively against the indicator
description. The filter runs client-side because the UN SDG
indicator list is short (~250 rows) and the endpoint is not OData.

```{r}
# All indicators that mention both mortality and cancer
sdg_indicators("mortality cancer")

# Terms can also be passed as a character vector (allows whitespace inside a term)
sdg_indicators(c("maternal", "mortality"))
```
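
The AND-ed, case-insensitive matching described above can be sketched in base R as follows. The helper `match_all_terms()` and the `toy_desc` descriptions are hypothetical, and the rule for splitting a single string on whitespace is an assumption about the behaviour, not a copy of DSIR's code:

```{r}
# Hypothetical helper: TRUE where `text` contains every search term
match_all_terms <- function(text, search) {
  # A single string is split on whitespace; a vector is used term by term
  terms <- if (length(search) == 1L) {
    strsplit(trimws(search), "\\s+")[[1]]
  } else {
    search
  }
  # One logical vector per term, comparing lower-cased literal strings
  hits <- lapply(terms, function(term) {
    grepl(tolower(term), tolower(text), fixed = TRUE)
  })
  # Combine the per-term results with a logical AND
  Reduce(`&`, hits)
}

toy_desc <- c(
  "Maternal mortality ratio",
  "Mortality rate attributed to cancer",
  "Cancer screening coverage"
)
toy_desc[match_all_terms(toy_desc, "mortality cancer")]
toy_desc[match_all_terms(toy_desc, c("maternal", "mortality"))]
```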

The `area` argument of `sdg_data()` and `sdg_coverage()` accepts
either ISO3 codes (converted internally via `iso3_to_m49()`) or UN
M49 numeric codes — so DSIR's regional vectors (`wpro_cty`,
`afro_cty`, etc.) work directly, the same way they do with the GHO
client. Do not mix the two formats in a single call.

```{r}
# ISO3 — regional vector passed straight through
sdg <- sdg_data(
  indicator = "3.4.1",
  area      = wpro_cty
)
sdg |> glimpse()

# M49 also works (e.g. when copy-pasting codes from sdg_areas())
sdg_data("3.4.1", area = c("608", "250"))
```
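
Because ISO3 and M49 codes should not be mixed in one call, it can help to check an `area` vector before passing it on. The helper below is a hypothetical pre-check written for this vignette. It is not a DSIR function, and DSIR's own validation may differ:

```{r}
# Hypothetical helper: classify an `area` vector as "iso3", "m49", or "mixed"
# ("mixed" also covers entries that match neither pattern)
area_format <- function(area) {
  is_iso3 <- grepl("^[A-Za-z]{3}$", area)
  is_m49  <- grepl("^[0-9]{1,3}$", area)
  if (all(is_iso3)) "iso3" else if (all(is_m49)) "m49" else "mixed"
}

area_format(c("PHL", "FRA"))
area_format(c("608", "250"))
area_format(c("PHL", "250"))
```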

```{r}
sdg_clean(sdg)
```

`sdg_clean()` produces the same 15-column schema as `gho_clean()`,
so the two outputs can be combined directly with `bind_indicators()`.
SDG rows populate the `series` column (and the `iso3` column via
`m49_to_iso3()` for Member States), while leaving the GHO-only
`dim1`–`dim3` columns as `NA`.

### Combining GHO and SDG with bind_indicators()

When an analysis pulls indicators from both sources, `bind_indicators()`
stacks any number of cleaned tibbles into one. The `source` column
(`"gho"` / `"sdg"`) lets you filter or facet by origin without
remembering which frame came from which API.

```{r}
# Two indicators on the same topic from different APIs:
#   GHO NCDMORT3070 (probability of premature NCD mortality)
#   SDG 3.4.1       (mortality rate from NCDs)
gho_ncd <- gho_data("NCDMORT3070", area = wpro_cty) |> gho_clean()
sdg_ncd <- sdg_data("3.4.1",        area = wpro_cty) |> sdg_clean()
bind_indicators(gho_ncd, sdg_ncd) |> glimpse()
```
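
Conceptually, stacking works because both cleaned frames share the same columns. The base-R sketch below shows the idea on two toy frames with a four-column subset of the schema. The values are placeholders, and `rbind()` stands in for `bind_indicators()` purely for illustration:

```{r}
# Two toy frames with identical columns (placeholder values only)
toy_gho <- data.frame(source = "gho", iso3 = c("AAA", "BBB"),
                      year = 2019L, value_num = c(1, 2))
toy_sdg <- data.frame(source = "sdg", iso3 = c("AAA", "BBB"),
                      year = 2019L, value_num = c(3, 4))

# Stack the rows, then use `source` to count or filter by origin
toy_all <- rbind(toy_gho, toy_sdg)
table(toy_all$source)
subset(toy_all, source == "sdg")
```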

### Exploring series with sdg_coverage()

A single SDG indicator often contains several **series** — for
example different vaccines, sex strata, or causes of death — each
with its own country and year coverage. Indicator `"3.b.1"`
(vaccine coverage) is a clear case: it is published as four
separate series (DTP3, MCV2, PCV3, HPV), and the year coverage of
the newer vaccines is much shorter than that of DTP3.

`sdg_coverage()` summarises the year range and observation count
per `(location, series)`, so you can inspect what series exist and
how each is covered before deciding which one to analyse.

```{r}
sdg_coverage("3.b.1", area = c("156", "608"))
#>   location series      year_min year_max n_obs
#> 1 156      SH_ACS_DTP3     2000     2023    24
#> 2 156      SH_ACS_HPV      2018     2023     6
#> 3 156      SH_ACS_MCV2     2000     2023    24
#> 4 156      SH_ACS_PCV3     2017     2023     7
#> 5 608      SH_ACS_DTP3     2000     2023    24
#> 6 608      SH_ACS_HPV      2017     2023     7
#> 7 608      SH_ACS_MCV2     2000     2023    24
#> 8 608      SH_ACS_PCV3     2014     2023    10
```

Note that DSIR intentionally does *not* provide SDG analogues of
`gho_has_data()` and `gho_count()`. SDG data is generally complete
enough that those screening helpers add little value — the more
useful pre-analysis question for SDG is "which series are
available?", which is what `sdg_coverage()` answers.

## Where to next

- Source code lives at <https://github.com/shanlong-who/DSIR>.
- Bug reports, feature requests, and pull requests are all welcome; 
  please open them on the GitHub repository.