---
title: "Extracting Phenotype Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Extracting Phenotype Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = FALSE
)
```

## Overview

UK Biobank (UKB) phenotype data is stored in a proprietary `.dataset` format on the Research Analysis Platform (RAP) and cannot be read directly. The `extract_*` functions provide R interfaces for discovering approved fields and extracting phenotype data via the DNAnexus `dx extract_dataset` and `table-exporter` tools.

Two workflows are available:

| Function | Mode | Scale | Output |
|---|---|---|---|
| `extract_batch()` | Async job | Large / production (typically 50+ fields) | job ID → CSV on RAP cloud |
| `extract_pheno()` | Synchronous | Small (quick checks) | data.table in memory |

**`extract_batch()` is the recommended approach** for any analysis you intend to save or reproduce. `extract_pheno()` is provided only for quick interactive inspection inside the RAP environment.

---

## Prerequisites

Ensure you are authenticated and have selected your project:

```{r auth}
library(ukbflow)

auth_login()
auth_select_project("project-XXXXXXXXXXXX")
```

---

## Step 1: Browse Available Fields

Before extracting, use `extract_ls()` to see which fields are approved for your project:

```{r extract-ls}
# List all approved fields (cached after first call)
extract_ls()

# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p21022")

# Force refresh after switching projects or datasets
extract_ls(refresh = TRUE)
```

The result is a data.frame with two columns:

| Column | Example |
|---|---|
| `field_name` | `participant.p53_i0` |
| `title` | `Date of attending assessment centre \| Instance 0` |

> Fields reflect your project's approved data only — not all UKB fields are present.
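
If you want to turn a search result into a vector of numeric field IDs, the `field_name` column can be parsed with base R. The following is a minimal sketch, assuming every field follows the `participant.p<id>` naming shown in the table above. The `fields` data frame is a hand-written stand-in for `extract_ls()` output, not real output.

```{r field-ids-sketch}
# Mock of the two-column result described above; in practice this would come
# from extract_ls(). The rows are illustrative only.
fields <- data.frame(
  field_name = c("participant.eid", "participant.p31",
                 "participant.p53_i0", "participant.p20002_i0_a0"),
  title      = c("Participant ID", "Sex",
                 "Date of attending assessment centre | Instance 0",
                 "Non-cancer illness code, self-reported | Instance 0 | Array 0"),
  stringsAsFactors = FALSE
)

# Keep only names of the form participant.p<digits>..., then pull out the digits
is_field  <- grepl("^participant\\.p[0-9]+", fields$field_name)
field_ids <- unique(as.integer(
  sub("^participant\\.p([0-9]+).*$", "\\1", fields$field_name[is_field])
))

field_ids
```

The resulting integer vector has the same shape as the numeric field IDs passed to the extraction functions in Step 2.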

---

## Step 2: Extract Data

### Recommended: `extract_batch()`

For large-scale or production extractions, submit an asynchronous table-exporter job on the RAP cloud:

```{r extract-batch}
# Submit extraction job
job_id <- extract_batch(c(31, 53, 21022, 22189))

# Custom output name
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  file     = "ukb_demographics"
)

# High priority (faster queue, higher cost)
job_id <- extract_batch(
  field_id = c(31, 53, 21022, 22189),
  priority = "high"
)
```

The output CSV is saved to your RAP project. Use the `job_*` functions to monitor the job and retrieve the result:

```{r job-monitor}
job_status(job_id)        # check progress
job_path(job_id)          # get cloud file path once complete
job_result(job_id)        # read result as data.table (inside RAP only)
```
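
Because the job is asynchronous, a script that needs the result has to wait for it. The sketch below shows one way to poll until a job finishes. `wait_for_job()` is a hypothetical helper, not part of ukbflow, and it assumes the status check returns a single state string such as the DNAnexus job states `"done"`, `"failed"`, or `"terminated"`. The return format of `job_status()` is not specified here, so check `?job_status` before adapting it.

```{r job-poll-sketch}
# Hypothetical helper, not part of ukbflow. `get_state` is any zero-argument
# function that returns the current job state as a single string.
wait_for_job <- function(get_state,
                         terminal   = c("done", "failed", "terminated"),
                         interval   = 30,
                         max_checks = 120) {
  for (k in seq_len(max_checks)) {
    state <- get_state()
    if (state %in% terminal) return(state)
    Sys.sleep(interval)  # pause before the next check
  }
  stop("Job did not reach a terminal state after ", max_checks, " checks")
}

# Stand-in for a real status call so the sketch runs on its own:
# reports "running" twice, then "done".
make_fake_status <- function(states) {
  n <- 0
  function() {
    n <<- n + 1
    states[min(n, length(states))]
  }
}

fake_status <- make_fake_status(c("running", "running", "done"))
final_state <- wait_for_job(fake_status, interval = 0)
final_state
```

With a real job, `get_state` could be a small wrapper such as `function() job_status(job_id)`, provided the assumption about the returned value holds.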

#### Instance type

`extract_batch()` automatically selects an appropriate instance based on the number of columns:

| Columns | Instance |
|---|---|
| ≤ 20 | `mem1_ssd1_v2_x4` |
| 21–100 | `mem1_ssd1_v2_x8` |
| 101–500 | `mem1_ssd1_v2_x16` |
| > 500 | `mem1_ssd1_v2_x36` |

You can override this with the `instance_type` argument if needed.
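
To make the thresholds concrete, the selection rule in the table can be written as a small function. This is only a sketch of the documented mapping: `pick_instance()` is a hypothetical name, and the package's internal logic may differ.

```{r instance-sketch}
# Hypothetical helper mirroring the table above; not part of ukbflow
pick_instance <- function(n_cols) {
  stopifnot(is.numeric(n_cols), length(n_cols) == 1, n_cols >= 1)
  if (n_cols <= 20) {
    "mem1_ssd1_v2_x4"
  } else if (n_cols <= 100) {
    "mem1_ssd1_v2_x8"
  } else if (n_cols <= 500) {
    "mem1_ssd1_v2_x16"
  } else {
    "mem1_ssd1_v2_x36"
  }
}

pick_instance(4)    # falls in the smallest tier
pick_instance(150)  # falls in the 101-500 tier
```

Note that the thresholds apply to columns rather than fields. A field with several instances or arrays may contribute more than one column.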

---

### Quick inspection: `extract_pheno()`

For small-scale interactive checks **inside the RAP RStudio environment**:

```{r extract-pheno}
df <- extract_pheno(c(31, 53, 21022))
```

> `extract_pheno()` is restricted to the RAP environment and returns data in memory only. For any analysis intended to be saved or reproduced, use `extract_batch()`.

Note: `extract_pheno()` returns **raw coded values** (e.g. `1`/`0` for Sex, numeric codes for diseases). Use the `decode_*` series to convert codes to human-readable labels.
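
To illustrate what "raw coded" means, here is a manual base-R recode of Sex (field 31), which UK Biobank codes as `0` = Female and `1` = Male. This is an illustration only, using made-up values. It does not use the `decode_*` functions, whose interface is not covered in this vignette.

```{r recode-sketch}
# Made-up raw values as they might appear in a p31 column
sex_raw <- c(0L, 1L, 1L, 0L)

# Map the numeric codes to labels by hand
sex <- factor(sex_raw, levels = c(0, 1), labels = c("Female", "Male"))

table(sex)
```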

---

## A Note on Column Names

Column naming differs between the two extraction methods:

**`extract_batch()`** — no prefix:

| Column | Meaning |
|---|---|
| `eid` | Participant ID |
| `p31` | Field 31 (Sex) |
| `p53_i0` | Field 53, Instance 0 |
| `p20002_i0_a0` | Field 20002, Instance 0, Array 0 |

**`extract_pheno()`** — `participant.` prefix:

| Column | Meaning |
|---|---|
| `participant.eid` | Participant ID |
| `participant.p31` | Field 31 (Sex) |
| `participant.p53_i0` | Field 53, Instance 0 |
| `participant.p20002_i0_a0` | Field 20002, Instance 0, Array 0 |
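
If you want code written against `extract_batch()` output to also work on an `extract_pheno()` result, one option is to strip the prefix. A minimal sketch in base R, using a made-up data frame in place of real output:

```{r colnames-sketch}
# Mock of an extract_pheno()-style result; the values are invented
df <- data.frame(
  participant.eid    = c(1000001L, 1000002L),
  participant.p31    = c(0L, 1L),
  participant.p53_i0 = c("2008-03-14", "2009-11-02"),
  stringsAsFactors   = FALSE
)

# Drop the leading "participant." so names match the extract_batch() CSV
names(df) <- sub("^participant\\.", "", names(df))

names(df)
```

The pattern is anchored at the start of the name, so columns without the prefix are left unchanged.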

---

## Getting Help

- `?extract_ls`, `?extract_pheno`, `?extract_batch`
- `vignette("auth")` — authentication setup
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)
