---
title: "End-to-End Pipeline: From API to Multi-Sport Analysis"
author: "vald.extractor"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{End-to-End Pipeline: From API to Multi-Sport Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Introduction

The `vald.extractor` package provides a robust, production-ready pipeline for extracting, cleaning, and analyzing VALD ForceDecks data across multiple sports. This vignette demonstrates the complete workflow from API authentication to publication-ready visualizations.

### Key Problems Solved

1. **Stability**: Chunked batch processing prevents timeout errors when working with large datasets (1000+ tests)
2. **Data Cleaning**: Automated sports taxonomy mapping standardizes inconsistent team/group names
3. **Reproducibility**: All processing steps are documented and version-controlled
4. **Generic Analysis**: Test-type suffix removal enables writing analysis code once that works for all test types

## Step 1: Authentication and Data Extraction

First, set your VALD API credentials and extract test/trial data:

```{r extract-data}
library(vald.extractor)

# Set credentials
valdr::set_credentials(
  client_id     = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id     = "your_tenant_id",
  region        = "aue"
)

# Fetch data from 2020 onwards in chunks of 100 tests
vald_data <- fetch_vald_batch(
  start_date = "2020-01-01T00:00:00Z",
  chunk_size = 100,
  verbose = TRUE
)

# Extract components
tests_df <- vald_data$tests
trials_df <- vald_data$trials

cat("Extracted", nrow(tests_df), "tests and", nrow(trials_df), "trials\n")
```

**Why chunking matters**: Without chunking, extractions for large organizations (5000+ tests) are prone to API timeout errors. The chunked approach processes 100 tests at a time, and its fault-tolerant error handling logs any chunk-level failure without halting the rest of the extraction.
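
The idea can be illustrated with a minimal base R sketch. It is not the package's implementation: `fetch_chunk()` is a hypothetical stand-in for a per-chunk API call (it fails on purpose for one chunk), and `process_in_chunks()` is an illustrative name.

```{r chunking-sketch}
# Hypothetical stand-in for a per-chunk API call (not part of vald.extractor).
# It fails for any chunk containing id 150 to exercise the error path.
fetch_chunk <- function(ids) {
  if (150 %in% ids) stop("simulated API timeout")
  data.frame(testId = ids, value = ids * 2)
}

process_in_chunks <- function(ids, chunk_size = 100, fetch = fetch_chunk) {
  # Split ids into consecutive groups of at most chunk_size
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))

  # A failing chunk is logged and returns NULL; the remaining chunks still run
  results <- lapply(seq_along(chunks), function(i) {
    tryCatch(
      fetch(chunks[[i]]),
      error = function(e) {
        message("ERROR on chunk ", i, ": ", conditionMessage(e))
        NULL
      }
    )
  })

  failed <- which(vapply(results, is.null, logical(1)))
  # rbind() on data frames skips the NULL entries left by failed chunks
  list(data = do.call(rbind, results), failed_chunks = failed)
}

out <- process_in_chunks(1:250, chunk_size = 100)
nrow(out$data)      # rows from the chunks that succeeded
out$failed_chunks   # indices of chunks to retry later
```

Returning the indices of failed chunks alongside the partial data makes it possible to retry only those chunks instead of repeating the whole extraction.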

## Step 2: Fetch and Standardize Metadata

Retrieve athlete profiles and group memberships via OAuth2:

```{r fetch-metadata}
# Fetch raw metadata
metadata <- fetch_vald_metadata(
  client_id     = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id     = "your_tenant_id",
  region        = "aue"
)

# Standardize: unnest group memberships and create unified athlete records
athlete_metadata <- standardize_vald_metadata(
  profiles = metadata$profiles,
  groups   = metadata$groups
)

head(athlete_metadata)
```

### Understanding the Metadata Structure

The VALD API stores group memberships as a nested array (`groupIds`). The `standardize_vald_metadata()` function:

1. Unnests the array so each athlete-group pair gets its own row
2. Joins with group names from the Groups API
3. Collapses back to one row per athlete with all group names concatenated

**Result**: A clean metadata table with one row per athlete, where `all_group_names` holds a value such as "Football, U18, Elite" for an athlete who belongs to multiple groups.
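
The three steps can be sketched on toy data with `dplyr` and `tidyr`. The column names `groupId` and `groupName` and the toy values are assumptions for illustration; the actual API fields and the internals of `standardize_vald_metadata()` may differ.

```{r metadata-sketch}
library(dplyr)
library(tidyr)

# Toy inputs; column names other than profileId and groupIds are assumed
profiles <- tibble(
  profileId = c("a1", "b2"),
  groupIds  = list(c("g1", "g2"), "g3")
)
groups <- tibble(
  groupId   = c("g1", "g2", "g3"),
  groupName = c("Football", "U18", "Cricket")
)

athlete_groups <- profiles %>%
  # 1. One row per athlete-group pair
  unnest(cols = groupIds) %>%
  # 2. Attach the human-readable group name
  left_join(groups, by = c("groupIds" = "groupId")) %>%
  # 3. Collapse back to one row per athlete
  group_by(profileId) %>%
  summarise(all_group_names = paste(groupName, collapse = ", "),
            .groups = "drop")

athlete_groups
```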

## Step 3: Apply Sports Taxonomy

Map inconsistent team names to standardized sports categories:

```{r classify-sports}
athlete_metadata <- classify_sports(
  data = athlete_metadata,
  group_col = "all_group_names",
  output_col = "sports_clean"
)

# Inspect the mapping
table(athlete_metadata$sports_clean)
```

**The Value Add**: This regex-based classification is the core innovation. Organizations often label the same sport inconsistently:

- "Football", "Soccer", "FSI" → all mapped to "Football"
- "Track", "Field", "T&F" → all mapped to "Track & Field"

Without this automation, analysts spend hours manually categorizing athletes. The package includes patterns for 15+ sports and can be easily extended.
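
As a rough sketch of how such a classifier can work, the base R example below matches each group string against a named vector of patterns and returns the first sport that matches. The patterns, the `classify_one()` helper, and the "Unclassified" fallback are illustrative assumptions, not the package's built-in rules.

```{r taxonomy-sketch}
# Illustrative patterns only; the package's own patterns may differ
example_patterns <- c(
  "Football"      = "football|soccer|fsi",
  "Track & Field" = "track|field|t&f"
)

# Hypothetical helper: return the first sport whose pattern matches
classify_one <- function(group_names, patterns, default = "Unclassified") {
  for (sport in names(patterns)) {
    if (grepl(patterns[[sport]], group_names, ignore.case = TRUE)) {
      return(sport)
    }
  }
  default
}

group_strings <- c("Soccer, U18", "T&F Seniors", "Chess Club")
classified <- vapply(group_strings, classify_one, character(1),
                     patterns = example_patterns, USE.NAMES = FALSE)
classified
```

Because the first match wins, pattern order matters when a group string could match more than one sport.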

## Step 4: Transform to Wide Format and Join

Combine trials into tests, pivot to wide format, and merge with metadata:

```{r transform-wide}
library(dplyr)

# Join trials and tests
all_data <- left_join(trials_df, tests_df, by = c("testId", "athleteId"))

# Aggregate trials and pivot to wide format
structured_test_data <- all_data %>%
  group_by(athleteId, testId, testType, recordedUTC,
           recordedDateOffset, trialLimb, definition_name) %>%
  summarise(
    mean_result = mean(as.numeric(value), na.rm = TRUE),
    mean_weight = mean(as.numeric(weight), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    TestTimestampUTC = lubridate::ymd_hms(recordedUTC),
    TestTimestampLocal = TestTimestampUTC + lubridate::minutes(recordedDateOffset),
    Testdate = as.Date(TestTimestampLocal)
  ) %>%
  select(athleteId, Testdate, testId, testType, trialLimb,
         definition_name, mean_result, mean_weight) %>%
  tidyr::pivot_wider(
    id_cols = c(athleteId, Testdate, testId, mean_weight),
    names_from = c(definition_name, trialLimb, testType),
    values_from = mean_result,
    names_glue = "{definition_name}_{trialLimb}_{testType}"
  ) %>%
  rename(Weight_on_Test_Day = mean_weight)

# Join with metadata
final_analysis_data <- structured_test_data %>%
  mutate(profileId = as.character(athleteId)) %>%
  left_join(
    athlete_metadata %>% mutate(profileId = as.character(profileId)),
    by = "profileId"
  ) %>%
  mutate(
    Testdate = as.Date(Testdate),
    dateofbirth = as.Date(dateOfBirth),
    age = as.numeric((Testdate - dateofbirth) / 365.25),
    sports = sports_clean
  )

cat("Final dataset:", nrow(final_analysis_data), "rows with",
    ncol(final_analysis_data), "columns\n")
```
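
To see what the pivot produces, here is a small self-contained example on made-up values. It uses the same `names_glue` template as above; the toy rows are assumptions for illustration only.

```{r pivot-sketch}
library(tidyr)

# Toy long-format data: one row per test, metric, and limb
long <- data.frame(
  athleteId       = "a1",
  testId          = c("t1", "t1", "t2"),
  testType        = c("CMJ", "CMJ", "DJ"),
  trialLimb       = "Both",
  definition_name = c("PEAK_FORCE", "JUMP_HEIGHT", "PEAK_FORCE"),
  mean_result     = c(2450, 38.5, 3100)
)

wide <- pivot_wider(
  long,
  id_cols     = c(athleteId, testId),
  names_from  = c(definition_name, trialLimb, testType),
  values_from = mean_result,
  names_glue  = "{definition_name}_{trialLimb}_{testType}"
)

names(wide)
wide
```

Each test occupies one row, and metrics that belong to a different test type are `NA` in that row. This sparsity is what the split in the next step removes.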

## Step 5: Split by Test Type

The "Don't Repeat Yourself" (DRY) principle in action:

```{r split-tests}
# Split into separate datasets per test type
test_datasets <- split_by_test(
  data = final_analysis_data,
  metadata_cols = c("profileId", "sex", "Testdate", "dateofbirth",
                    "age", "testId", "Weight_on_Test_Day", "sports")
)

# Access individual test types
cmj_data <- test_datasets$CMJ
dj_data <- test_datasets$DJ

# Crucially: column names are now generic
head(names(cmj_data))
# "profileId", "sex", "Testdate", "PEAK_FORCE_Both", "JUMP_HEIGHT_Both", ...
# Note: "_CMJ" suffix has been removed!
```

**Why this matters**: You can now write one analysis function that works for every test type that reports the metric:

```{r generic-analysis}
analyze_peak_force <- function(test_data) {
  summary(test_data$PEAK_FORCE_Both)  # Works for CMJ, DJ, ISO, etc.
}

# Apply to all test types
lapply(test_datasets, analyze_peak_force)
```

Without suffix removal, you'd need separate code for `PEAK_FORCE_Both_CMJ`, `PEAK_FORCE_Both_DJ`, etc.
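
The suffix removal itself is simple to sketch in base R. The helper below is a hypothetical illustration of the idea for a single test type, not the implementation of `split_by_test()`.

```{r suffix-sketch}
# Hypothetical helper: keep metadata plus the columns of one test type,
# then drop the trailing "_<testType>" from the metric names
split_one_test <- function(df, test_type, metadata_cols) {
  suffix <- paste0("_", test_type)
  metric_cols <- names(df)[endsWith(names(df), suffix)]
  out <- df[, c(metadata_cols, metric_cols), drop = FALSE]
  names(out) <- c(
    metadata_cols,
    substr(metric_cols, 1, nchar(metric_cols) - nchar(suffix))
  )
  out
}

wide <- data.frame(
  profileId            = "a1",
  PEAK_FORCE_Both_CMJ  = 2450,
  JUMP_HEIGHT_Both_CMJ = 38.5,
  PEAK_FORCE_Both_DJ   = NA
)

cmj_only <- split_one_test(wide, "CMJ", metadata_cols = "profileId")
names(cmj_only)
```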

## Step 6: Patch Missing Metadata (Optional)

Fix missing or incorrect demographic data:

```{r patch-metadata}
# Create an Excel file with: profileId, sex, dateOfBirth
# Example: corrections.xlsx with rows like:
#   profileId         sex       dateOfBirth
#   abc123           Male      1995-03-15
#   def456           Female    1998-07-22

cmj_data <- patch_metadata(
  data = cmj_data,
  patch_file = "corrections.xlsx",
  patch_sheet = 1,
  id_col = "profileId",
  fields_to_patch = c("sex", "dateOfBirth")
)

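# Verify corrections (useNA shows any values that are still missing)
table(cmj_data$sex, useNA = "ifany")  # "Unknown" values should now be fixed

# Note: cmj_data carries `dateofbirth` (lowercase) and `age` from Step 4.
# If a date of birth was patched, check which column changed and recompute `age`.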
```
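
Conceptually, patching is a lookup by ID followed by a selective overwrite. The base R sketch below shows one way to do that on in-memory data frames; `patch_by_id()` is a hypothetical helper, and the real `patch_metadata()` may apply different rules (for example, about when an existing value is overwritten).

```{r patch-sketch}
# Hypothetical helper: overwrite fields with non-missing values from a patch table
patch_by_id <- function(data, patch, id_col, fields) {
  idx <- match(data[[id_col]], patch[[id_col]])
  for (f in fields) {
    replacement <- patch[[f]][idx]
    use <- !is.na(replacement)       # only rows that have a correction
    data[[f]][use] <- replacement[use]
  }
  data
}

athletes <- data.frame(
  profileId = c("abc123", "def456", "ghi789"),
  sex       = c("Unknown", "Female", NA),
  stringsAsFactors = FALSE
)
corrections <- data.frame(
  profileId = "abc123",
  sex       = "Male",
  stringsAsFactors = FALSE
)

patched <- patch_by_id(athletes, corrections, "profileId", "sex")
patched
```

Athletes without a row in the corrections table keep their original values, including any that are still missing.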

## Step 7: Generate Summary Statistics

Create publication-ready summary tables:

```{r summary-stats}
cmj_summary <- summary_vald_metrics(
  data = cmj_data,
  group_vars = c("sex", "sports"),
  exclude_cols = c("profileId", "testId", "Testdate", "dateofbirth", "age")
)

# View summary
print(cmj_summary)

# Export to CSV
write.csv(cmj_summary, "cmj_summary_by_sport_sex.csv", row.names = FALSE)
```

**Output example**:
```
sex    sports      PEAK_FORCE_Both_Mean  PEAK_FORCE_Both_SD  PEAK_FORCE_Both_CV  PEAK_FORCE_Both_N
Male   Football    2450.32               245.67              10.02               45
Male   Basketball  2310.45               198.23              8.58                32
Female Football    1980.12               187.45              9.47                38
```
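
The four statistics per metric can be reproduced with a few lines of base R. The sketch assumes the CV column is the standard deviation expressed as a percentage of the mean, which approximately matches the example rows above; the exact definition used by `summary_vald_metrics()` is not confirmed here, and `summarise_metric()` is an illustrative name.

```{r cv-sketch}
# Assumed definition: CV (%) = 100 * SD / Mean, computed on non-missing values
summarise_metric <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  c(Mean = m, SD = s, CV = 100 * s / m, N = length(x))
}

toy <- data.frame(
  sports          = c("Football", "Football", "Football", "Basketball", "Basketball"),
  PEAK_FORCE_Both = c(2400, 2500, 2600, 2300, 2350)
)

# One row per sport, one column per statistic
by_sport <- t(sapply(split(toy$PEAK_FORCE_Both, toy$sports), summarise_metric))
by_sport
```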

## Step 8: Visualize Trends

Track performance over time:

```{r plot-trends}
library(ggplot2)

# Plot CMJ peak force trends by athlete
plot_vald_trends(
  data = cmj_data,
  date_col = "Testdate",
  metric_col = "PEAK_FORCE_Both",
  group_col = "profileId",
  facet_col = "sex",
  title = "CMJ Peak Force Trends by Athlete",
  smooth = TRUE
)

# Plot sport-level averages over time
sport_trends <- cmj_data %>%
  group_by(Testdate, sports) %>%
  summarise(avg_force = mean(PEAK_FORCE_Both, na.rm = TRUE), .groups = "drop")

plot_vald_trends(
  data = sport_trends,
  date_col = "Testdate",
  metric_col = "avg_force",
  group_col = "sports",
  title = "Average CMJ Peak Force by Sport Over Time"
)
```

## Step 9: Compare Across Groups

Create boxplots for cross-sectional comparisons:

```{r plot-compare}
plot_vald_compare(
  data = cmj_data,
  metric_col = "PEAK_FORCE_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Peak Force Comparison by Sport and Sex"
)

# Compare jump height
plot_vald_compare(
  data = cmj_data,
  metric_col = "JUMP_HEIGHT_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Jump Height Comparison"
)
```

## Advanced: Multi-Test Analysis

Analyze multiple test types simultaneously:

```{r multi-test}
# Define a function to extract a common metric across test types
compare_metric_across_tests <- function(test_datasets, metric = "PEAK_FORCE_Both") {

  results <- lapply(names(test_datasets), function(test_name) {
    test_data <- test_datasets[[test_name]]

    if (metric %in% names(test_data)) {
      data.frame(
        testType = test_name,
        metric = metric,
        mean = mean(test_data[[metric]], na.rm = TRUE),
        sd = sd(test_data[[metric]], na.rm = TRUE),
        n = sum(!is.na(test_data[[metric]]))
      )
    }
  })

  do.call(rbind, results)
}

# Compare peak force across CMJ, DJ, and ISO
force_comparison <- compare_metric_across_tests(test_datasets, "PEAK_FORCE_Both")
print(force_comparison)
```

## Best Practices for Production Use

### 1. Schedule Regular Updates

```{r scheduled-updates, eval=FALSE}
# Weekly refresh script
library(vald.extractor)

# Fetch only new data since last update
last_update <- "2024-01-01T00:00:00Z"

new_data <- fetch_vald_batch(
  start_date = last_update,
  chunk_size = 100
)

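# Append to the existing database.
# load() restores `existing_tests` and `existing_trials` saved by the previous run.
load("vald_database.RData")
existing_tests <- rbind(existing_tests, new_data$tests)
existing_trials <- rbind(existing_trials, new_data$trials)

# Save under the same object names so the next run can load them again
save(existing_tests, existing_trials, file = "vald_database.RData")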
```
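
Two refinements are worth considering for a scheduled job: deriving the start date from the stored data instead of hard-coding it, and removing tests that were fetched twice. The sketch below uses toy data frames and assumes `recordedUTC` values are identically formatted ISO 8601 UTC strings, so that they sort chronologically as text.

```{r incremental-sketch}
# Toy stand-ins for the stored and newly fetched test tables
stored_tests <- data.frame(
  testId      = c("t1", "t2"),
  recordedUTC = c("2024-01-01T08:00:00Z", "2024-01-05T09:30:00Z")
)
fetched_tests <- data.frame(
  testId      = c("t2", "t3"),
  recordedUTC = c("2024-01-05T09:30:00Z", "2024-01-08T10:00:00Z")
)

# Most recent stored timestamp, usable as the next start_date
last_update <- max(stored_tests$recordedUTC)

# Append, then keep the first occurrence of each testId
combined <- rbind(stored_tests, fetched_tests)
combined <- combined[!duplicated(combined$testId), ]

last_update
combined
```

Fetching from the last stored timestamp can return the boundary test again, which is why the de-duplication step follows the append.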

### 2. Error Logging

The chunked extraction automatically logs errors without halting:

```{r error-logging}
# Errors are printed to console with chunk information:
# "ERROR on chunk 23 (rows 2201-2300): API timeout"
# "Continuing to next chunk..."

# This ensures partial data extraction even if some chunks fail
```

### 3. Version Control Your Taxonomy

Store your sports classification rules in a separate config file:

```{r taxonomy-config}
# sports_taxonomy.R
sports_patterns <- list(
  Football = "Football|FSI|TCFC|MCFC|Soccer",
  Basketball = "Basketball|BBall",
  Cricket = "Cricket"
  # ... add your organization's patterns (no trailing comma after the last entry)
)

# Then use in classify_sports()
```

## Conclusion

The `vald.extractor` package transforms raw VALD API data into analysis-ready datasets with:

- **Fault-tolerant extraction**: Chunked processing with error handling
- **Automated taxonomy**: Regex-based sports classification saves hours of manual work
- **Generic programming**: Suffix removal enables DRY analysis code
- **Publication-ready outputs**: Summary tables and professional visualizations

This workflow has been tested in production on 10,000+ tests across 15+ sports, and the package is designed for CRAN submission.

## Next Steps

- Customize sports taxonomy for your organization
- Integrate with existing reporting pipelines
- Schedule automated weekly updates
- Explore advanced visualizations with `ggplot2` extensions

For issues or feature requests, visit: [GitHub Issues](https://github.com/praveenmaths89/vald.extractor/issues)
