---
title: "Introduction to sparsecommunity"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to sparsecommunity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse   = TRUE,
  comment    = "#>",
  fig.width  = 6,
  fig.height = 4,
  fig.align  = "center"
)
set.seed(42)
```

## Overview

**sparsecommunity** implements spectral clustering for community detection in
sparse networks. It provides:

- `community_detect()`: main estimation function, supporting both the stochastic
  block model (`"sbm"`) and the degree-corrected stochastic block model
  (`"dcsbm"`).
- `simulate_sbm()` / `simulate_dcsbm()`: simulation utilities for both models.
- `misclustering_rate()`: best-permutation evaluation metric.

The methods implement the regularized normalized Laplacian spectral clustering
of Lei and Rinaldo (2015), which is consistent for community recovery when the
maximum expected degree grows as slowly as $\log(n)$ — the sparse regime
where many standard spectral methods break down.

```{r load-package}
library(sparsecommunity)
```

---

## 1. Stochastic block model (SBM)

### 1.1 Simulating SBM data

A stochastic block model with $K$ communities is specified by a block
probability matrix $B \in [0,1]^{K \times K}$. Nodes in community $k$ and
community $\ell$ are connected independently with probability $B_{k\ell}$.

```{r sim-sbm}
# Two balanced communities, n = 300 nodes
# Within-community edge probability: 0.25; between: 0.04
B_sbm <- matrix(c(0.25, 0.04,
                   0.04, 0.25), nrow = 2)

sim_sbm <- simulate_sbm(n = 300, K = 2, B = B_sbm, seed = 1)
print(sim_sbm)
```

The returned object contains the sparse adjacency matrix `A` and the true
community labels:

```{r sbm-structure}
# Mean degree (note: sparse regime ~ log(n)/n * n = log(n))
mean(Matrix::rowSums(sim_sbm$A))
```

### 1.2 Fitting the SBM model

For the standard SBM, `community_detect()` embeds the network via the
regularized normalized Laplacian and applies $k$-means:

```{r fit-sbm}
fit_sbm <- community_detect(sim_sbm$A, K = 2, model = "sbm", seed = 1)
print(fit_sbm)
```

The fitted object contains the community labels, the spectral embedding, the
top eigenvalues, and the clustering objective:

```{r sbm-components}
# Top-K eigenvalues of the regularized Laplacian
fit_sbm$eigenvalues

# Community sizes
table(fit_sbm$labels)
```

### 1.3 Evaluating accuracy

`misclustering_rate()` computes the fraction of misclassified nodes under the
best alignment of estimated and true labels (Hungarian algorithm):

```{r sbm-accuracy}
misclustering_rate(sim_sbm$labels, fit_sbm$labels)
```

### 1.4 Examining the spectral embedding

The two-dimensional embedding reveals the cluster structure directly:

```{r sbm-embedding, fig.cap="Spectral embedding for SBM. Points are colored by true community."}
U <- fit_sbm$embedding
plot(U[, 1], U[, 2],
     col  = sim_sbm$labels + 1,
     pch  = 19, cex = 0.6,
     xlab = "Eigenvector 1",
     ylab = "Eigenvector 2",
     main = "SBM: spectral embedding")
legend("topright", legend = c("Community 1", "Community 2"),
       col = 2:3, pch = 19, bty = "n")
```

---

## 2. Degree-corrected stochastic block model (DCSBM)

In many real networks, nodes within the same community have very different
degrees. The DCSBM captures this by introducing per-node degree parameters
$\theta_i > 0$: the edge probability between nodes $i$ and $j$ is
$\theta_i \theta_j B_{z_i z_j}$.

Under degree heterogeneity, standard spectral methods fail because high-degree
nodes concentrate together in the embedding regardless of their community.
The spherical $k$-median algorithm (Lei and Rinaldo 2015, Theorem 4.2) corrects
for this by row-normalizing the embedding before clustering.

### 2.1 Simulating DCSBM data

```{r sim-dcsbm}
# Three communities with strong degree heterogeneity
B_dcsbm <- matrix(c(0.5, 0.04, 0.04,
                     0.04, 0.5, 0.04,
                     0.04, 0.04, 0.5), nrow = 3)

# Degree parameters: Uniform(0.3, 1.7), creating substantial heterogeneity
set.seed(2)
theta <- runif(400, min = 0.3, max = 1.7)

sim_dcsbm <- simulate_dcsbm(n = 400, K = 3, B = B_dcsbm,
                              theta = theta, seed = 2)
print(sim_dcsbm)
```

### 2.2 Why standard SBM clustering fails under degree heterogeneity

Applying the SBM method to DCSBM data shows the problem:

```{r dcsbm-sbm-fail}
fit_wrong <- community_detect(sim_dcsbm$A, K = 3, model = "sbm", seed = 2)
cat("Misclustering rate (SBM method on DCSBM data):",
    misclustering_rate(sim_dcsbm$labels, fit_wrong$labels), "\n")
```

### 2.3 Fitting the DCSBM model

The `"dcsbm"` model row-normalizes the embedding and applies spherical
$k$-median clustering:

```{r fit-dcsbm}
fit_dcsbm <- community_detect(sim_dcsbm$A, K = 3, model = "dcsbm", seed = 2)
print(fit_dcsbm)

cat("Misclustering rate (DCSBM method):",
    misclustering_rate(sim_dcsbm$labels, fit_dcsbm$labels), "\n")
```

### 2.4 Embedding comparison

The row-normalized embedding maps all nodes to the unit sphere. Degree
heterogeneity is absorbed by the normalization, and clusters separate by angle:

```{r dcsbm-embedding, fig.cap="Row-normalized spectral embedding for DCSBM. Colors indicate true communities."}
U_dc <- fit_dcsbm$embedding
plot(U_dc[, 1], U_dc[, 2],
     col  = sim_dcsbm$labels + 1,
     pch  = 19, cex = 0.5,
     xlab = "Eigenvector 1 (normalized)",
     ylab = "Eigenvector 2 (normalized)",
     main = "DCSBM: row-normalized spectral embedding")
legend("topright",
       legend = paste("Community", 1:3),
       col = 2:4, pch = 19, bty = "n")
```

---

## 3. Real-data example: Zachary's karate club

Zachary's (1977) karate club network is a benchmark for community detection.
The network has 34 members (nodes) and 78 friendships (edges). A conflict led
to a split into two factions, providing known ground-truth community membership.

```{r karate-data, message=FALSE}
if (!requireNamespace("igraphdata", quietly = TRUE)) {
  message("igraphdata not installed; skipping real-data example.")
  knitr::knit_exit()
}

library(igraph)
data("karate", package = "igraphdata")

# Extract adjacency matrix and true community labels
A_karate  <- igraph::as_adjacency_matrix(karate, sparse = TRUE)
true_comm <- igraph::V(karate)$Faction
cat("Nodes:", vcount(karate), "| Edges:", ecount(karate),
    "| Communities:", length(unique(true_comm)), "\n")
cat("Community sizes:", table(true_comm), "\n")
cat("Mean degree:", round(mean(degree(karate)), 2), "\n")
```

```{r karate-fit}
fit_karate <- community_detect(A_karate, K = 2, model = "sbm",
                                n_init = 30, seed = 42)
summary(fit_karate)

cat("Misclustering rate:", misclustering_rate(true_comm, fit_karate$labels), "\n")
```

```{r karate-plot, fig.cap="Karate club network. Node color = detected community; node shape = true faction."}
# Plot network colored by detected community
shape_map <- ifelse(true_comm == 1, "circle", "square")
igraph::plot.igraph(
  karate,
  vertex.color = fit_karate$labels + 1,
  vertex.shape = shape_map,
  vertex.size  = 8,
  vertex.label = NA,
  main         = "Karate club: detected vs. true communities"
)
legend("bottomleft",
       legend = c("Detected: 1", "Detected: 2"),
       fill   = 2:3, bty = "n", cex = 0.9)
legend("bottomright",
       legend = c("True: faction 1", "True: faction 2"),
       pch    = c(19, 15), bty = "n", cex = 0.9)
```

---

## 4. Real-data example: NCAA college football (Girvan & Newman 2002)

The NCAA football network is a standard benchmark for community detection with
$K > 2$ communities. The network has 115 Division I-A college football teams
(nodes) connected by regular-season games during Fall 2000 (613 edges). Teams
are divided into 12 athletic conferences, providing unambiguous ground-truth
community labels (Girvan & Newman 2002).

The network is bundled directly in **sparsecommunity**:

```{r football-data}
data("football")
cat("Nodes:", nrow(football_A), "| Edges:", sum(football_A) / 2, "\n")
cat("Mean degree:", round(mean(Matrix::rowSums(football_A)), 2),
    "  log(n):", round(log(nrow(football_A)), 2), "\n")
table(football_labels)   # 12 conferences
```

The mean degree (10.66) is well above $\log(115) = 4.74$, placing the network
in the sparse-but-identifiable regime. We first use `estimate_K()` to check
whether the number of conferences can be recovered from the data alone:

```{r football-estimate-K}
estimate_K(football_A, K_max = 15)   # true K = 12
```

The estimator returns 10, slightly underestimating due to two very small
conferences (sizes 5 and 7) that are hard to distinguish spectrally. We
proceed with the known $K = 12$:

```{r football-fit}
fit_football <- community_detect(football_A, K = 12, model = "sbm",
                                  n_init = 30, seed = 1)
misclustering_rate(football_labels, fit_football$labels)
```

**sparsecommunity** misclassifies 10 of 115 teams (8.7%). These are primarily
independent teams and teams from small conferences that schedule games across
conference lines. The spectral embedding confirms the 12-cluster structure:

```{r football-plot, fig.cap="Spectral embedding of the football network. Colors indicate detected community; the 12 athletic conferences are largely separated."}
plot(fit_football)
```

---

## 5. Summary of the `community_detect()` interface

| Argument   | Default   | Description |
|------------|-----------|-------------|
| `A`        | —         | Adjacency matrix (sparse or dense) |
| `K`        | —         | Number of communities |
| `model`    | `"dcsbm"` | `"sbm"` for k-means; `"dcsbm"` for spherical k-median |
| `n_init`   | `20`      | Random restarts for clustering |
| `max_iter` | `100`     | Max iterations per restart |
| `reg`      | `TRUE`    | Degree regularization in Laplacian embedding |
| `seed`     | `NULL`    | Random seed |

The returned `"sparsecommunity"` object contains:

| Component     | Description |
|---------------|-------------|
| `labels`      | Integer vector of community assignments (length $n$) |
| `embedding`   | $n \times K$ spectral embedding matrix |
| `eigenvalues` | Top-$K$ eigenvalues of the regularized Laplacian |
| `centers`     | $K \times K$ cluster centers in embedding space |
| `objective`   | Within-cluster sum of distances at convergence |

---

## References

Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in
stochastic block models. *Annals of Statistics*, **43**(1), 215–237.
https://doi.org/10.1214/14-AOS1274

Girvan, M. and Newman, M. E. J. (2002). Community structure in social and
biological networks. *Proceedings of the National Academy of Sciences*,
**99**(12), 7821–7826. https://doi.org/10.1073/pnas.122653799

Zachary, W. W. (1977). An information flow model for conflict and fission
in small groups. *Journal of Anthropological Research*, **33**(4), 452–473.
