| Title: | Inductive Node-Splitting Cross-Validation for Community Detection |
|---|---|
| Description: | Implements Inductive Node-Splitting Cross-Validation (INCV) for selecting the number of communities in stochastic block models. Provides f-fold and random-split node-level cross-validation, along with competing methods including CROISSANT, Edge Cross-Validation (ECV), and Node Cross-Validation (NCV). Supports both SBM and Degree-Corrected Block Models (DCBM), with multiple loss functions (L2, binomial deviance, AUC). Includes network simulation utilities for SBM, RDPG, and latent space models. |
| Authors: | Lin Zhang [aut, cre], Bokai Yang [aut] |
| Maintainer: | Lin Zhang <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-12 06:33:51 UTC |
| Source: | https://github.com/ivylinzhang97/incv-community-detection |
Negative AUC for matrix predictions
AUC(A, P)AUC(A, P)
A |
Observed binary adjacency (matrix or vector). |
P |
Predicted probability (matrix or vector). |
Negative AUC value (for minimisation-based model selection).
Matches community labels in lab to reference labels in fixed
via a greedy maximum-overlap permutation.
best.perm.label.match(lab, fixed, n = length(lab), K = max(lab, fixed))best.perm.label.match(lab, fixed, n = length(lab), K = max(lab, fixed))
lab |
Integer vector of labels to permute. |
fixed |
Integer vector of reference labels. |
n |
Number of nodes (default |
K |
Number of communities (default |
A K x K permutation matrix.
Binomial deviance loss
bin.dev(x, y)bin.dev(x, y)
x |
Observed binary matrix/vector. |
y |
Predicted probability matrix/vector. |
Total binomial deviance (non-finite terms set to 0).
Uses parallel edge sampling for speed; returns a sparse adjacency matrix.
blockmodel.gen.fast(n, K, B, psi = rep(1, n), PI = rep(1/K, K), ncore = 1)blockmodel.gen.fast(n, K, B, psi = rep(1, n), PI = rep(1/K, K), ncore = 1)
n |
Number of nodes. |
K |
Number of communities. |
B |
K x K block probability matrix. |
psi |
Degree heterogeneity vector (length |
PI |
Community prior probabilities (length |
ncore |
Number of cores for parallel computation. |
A list with:
A |
Sparse symmetric adjacency matrix. |
member |
Community membership vector. |
psi |
Scaled degree heterogeneity vector. |
Generates an adjacency matrix from a planted-partition SBM with k
communities. The first community has size n1; the remaining
communities share the leftover nodes roughly equally.
community.sim(k = 2, n = 1000, n1 = 100, p = 0.3, q = 0.1)community.sim(k = 2, n = 1000, n1 = 100, p = 0.3, q = 0.1)
k |
Number of communities (default 2). |
n |
Total number of nodes (default 1000). |
n1 |
Size of the smallest community (default 100). |
p |
Within-community connection probability (default 0.3). |
q |
Between-community connection probability (default 0.1). |
A list with:
membership |
Integer vector of community labels (length |
adjacency |
n x n binary symmetric adjacency matrix. |
net <- community.sim(k = 3, n = 120, n1 = 30, p = 0.5, q = 0.1) table(net$membership)net <- community.sim(k = 3, n = 120, n1 = 30, p = 0.5, q = 0.1) table(net$membership)
Generates an SBM where block probabilities decay exponentially with
the distance between community indices: B[k1,k2] = rho * eta^min(|k1-k2|,3).
community.sim.sbm(n, n1, eta = 0.3, rho = 0.1, K = 3)community.sim.sbm(n, n1, eta = 0.3, rho = 0.1, K = 3)
n |
Total number of nodes. |
n1 |
Size of the first (smallest) community. |
eta |
Decay parameter for inter-community probabilities (default 0.3). |
rho |
Scaling factor for the block probability matrix (default 0.1). |
K |
Number of communities (default 3). |
A list with:
adjacency |
n x n binary symmetric adjacency matrix. |
membership |
Integer vector of community labels. |
conn |
K x K block probability matrix. |
Cross-validated, Overlapping, In-Sample Selection of the Number of
communities and model Type. Jointly selects between SBM and DCBM and
the number of communities K using overlapping node subsamples.
croissant.blockmodel( A, K.CAND, s, o, R, tau = 0, laplace = FALSE, dc.est = 2, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )croissant.blockmodel( A, K.CAND, s, o, R, tau = 0, laplace = FALSE, dc.est = 2, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )
A |
Adjacency matrix (n x n, can be sparse). |
K.CAND |
Candidate community numbers (integer vector, or a single
integer interpreted as |
s |
Number of non-overlapping subsamples. |
o |
Size of the overlapping set. |
R |
Number of independent repetitions. |
tau |
Regularisation parameter for the adjacency (default 0). |
laplace |
Logical; use graph Laplacian normalisation (default FALSE). |
dc.est |
DCBM estimation method: 1 = eigenvector, 2 = fast (default 2). |
loss |
Character vector of loss functions to evaluate.
Supported: |
ncore |
Number of cores for parallel computation. |
A list containing:
loss |
data.table of loss values by candidate model and K. |
Candidate Models |
"SBM and DCBM". |
Candidate Values |
The candidate K vector. |
*.model |
Selected model string (e.g. "SBM-3") for each loss. |
Selects the latent dimension for a latent space network model using the CROISSANT framework with MLE fitting.
croissant.latent( A, d.cand, s, o, R, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )croissant.latent( A, d.cand, s, o, R, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )
A |
Adjacency matrix. |
d.cand |
Candidate embedding ranks. |
s |
Number of non-overlapping subsamples. |
o |
Overlap size. |
R |
Number of repetitions. |
loss |
Loss functions to evaluate. |
ncore |
Number of cores. |
A list with loss table and selected dimension per loss.
Selects the embedding rank for a Random Dot Product Graph model using the CROISSANT overlapping node-subsample framework.
croissant.rdpg( A, d.cand, s, o, R, laplace = FALSE, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )croissant.rdpg( A, d.cand, s, o, R, laplace = FALSE, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )
A |
Adjacency matrix. |
d.cand |
Candidate embedding ranks. |
s |
Number of non-overlapping subsamples. |
o |
Overlap size. |
R |
Number of repetitions. |
laplace |
Use Laplacian normalisation (default FALSE). |
loss |
Loss functions to evaluate. |
ncore |
Number of cores. |
A list with loss table and selected rank per loss.
Selects the regularisation parameter tau for spectral clustering using the CROISSANT framework.
croissant.tune.regsp( A, K, tau.cand, DCBM = FALSE, s, o, R, laplace = FALSE, dc.est = 2, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )croissant.tune.regsp( A, K, tau.cand, DCBM = FALSE, s, o, R, laplace = FALSE, dc.est = 2, loss = c("l2", "bin.dev", "AUC"), ncore = 1 )
A |
Adjacency matrix. |
K |
Fixed number of communities. |
tau.cand |
Candidate regularisation parameter values. |
DCBM |
Logical; if TRUE, use row-normalised eigenvectors (DCBM). |
s, o, R
|
CROISSANT design parameters. |
laplace |
Use Laplacian normalisation. |
dc.est |
DCBM estimation type (1 or 2). |
loss |
Loss functions to evaluate. |
ncore |
Number of cores. |
A list with loss table and selected tau per loss.
Selects both the model type (SBM vs DCBM) and number of communities by holding out random edges and evaluating predictive performance.
ECV.for.blockmodel( A, max.K, cv = NULL, B = 3, holdout.p = 0.1, tau = 0, dc.est = 2, kappa = NULL )ECV.for.blockmodel( A, max.K, cv = NULL, B = 3, holdout.p = 0.1, tau = 0, dc.est = 2, kappa = NULL )
A |
Adjacency matrix (n x n). |
max.K |
Maximum number of communities to consider. |
cv |
Number of cross-validation folds (NULL for holdout; default NULL). |
B |
Number of holdout repetitions (default 3). |
holdout.p |
Fraction of edges held out (default 0.1). |
tau |
Regularisation parameter (default 0). |
dc.est |
DCBM estimation type (1 or 2; default 2). |
kappa |
Truncation parameter (default NULL). |
The algorithm is based on foundation code from Lin Zhang's PhD dissertation in 2019.
A list with loss vectors and selected model strings:
l2 |
SBM L2 losses by K. |
dev |
SBM deviance losses by K. |
auc |
SBM AUC losses by K. |
dc.l2, dc.dev, dc.auc
|
DCBM losses. |
l2.model, dev.model, auc.model
|
Selected model strings. |
Selects the embedding rank for an RDPG by holding out random edges.
ECV.undirected.Rank( A, max.K, B = 3, holdout.p = 0.1, soft = FALSE, fast = FALSE )ECV.undirected.Rank( A, max.K, B = 3, holdout.p = 0.1, soft = FALSE, fast = FALSE )
A |
Adjacency matrix. |
max.K |
Maximum rank to consider. |
B |
Number of holdout repetitions (default 3). |
holdout.p |
Fraction of edges held out (default 0.1). |
soft |
Logical (default FALSE). |
fast |
Logical (default FALSE). |
The algorithm is based on foundation code from Lin Zhang's PhD dissertation.
A list with:
rank.sse, rank.dev, rank.auc
|
Selected ranks by each loss. |
sse, dev, auc
|
Average loss vectors. |
Given a linear index u over the upper triangle of a symmetric matrix
(column-major order), returns the corresponding row (x) and column
(y) indices. The mapping is independent of matrix size.
edge.index.map(u)edge.index.map(u)
u |
Integer vector of linear edge indices (1-based). |
A list with components x (row indices) and y (column indices).
edge.index.map(1:6)edge.index.map(1:6)
Estimates DCBM parameters using the row-norms of the leading eigenvectors for degree heterogeneity.
eigen.DCBM.est(A, g, n = nrow(A), K = max(g), psi.omit = 0, p.sample = 1)eigen.DCBM.est(A, g, n = nrow(A), K = max(g), psi.omit = 0, p.sample = 1)
A |
Adjacency matrix. |
g |
Community label vector. |
n |
Number of nodes. |
K |
Number of communities. |
psi.omit |
Number of leading nodes to exclude from psi estimation (used in CROISSANT overlapping designs). |
p.sample |
Sampling proportion for correction (default 1). |
A list with Bsum and psi.
Estimates the block sum matrix and degree heterogeneity parameters for a Degree-Corrected Block Model.
fast.DCBM.est(A, g, n = nrow(A), K = max(g), psi.omit = 0, p.sample = 1)fast.DCBM.est(A, g, n = nrow(A), K = max(g), psi.omit = 0, p.sample = 1)
A |
Adjacency matrix. |
g |
Community label vector. |
n |
Number of nodes. |
K |
Number of communities. |
psi.omit |
Number of leading nodes to exclude from psi estimation (used in CROISSANT overlapping designs). |
p.sample |
Sampling proportion for correction (default 1). |
A list with:
Bsum |
K x K block sum matrix (divided by |
psi |
Degree heterogeneity vector. |
Estimates the K x K block probability matrix from an adjacency matrix and community labels.
fast.SBM.est(A, g, n = nrow(A), K = max(g))fast.SBM.est(A, g, n = nrow(A), K = max(g))
A |
Adjacency matrix (n x n). |
g |
Integer community label vector (length n). |
n |
Number of nodes. |
K |
Number of communities. |
K x K estimated block probability matrix.
L2 loss between two matrices
l2(x, y)l2(x, y)
x, y
|
Numeric matrices of the same dimension. |
The Frobenius norm sqrt(sum((x - y)^2)).
Generate a latent space network
latent.gen(n, d, alpha = 1, sparsity = 1, ncore = 1)latent.gen(n, d, alpha = 1, sparsity = 1, ncore = 1)
n |
Number of nodes. |
d |
Latent dimension. |
alpha |
Intercept parameter controlling overall density. |
sparsity |
Sparsity multiplier. |
ncore |
Number of cores for parallel computation. |
A list with:
A |
Sparse symmetric adjacency matrix. |
Z |
n x d latent position matrix. |
Apply label permutation to match reference
matched.lab(lab, fixed, n = length(lab), K = max(lab, fixed))matched.lab(lab, fixed, n = length(lab), K = max(lab, fixed))
lab |
Integer vector of labels to relabel. |
fixed |
Integer vector of reference labels. |
n |
Number of nodes. |
K |
Number of communities. |
Relabelled integer vector aligned to fixed.
Selects both the model type (SBM vs DCBM) and number of communities by holding out random nodes and evaluating predictive performance on the held-out subgraph.
NCV.for.blockmodel(A, max.K, cv = 3, dc.est = 1)NCV.for.blockmodel(A, max.K, cv = 3, dc.est = 1)
A |
Adjacency matrix (n x n). |
max.K |
Maximum number of communities to consider. |
cv |
Number of node folds (default 3). |
dc.est |
DCBM estimation type (1 or 2; default 1). |
A list with:
dev, l2, auc
|
SBM loss vectors by K. |
dc.dev, dc.l2, dc.auc
|
DCBM loss vectors by K. |
l2.model, dev.model, auc.model
|
Selected model strings. |
Computes -n * log(p), returning 0 when p <= 0.
neglog(n = 1, p = 0.5)neglog(n = 1, p = 0.5)
n |
Numeric count. |
p |
Probability (between 0 and 1). |
Numeric value of -n * log(p) or 0 if p <= 0.
Performs f-fold node-split cross-validation to select the number of
communities k in a Stochastic Block Model. Nodes are randomly
partitioned into f folds; for each fold the held-out nodes are
assigned to communities via the training-set spectral clustering, and the
held-out negative log-likelihood and MSE are computed.
nscv.f.fold( A, k.vec = 2:6, restricted = TRUE, f = 10, method = "affinity", p.est.type = 3 )nscv.f.fold( A, k.vec = 2:6, restricted = TRUE, f = 10, method = "affinity", p.est.type = 3 )
A |
Symmetric adjacency matrix (n x n, binary). |
k.vec |
Integer vector of candidate community numbers (default 2:6). |
restricted |
Logical. If |
f |
Number of folds (default 10). |
method |
Inference method for assigning held-out nodes:
|
p.est.type |
How to re-estimate the probability matrix for evaluation: 1 = training only, 2 = training + testing, 3 = testing only (default 3). |
A list with:
k.loss |
Selected |
k.mse |
Selected |
cv.loss |
Average CV negative log-likelihood for each |
cv.mse |
Average CV MSE for each |
set.seed(42) net <- community.sim(k = 3, n = 150, n1 = 50, p = 0.5, q = 0.1) result <- nscv.f.fold(net$adjacency, k.vec = 2:5, f = 5) result$k.lossset.seed(42) net <- community.sim(k = 3, n = 150, n1 = 50, p = 0.5, q = 0.1) result <- nscv.f.fold(net$adjacency, k.vec = 2:5, f = 5) result$k.loss
Performs repeated random-split node-level cross-validation. At each
iteration a random fraction split of nodes is used for training.
nscv.random.split( A, k.vec = 2:6, restricted = TRUE, split = 0.66, ite = 100, method = "affinity", p.est.type = 3 )nscv.random.split( A, k.vec = 2:6, restricted = TRUE, split = 0.66, ite = 100, method = "affinity", p.est.type = 3 )
A |
Symmetric adjacency matrix (n x n, binary). |
k.vec |
Integer vector of candidate community numbers (default 2:6). |
restricted |
Logical. If |
split |
Fraction of nodes used for training (default 0.66). |
ite |
Number of random split iterations (default 100). |
method |
Inference method for assigning held-out nodes:
|
p.est.type |
How to re-estimate the probability matrix for evaluation: 1 = training only, 2 = training + testing, 3 = testing only (default 3). |
A list with:
k.loss |
Selected |
k.mse |
Selected |
cv.loss |
Average CV negative log-likelihood for each |
cv.mse |
Average CV MSE for each |
Given a clustering and adjacency matrix, estimates the block probability matrix and computes the negative log-likelihood.
SBM.prob(cluster, k, A, restricted = TRUE)SBM.prob(cluster, k, A, restricted = TRUE)
cluster |
Integer vector of cluster labels. |
k |
Number of clusters. |
A |
Adjacency matrix corresponding to the nodes in |
restricted |
Logical. If |
A list with:
p.matrix |
k x k estimated block probability matrix. |
negloglike |
Negative log-likelihood of the model. |
Performs spectral clustering on an adjacency matrix by computing the top
k singular vectors and applying k-means++ via ClusterR::KMeans_rcpp.
SBM.spectral.clustering(A, k = 2)SBM.spectral.clustering(A, k = 2)
A |
Symmetric adjacency matrix (n x n, binary or weighted). |
k |
Number of clusters (default 2). |
A list with component cluster, an integer vector of length n.
net <- community.sim(k = 3, n = 90, n1 = 30, p = 0.5, q = 0.1) cl <- SBM.spectral.clustering(net$adjacency, k = 3) table(cl$cluster, net$membership)net <- community.sim(k = 3, n = 90, n1 = 30, p = 0.5, q = 0.1) cl <- SBM.spectral.clustering(net$adjacency, k = 3) table(cl$cluster, net$membership)
Generate a sparse RDPG network
sparse.RDPG.gen(n, d, sparsity.multiplier = 1, ncore = 1)sparse.RDPG.gen(n, d, sparsity.multiplier = 1, ncore = 1)
n |
Number of nodes. |
d |
Latent dimension. |
sparsity.multiplier |
Multiplier to control network density. |
ncore |
Number of cores for parallel computation. |
A list with:
A |
Sparse symmetric adjacency matrix. |
P |
n x n probability matrix. |