geomexp.evaluation package

Submodules

geomexp.evaluation.metrics module

Evaluation metrics for clustering experiments.

Provides wrappers around scikit-learn metrics plus custom diagnostics for bootstrap stability and cluster shape analysis.

geomexp.evaluation.metrics.adjusted_rand_index(y_true, y_pred)

Adjusted Rand Index between two partitions.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

ARI in \([-1, 1]\) (1 = perfect agreement; values near 0 indicate chance-level agreement).
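The pair-counting idea behind the ARI can be sketched directly from a contingency table. This is an illustrative NumPy-only version, not the package's implementation (which, per the module description, wraps scikit-learn); the name `ari_sketch` is hypothetical.

```python
import numpy as np


def ari_sketch(y_true, y_pred):
    """Adjusted Rand Index from a contingency table (illustrative sketch)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = y_true.size
    if n < 2:
        return 1.0  # assumption: trivially identical partitions

    # Contingency table: table[i, j] = number of points in true cluster i
    # and predicted cluster j.
    _, ti = np.unique(y_true, return_inverse=True)
    _, pi = np.unique(y_pred, return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)

    def comb2(x):
        # Number of unordered pairs among x items.
        return x * (x - 1) / 2.0

    sum_ij = comb2(table).sum()            # pairs grouped together in both
    sum_a = comb2(table.sum(axis=1)).sum() # pairs grouped together in y_true
    sum_b = comb2(table.sum(axis=0)).sum() # pairs grouped together in y_pred
    expected = sum_a * sum_b / comb2(n)    # chance-level value of sum_ij
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return float((sum_ij - expected) / (max_index - expected))


# Label names do not matter: a pure relabelling scores 1.
same_up_to_relabelling = ari_sketch([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# A maximally "crossed" partition of four points scores about -0.5.
crossed = ari_sketch([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on which pairs of points are grouped together, it is invariant to permuting the label values in either argument.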

geomexp.evaluation.metrics.normalized_mutual_info(y_true, y_pred)

Normalised Mutual Information between two partitions.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

NMI in \([0, 1]\) (1 = perfect agreement).

geomexp.evaluation.metrics.silhouette(X, labels)

Mean silhouette coefficient.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
labels (ndarray[tuple[Any, ...], dtype[int_]]) – Cluster labels.

Return type:

float

Returns:

Silhouette score in \([-1, 1]\) (higher = better-separated clusters).
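For reference, the standard silhouette definition can be written out in a few lines of NumPy. This is a sketch under the usual conventions (Euclidean distance, at least two clusters, singleton clusters scored 0); `silhouette_sketch` is a hypothetical name and the package itself may simply delegate to scikit-learn.

```python
import numpy as np


def silhouette_sketch(X, labels):
    """Mean silhouette coefficient with Euclidean distances (sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix; fine for small n, O(n^2) memory.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    uniq = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: singleton clusters contribute 0
        # a: mean distance to the other members of the same cluster.
        a = D[i, own].sum() / (n_own - 1)
        # b: mean distance to the nearest other cluster.
        b = min(D[i, labels == k].mean() for k in uniq if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())


X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_sketch(X_demo, np.array([0, 0, 1, 1]))  # compact clusters
bad = silhouette_sketch(X_demo, np.array([0, 1, 0, 1]))   # clusters straddle the gap
```

The first labelling groups nearby points and scores close to 1; the second pairs each point with a distant one and scores below 0.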

geomexp.evaluation.metrics.davies_bouldin(X, labels)

Davies–Bouldin index (lower is better).

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
labels (ndarray[tuple[Any, ...], dtype[int_]]) – Cluster labels.

Return type:

float

Returns:

Davies–Bouldin score (non-negative; 0 is ideal).
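The Davies–Bouldin index compares within-cluster scatter to between-centroid separation. The sketch below follows the textbook definition with Euclidean distances and assumes distinct centroids; `davies_bouldin_sketch` is a hypothetical name, not the package's own code.

```python
import numpy as np


def davies_bouldin_sketch(X, labels):
    """Davies-Bouldin index with Euclidean distances (sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in uniq])
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(uniq, centroids)
    ])
    # Separation: pairwise distances between centroids.
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(sep, np.inf)  # exclude i == j (ratio becomes 0)
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    # For each cluster take its worst (largest) ratio, then average.
    return float(ratios.max(axis=1).mean())


X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
db = davies_bouldin_sketch(X_demo, np.array([0, 0, 1, 1]))
```

Here each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.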

geomexp.evaluation.metrics.stability_score(X, clusterer_factory, n_resamples=20, subsample_frac=0.8, rng=None)

Bootstrap stability of a clustering method.

Repeatedly draws sub-samples of the data, fits a fresh clusterer on each, and computes the ARI between every pair of fits, restricted to the points that both sub-samples share. The score is the mean of these pairwise ARIs.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
clusterer_factory (Callable[[], object]) – Zero-argument callable returning a fresh clusterer instance (must have a .fit(X) method returning a ClusterResult).
n_resamples (int) – Number of bootstrap resamples.
subsample_frac (float) – Fraction of data to draw per resample.
rng (Generator | None) – Optional random generator.

Return type:

float

Returns:

Mean pairwise ARI across resamples (higher = more stable).
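The resample-and-compare procedure described above might look roughly like the following. Several details are assumptions made for illustration: sub-samples are drawn without replacement, the clusterer is abstracted as a hypothetical `fit_labels` callable returning labels (rather than the `clusterer_factory` / `ClusterResult` interface, whose fields are not documented here), and scikit-learn supplies both the ARI and the example clusterer.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def stability_sketch(X, fit_labels, n_resamples=20, subsample_frac=0.8, rng=None):
    """Mean pairwise ARI over shared points of random sub-samples (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    m = max(2, int(round(subsample_frac * n)))

    # Fit on each sub-sample, remembering which rows were drawn.
    runs = []
    for _ in range(n_resamples):
        idx = np.sort(rng.choice(n, size=m, replace=False))  # assumption: no replacement
        runs.append((idx, np.asarray(fit_labels(X[idx]))))

    # Compare every pair of fits on the rows they have in common.
    scores = []
    for (ia, la), (ib, lb) in combinations(runs, 2):
        common, pos_a, pos_b = np.intersect1d(ia, ib, return_indices=True)
        if common.size < 2:
            continue  # not enough shared points to compare
        scores.append(adjusted_rand_score(la[pos_a], lb[pos_b]))
    return float(np.mean(scores)) if scores else float("nan")


# Two well-separated blobs: any reasonable clusterer should be stable here.
demo_rng = np.random.default_rng(0)
X_demo = np.vstack([
    demo_rng.normal(0.0, 0.1, size=(30, 2)),
    demo_rng.normal(5.0, 0.1, size=(30, 2)),
])
score = stability_sketch(
    X_demo,
    lambda Xs: KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs),
    n_resamples=5,
    rng=np.random.default_rng(1),
)
```

Using the ARI for the comparison means that label switching between fits does not count as instability.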

geomexp.evaluation.metrics.radius_ratio(X, centers, assignments)

Per-cluster radius ratio: max distance / median distance to centroid.

A large ratio indicates elongated or “tendril”-like cluster capture regions.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
centers (ndarray[tuple[Any, ...], dtype[double]]) – Cluster centres of shape (n_clusters, n_features).
assignments (ndarray[tuple[Any, ...], dtype[int_]]) – Cluster labels of shape (n_samples,).

Return type:

ndarray[tuple[Any, ...], dtype[double]]

Returns:

Array of shape (n_clusters,) with the radius ratio for each cluster.
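A minimal sketch of the max-over-median computation, assuming Euclidean distances. How the package treats empty clusters or a zero median is not stated, so returning NaN in those cases is an assumption of this example, and `radius_ratio_sketch` is a hypothetical name.

```python
import numpy as np


def radius_ratio_sketch(X, centers, assignments):
    """Per-cluster max / median distance to the centre (sketch)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    assignments = np.asarray(assignments)
    ratios = np.full(len(centers), np.nan)
    for k, c in enumerate(centers):
        d = np.linalg.norm(X[assignments == k] - c, axis=1)
        # Assumption: leave NaN for empty clusters or a zero median radius.
        if d.size and np.median(d) > 0:
            ratios[k] = d.max() / np.median(d)
    return ratios


X_demo = np.array([
    [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 5.0],  # cluster 0 with one far point
    [11.0, 0.0], [9.0, 0.0], [10.0, 1.0],             # cluster 1, all at radius 1
])
centers_demo = np.array([[0.0, 0.0], [10.0, 0.0]])
labels_demo = np.array([0, 0, 0, 0, 1, 1, 1])
ratios = radius_ratio_sketch(X_demo, centers_demo, labels_demo)
```

Cluster 0 has distances 1, 1, 1, 5 (median 1, max 5), giving a ratio of 5; cluster 1 is perfectly round and gives 1.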

geomexp.evaluation.metrics.variation_of_information(y_true, y_pred)

Variation of Information between two partitions.

Defined as \(\mathrm{VI}(U, V) = H(U) + H(V) - 2\,I(U, V)\), using natural logs.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

Non-negative VI, with 0 indicating identical partitions.
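The formula above can be evaluated from the joint label distribution. The following is an illustrative NumPy sketch using natural logs as stated; `vi_sketch` is a hypothetical name and the clamping of tiny negative round-off to zero is a choice made for this example.

```python
import numpy as np


def vi_sketch(y_true, y_pred):
    """VI(U, V) = H(U) + H(V) - 2 I(U, V), natural logs (sketch)."""
    _, ti = np.unique(np.asarray(y_true), return_inverse=True)
    _, pi = np.unique(np.asarray(y_pred), return_inverse=True)

    # Joint distribution p(u, v) estimated from label co-occurrence counts.
    joint = np.zeros((ti.max() + 1, pi.max() + 1))
    np.add.at(joint, (ti, pi), 1.0)
    joint /= joint.sum()
    pu = joint.sum(axis=1)
    pv = joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    # Mutual information, summing only over cells with non-zero mass.
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pu, pv)[nz]))

    # Clamp floating-point round-off so the result is never negative.
    return max(0.0, float(entropy(pu) + entropy(pv) - 2.0 * mi))


vi_same = vi_sketch([0, 0, 1, 1], [1, 1, 0, 0])         # relabelled copy
vi_independent = vi_sketch([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

For the independent pair the mutual information is zero, so the result is H(U) + H(V) = 2 ln 2.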

geomexp.evaluation.metrics.misclassification_error(y_true, y_pred)

Minimum misclassification error under optimal label permutation.

Uses the Hungarian algorithm to find the permutation of predicted labels that maximises agreement with the ground truth.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

Misclassification rate in \([0, 1]\) (0 = perfect agreement).
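The label-matching step can be sketched with SciPy's `linear_sum_assignment`, which solves the assignment problem on the contingency table. This is an illustration, not the package's code; `misclassification_sketch` is a hypothetical name, and treating unmatched clusters (when the two partitions have different numbers of labels) as errors is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def misclassification_sketch(y_true, y_pred):
    """Error rate after the best one-to-one label matching (sketch)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    t_vals, ti = np.unique(y_true, return_inverse=True)
    p_vals, pi = np.unique(y_pred, return_inverse=True)

    # table[i, j] = number of points with true label i and predicted label j.
    table = np.zeros((len(t_vals), len(p_vals)), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)

    # Pick the one-to-one matching of labels that maximises total agreement.
    rows, cols = linear_sum_assignment(table, maximize=True)
    agreed = table[rows, cols].sum()
    return float(1.0 - agreed / y_true.size)


# Best matching is true 0 -> pred 1 and true 1 -> pred 0: 5 of 6 points agree.
err = misclassification_sketch([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
```

Unlike plain accuracy, the result does not depend on which integer happens to name each predicted cluster.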

geomexp.evaluation.metrics.run_methods(X, methods, n_inits=20, base_seed=0)

Fit several clustering methods, keeping the best-of-n_inits run.

Each entry in methods is a dict with keys "name", "cls", and "kwargs"; kwargs is passed to the cls constructor. Each method is re-initialised n_inits times, with a different random_state per restart, and the run with the lowest objective value is kept.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array.
methods (list[dict[str, object]]) – List of method specifications.
n_inits (int) – Number of random restarts per method.
base_seed (int) – Base random seed (incremented per restart).

Return type:

dict[str, ClusterResult]

Returns:

Dict mapping method name to its best ClusterResult.
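The best-of-n restart loop might be structured as below. Everything except the "name" / "cls" / "kwargs" keys and the `random_state` seeding is an assumption: `ToyResult` and `ToyKMeans` are hypothetical stand-ins for `ClusterResult` and a real clusterer, the `objective` attribute name is invented for the example, and seeds are taken to be `base_seed + i`.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyResult:
    """Hypothetical stand-in for ClusterResult."""
    labels: np.ndarray
    objective: float


class ToyKMeans:
    """Hypothetical minimal k-means, used only to drive the sketch."""

    def __init__(self, n_clusters, random_state=0, n_iter=10):
        self.n_clusters = n_clusters
        self.random_state = random_state
        self.n_iter = n_iter

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        centers = X[rng.choice(len(X), size=self.n_clusters, replace=False)].copy()
        for _ in range(self.n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
            labels = d.argmin(axis=1)
            for k in range(self.n_clusters):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        # Objective: within-cluster sum of squared distances.
        return ToyResult(d.argmin(axis=1), float((d.min(axis=1) ** 2).sum()))


def run_methods_sketch(X, methods, n_inits=20, base_seed=0):
    """Keep, per method, the restart with the lowest objective (sketch)."""
    best = {}
    for spec in methods:
        for i in range(n_inits):
            # Assumption: restart i uses seed base_seed + i.
            model = spec["cls"](random_state=base_seed + i, **spec["kwargs"])
            result = model.fit(X)
            current = best.get(spec["name"])
            if current is None or result.objective < current.objective:
                best[spec["name"]] = result
    return best


demo_rng = np.random.default_rng(0)
X_demo = np.vstack([
    demo_rng.normal(0.0, 0.3, size=(20, 2)),
    demo_rng.normal(4.0, 0.3, size=(20, 2)),
])
methods_demo = [{"name": "km", "cls": ToyKMeans, "kwargs": {"n_clusters": 2}}]
best = run_methods_sketch(X_demo, methods_demo, n_inits=5)
```

Deriving each restart's seed from `base_seed` keeps the whole comparison reproducible while still varying the initialisation.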

Module contents

Evaluation utilities for clustering experiments.

geomexp.evaluation.adjusted_rand_index(y_true, y_pred)

Adjusted Rand Index between two partitions.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

ARI in \([-1, 1]\) (1 = perfect agreement; values near 0 indicate chance-level agreement).

geomexp.evaluation.davies_bouldin(X, labels)

Davies–Bouldin index (lower is better).

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
labels (ndarray[tuple[Any, ...], dtype[int_]]) – Cluster labels.

Return type:

float

Returns:

Davies–Bouldin score (non-negative; 0 is ideal).

geomexp.evaluation.misclassification_error(y_true, y_pred)

Minimum misclassification error under optimal label permutation.

Uses the Hungarian algorithm to find the permutation of predicted labels that maximises agreement with the ground truth.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

Misclassification rate in \([0, 1]\) (0 = perfect agreement).

geomexp.evaluation.normalized_mutual_info(y_true, y_pred)

Normalised Mutual Information between two partitions.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

NMI in \([0, 1]\) (1 = perfect agreement).

geomexp.evaluation.radius_ratio(X, centers, assignments)

Per-cluster radius ratio: max distance / median distance to centroid.

A large ratio indicates elongated or “tendril”-like cluster capture regions.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
centers (ndarray[tuple[Any, ...], dtype[double]]) – Cluster centres of shape (n_clusters, n_features).
assignments (ndarray[tuple[Any, ...], dtype[int_]]) – Cluster labels of shape (n_samples,).

Return type:

ndarray[tuple[Any, ...], dtype[double]]

Returns:

Array of shape (n_clusters,) with the radius ratio for each cluster.

geomexp.evaluation.run_methods(X, methods, n_inits=20, base_seed=0)

Fit several clustering methods, keeping the best-of-n_inits run.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array.
methods (list[dict[str, object]]) – List of method specifications.
n_inits (int) – Number of random restarts per method.
base_seed (int) – Base random seed (incremented per restart).

Return type:

dict[str, ClusterResult]

Returns:

Dict mapping method name to its best ClusterResult.

geomexp.evaluation.silhouette(X, labels)

Mean silhouette coefficient.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
labels (ndarray[tuple[Any, ...], dtype[int_]]) – Cluster labels.

Return type:

float

Returns:

Silhouette score in \([-1, 1]\) (higher = better-separated clusters).

geomexp.evaluation.stability_score(X, clusterer_factory, n_resamples=20, subsample_frac=0.8, rng=None)

Bootstrap stability of a clustering method.

Repeatedly draws sub-samples of the data, fits a fresh clusterer on each, and computes the ARI between every pair of fits, restricted to the points that both sub-samples share. The score is the mean of these pairwise ARIs.

Parameters:

X (ndarray[tuple[Any, ...], dtype[double]]) – Data array of shape (n_samples, n_features).
clusterer_factory (Callable[[], object]) – Zero-argument callable returning a fresh clusterer instance (must have a .fit(X) method returning a ClusterResult).
n_resamples (int) – Number of bootstrap resamples.
subsample_frac (float) – Fraction of data to draw per resample.
rng (Generator | None) – Optional random generator.

Return type:

float

Returns:

Mean pairwise ARI across resamples (higher = more stable).

geomexp.evaluation.variation_of_information(y_true, y_pred)

Variation of Information between two partitions.

Defined as \(\mathrm{VI}(U, V) = H(U) + H(V) - 2\,I(U, V)\), using natural logs.

Parameters:

y_true (ndarray[tuple[Any, ...], dtype[int_]]) – Ground-truth cluster labels.
y_pred (ndarray[tuple[Any, ...], dtype[int_]]) – Predicted cluster labels.

Return type:

float

Returns:

Non-negative VI, with 0 indicating identical partitions.