geomexp.clustering package¶
Submodules¶
geomexp.clustering.clustering_algorithms module¶
Clustering algorithm implementations.
This module provides K-means and geometric expectile clustering following Algorithm 1 of the thesis. The expectile algorithm supports pluggable index vector strategies, tie-breaking rules, and empty-cluster policies.
- class geomexp.clustering.clustering_algorithms.KMeans(n_clusters, max_iter=100, tol=0.0001, random_state=None)[source]¶
Bases:
IterativeClustererK-means clustering algorithm.
Partitions data into \(K\) clusters by minimising the within-cluster sum of squared distances via Lloyd’s alternating minimisation:
\[J = \sum_{k=1}^K \sum_{i \in C_k} \|x_i - m_k\|^2\]where \(C_k\) denotes the set of points assigned to cluster \(k\), and \(m_k\) is the cluster mean.
- n_clusters¶
Number of clusters to form.
- max_iter¶
Maximum number of iterations.
- tol¶
Convergence tolerance on objective change.
- random_state¶
Random seed for reproducibility.
Example
>>> import numpy as np >>> X = np.random.randn(100, 2) >>> kmeans = KMeans(n_clusters=3, random_state=42) >>> result = kmeans.fit(X) >>> print(result.assignments.shape) (100,)
- class geomexp.clustering.clustering_algorithms.GeometricExpectileClustering(n_clusters, index_radius=0.5, index_strategy=None, tie_break_rule=None, empty_cluster_rule=None, geometry=None, n_init=10, max_iter=100, tol=1e-07, center_lr=0.1, center_steps=50, random_state=None)[source]¶
Bases:
BaseClustererGeometric expectile clustering with directional asymmetry (Algorithm 1).
A Lloyd-type block coordinate descent algorithm that minimises the empirical risk based on the geometric expectile loss:
\[\hat{\mathcal{R}}(\Pi, C, U) = \frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n \ell_{u_k}(X_i - c_k)\,\mathbb{1}_{\{X_i \in C_k\}}, \quad \|u_k\| = r \;\forall k.\]Each iteration consists of:
Assignment – assign each point to the cluster minimising \(\ell_{u_k}(X_i - c_k)\).
Empty-cluster handling – reinitialise empty clusters, then re-assign.
Centroid update – for each cluster, minimise the cluster-wise risk over \(c\) via gradient descent.
Index update – update \(U\) via the chosen
IndexStrategy.
Setting \(r = 0\) recovers standard K-means.
- Parameters:
n_clusters (int)
index_radius (float)
index_strategy (IndexStrategy | None)
tie_break_rule (TieBreakRule | None)
empty_cluster_rule (EmptyClusterRule | None)
geometry (HilbertGeometry | None)
n_init (int)
max_iter (int)
tol (float)
center_lr (float)
center_steps (int)
random_state (int | None)
- n_clusters¶
Number of clusters.
- index_radius¶
Radius constraint \(r \in [0, 1)\).
- index_strategy¶
Strategy for index vector selection.
- tie_break_rule¶
Rule for breaking ties in assignment.
- empty_cluster_rule¶
Policy for handling empty clusters.
- n_init¶
Number of random restarts (best result is kept).
- max_iter¶
Maximum number of outer iterations.
- tol¶
Convergence tolerance on objective change.
- center_lr¶
Learning rate for gradient descent on centers.
- center_steps¶
Number of gradient steps per centroid update.
- random_state¶
Random seed.
Example
>>> import numpy as np >>> X = np.random.randn(100, 2) >>> model = GeometricExpectileClustering(n_clusters=3, index_radius=0.5) >>> result = model.fit(X) >>> print(result.metadata["indices"].shape) (3, 2)
- fit(X)[source]¶
Fit geometric expectile clustering with multiple random restarts.
Runs
n_initindependent trials with different random seeds and returns the result with the lowest objective.- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).- Return type:
- Returns:
ClusterResult from the best restart.
geomexp.clustering.clustering_base module¶
Base classes and utilities for clustering algorithms.
This module provides abstract base classes, common utilities, and the geometric expectile loss function for implementing clustering algorithms with a consistent interface.
- geomexp.clustering.clustering_base.expectile_loss(residuals, index_vectors, geometry=None)[source]¶
Compute the geometric expectile loss.
The loss function is:
\[\ell_u(t) = \frac{1}{2}\bigl(\|t\|_H^2 + \|t\|_H\,\langle u, t \rangle_H\bigr)\]for \((t, u) \in \mathcal{X} \times \mathcal{B}\), as introduced by Herrmann et al. (2018). When
geometryisNonethe standard Euclidean inner product is used.- Parameters:
residuals (
ndarray) – Residual vectors \(t = x - c\), shape(n, d)or(d,).index_vectors (
ndarray) – Index vector \(u\), broadcastable toresiduals. Shape(d,)for a single shared index, or(n, d)for per-sample indices.geometry (
HilbertGeometry|None) – Hilbert space geometry.None(default) uses Euclidean.
- Return type:
- Returns:
Loss values of shape
(n,)(or scalar if input was 1-D).
- class geomexp.clustering.clustering_base.ClusterResult(assignments, centers, objective, n_iterations, converged=True, metadata=None)[source]¶
Bases:
objectContainer for clustering results.
- Parameters:
- assignments¶
Array of shape
(n_samples,)containing cluster indices for each point.
- centers¶
Array of shape
(n_clusters, n_features)containing cluster centers.
- objective¶
Final objective function value achieved by the algorithm.
- n_iterations¶
Number of iterations performed until convergence.
- converged¶
Whether the algorithm converged within
max_iteriterations.
- metadata¶
Optional dictionary containing algorithm-specific information.
- class geomexp.clustering.clustering_base.BaseClusterer(n_clusters, max_iter=100, tol=1e-07, random_state=None)[source]¶
Bases:
ABCAbstract base class for clustering algorithms.
Defines the interface that all clustering algorithms must implement. Subclasses provide
_initialize,_fit_iteration,_compute_objective, and_extract_result.The
fitmethod contains the outer optimisation loop with convergence checking based on objective change and an optional partition stability hook.- n_clusters¶
Number of clusters to form.
- max_iter¶
Maximum number of iterations to perform.
- tol¶
Convergence tolerance for objective function change.
- random_state¶
Random seed for reproducibility.
- fit(X)[source]¶
Fit the clustering algorithm to data.
The outer loop follows Algorithm 1 from the thesis: iterate until the objective decrease falls below
tol, an additional convergence criterion is met, ormax_iteriterations are reached.- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).- Return type:
- Returns:
ClusterResult containing assignments, centers, and convergence info.
- Raises:
ValueError – If
Xhas fewer samples thann_clustersor invalid shape.
- class geomexp.clustering.clustering_base.IterativeClusterer(n_clusters, max_iter=100, tol=1e-07, random_state=None)[source]¶
Bases:
BaseClustererBase class for iterative clustering algorithms with assignment-update steps.
Provides a template for Lloyd-type algorithms that alternate between assigning points to clusters, handling empty clusters, and updating cluster parameters.
geomexp.clustering.geometry module¶
Hilbert space geometry for clustering algorithms.
Provides abstractions for inner products and norms so that the same clustering algorithms work unchanged in Euclidean space, weighted \(L^2\) (functional data on grids), or general Gram-matrix spaces (basis representations).
The three concrete geometries cover the common cases:
EuclideanGeometry– standard dot product (identity weights).WeightedEuclideanGeometry– diagonal quadrature weights, approximating \(\langle f, g \rangle_{L^2} = \int f(t)\,g(t)\,\mathrm{d}t\).GramGeometry– full Gram matrix \(G_{ij} = \langle \varphi_i, \varphi_j \rangle\) for basis coefficient representations.
- class geomexp.clustering.geometry.HilbertGeometry[source]¶
Bases:
ABCAbstract Hilbert space geometry defining inner product and norm.
All clustering geometry (assignment costs, center gradients, index strategies) flows through this interface, ensuring that no part of the algorithm accidentally falls back to Euclidean operations.
- abstractmethod inner(a, b)[source]¶
Compute the inner product \(\langle a, b \rangle_H\) along the last axis.
- class geomexp.clustering.geometry.EuclideanGeometry[source]¶
Bases:
HilbertGeometryStandard Euclidean geometry: \(\langle a, b \rangle = \sum_i a_i b_i\).
Equivalent to identity weights. This is the default geometry when none is specified.
- class geomexp.clustering.geometry.WeightedEuclideanGeometry(weights)[source]¶
Bases:
HilbertGeometryDiagonal-weighted geometry: \(\langle a, b \rangle_w = \sum_i w_i\, a_i\, b_i\).
Suitable for functional data discretised on a grid with quadrature weights \(w_i\), approximating
\[\langle f, g \rangle_{L^2} = \int f(t)\, g(t)\, \mathrm{d}t \;\approx\; \sum_i w_i\, f(t_i)\, g(t_i).\]- Parameters:
weights (
ndarray) – Positive quadrature weight array of shape(d,).
- class geomexp.clustering.geometry.GramGeometry(gram_matrix)[source]¶
Bases:
HilbertGeometryGeneral Gram-matrix geometry: \(\langle a, b \rangle_G = a^\top G\, b\).
Suitable for basis coefficient representations where \(G_{ij} = \langle \varphi_i, \varphi_j \rangle\).
- Parameters:
gram_matrix (
ndarray) – Symmetric positive-definite matrix of shape(d, d).
geomexp.clustering.index_strategies module¶
Index vector strategies for geometric expectile clustering.
The index vectors \((u_1, \ldots, u_K)\) control the directional asymmetry of the geometric expectile loss and hence the cluster geometry. This module provides several strategies for choosing these vectors, as described in Section 2.1.2 of the thesis.
- class geomexp.clustering.index_strategies.IndexStrategy[source]¶
Bases:
ABCAbstract base class for index vector selection strategies.
Index vectors \(u_k\) with \(\|u_k\|_H = r\) determine the directional asymmetry of the geometric expectile loss. Different strategies offer different trade-offs between flexibility and the need for prior knowledge.
- abstractmethod initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- abstractmethod update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.index_strategies.GlobalIndexStrategy(direction, geometry=None)[source]¶
Bases:
IndexStrategyFixed single global index vector shared by all clusters.
Sets \(u_1 = \cdots = u_K = u\) for a given direction \(u\). Appropriate when a meaningful asymmetry direction is known a priori; one implicitly assumes a common anisotropy mechanism across clusters.
The provided direction is normalised to have Hilbert norm equal to \(r\).
- Parameters:
direction (np.ndarray)
geometry (HilbertGeometry | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.index_strategies.ClusterSpecificIndexStrategy(directions, geometry=None)[source]¶
Bases:
IndexStrategyFixed per-cluster index vectors.
Distinct fixed directions are prescribed for different clusters, each normalised to \(\|u_k\|_H = r\). Appropriate when prior information about cluster-specific asymmetry directions is available.
Zero-norm rows in the provided directions result in \(u_k = 0\).
- Parameters:
directions (np.ndarray)
geometry (HilbertGeometry | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.index_strategies.BestResponseIndexStrategy(geometry=None)[source]¶
Bases:
IndexStrategyAdaptive best-response index vectors (equation 2.2).
Each index vector is set to minimise the cluster-wise risk for fixed partition and centroids. The closed-form solution in a Hilbert space \(H\) is:
\[u_k^* = -r \frac{s_k}{\|s_k\|_H}, \quad s_k = \sum_{i : X_i \in C_k} \|X_i - c_k\|_H \, (X_i - c_k).\]This captures the length-weighted residual direction (skew) of the cluster. When \(s_k = 0\), the current index is retained.
- Parameters:
geometry (HilbertGeometry | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.index_strategies.CustomIndexStrategy(update_fn, init_fn=None)[source]¶
Bases:
IndexStrategyUser-provided callable for custom index vector updates.
Allows full flexibility by accepting arbitrary callables for initialisation and update of index vectors.
The update callable should have the signature:
def update_fn( X: np.ndarray, # (n_samples, n_features) assignments: np.ndarray, # (n_samples,) centers: np.ndarray, # (n_clusters, n_features) indices: np.ndarray, # (n_clusters, n_features) index_radius: float, ) -> np.ndarray: # (n_clusters, n_features)
The optional init callable should have the signature:
def init_fn( n_clusters: int, n_features: int, index_radius: float, rng: np.random.RandomState, ) -> np.ndarray: # (n_clusters, n_features)
- Parameters:
update_fn (Callable[..., np.ndarray])
init_fn (Callable[..., np.ndarray] | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
geomexp.clustering.policies module¶
Tie-breaking and empty cluster policies for clustering algorithms.
This module provides pluggable policy objects that control tie-breaking during cluster assignment and the handling of empty clusters.
- class geomexp.clustering.policies.TieBreakRule[source]¶
Bases:
ABCAbstract base class for tie-breaking rules in cluster assignment.
When a data point has equal cost to multiple clusters, the tie-break rule determines which cluster is chosen.
- class geomexp.clustering.policies.LowestIndexTieBreak[source]¶
Bases:
TieBreakRuleBreak ties by choosing the lowest cluster index.
This is the default behavior of
np.argminand provides deterministic, reproducible assignment independent of the random state.
- class geomexp.clustering.policies.RandomTieBreak[source]¶
Bases:
TieBreakRuleBreak ties uniformly at random among clusters achieving minimal cost.
- class geomexp.clustering.policies.EmptyClusterRule[source]¶
Bases:
ABCAbstract base class for handling empty clusters.
After the assignment step, some clusters may become empty. This policy determines how to reinitialize their centers and index vectors so that the subsequent reassignment can populate them.
- class geomexp.clustering.policies.RandomReinitRule[source]¶
Bases:
EmptyClusterRuleReinitialize each empty cluster center to a randomly chosen data point.
Index vectors for reinitialized clusters are reset to zero (symmetric). This is the simplest and most common approach.
- class geomexp.clustering.policies.FarthestPointRule(geometry=None)[source]¶
Bases:
EmptyClusterRuleReinitialize each empty cluster to the point farthest from its assigned center.
Distances are measured in the Hilbert geometry provided at construction (Euclidean by default). This tends to spread clusters apart and can help escape degenerate local optima more aggressively than random reinitialization.
- Parameters:
geometry (HilbertGeometry | None)
Module contents¶
Clustering algorithms module.
- class geomexp.clustering.BaseClusterer(n_clusters, max_iter=100, tol=1e-07, random_state=None)[source]¶
Bases:
ABCAbstract base class for clustering algorithms.
Defines the interface that all clustering algorithms must implement. Subclasses provide
_initialize,_fit_iteration,_compute_objective, and_extract_result.The
fitmethod contains the outer optimisation loop with convergence checking based on objective change and an optional partition stability hook.- n_clusters¶
Number of clusters to form.
- max_iter¶
Maximum number of iterations to perform.
- tol¶
Convergence tolerance for objective function change.
- random_state¶
Random seed for reproducibility.
- fit(X)[source]¶
Fit the clustering algorithm to data.
The outer loop follows Algorithm 1 from the thesis: iterate until the objective decrease falls below
tol, an additional convergence criterion is met, ormax_iteriterations are reached.- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).- Return type:
- Returns:
ClusterResult containing assignments, centers, and convergence info.
- Raises:
ValueError – If
Xhas fewer samples thann_clustersor invalid shape.
- class geomexp.clustering.BestResponseIndexStrategy(geometry=None)[source]¶
Bases:
IndexStrategyAdaptive best-response index vectors (equation 2.2).
Each index vector is set to minimise the cluster-wise risk for fixed partition and centroids. The closed-form solution in a Hilbert space \(H\) is:
\[u_k^* = -r \frac{s_k}{\|s_k\|_H}, \quad s_k = \sum_{i : X_i \in C_k} \|X_i - c_k\|_H \, (X_i - c_k).\]This captures the length-weighted residual direction (skew) of the cluster. When \(s_k = 0\), the current index is retained.
- Parameters:
geometry (HilbertGeometry | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.ClusterResult(assignments, centers, objective, n_iterations, converged=True, metadata=None)[source]¶
Bases:
objectContainer for clustering results.
- Parameters:
- assignments¶
Array of shape
(n_samples,)containing cluster indices for each point.
- centers¶
Array of shape
(n_clusters, n_features)containing cluster centers.
- objective¶
Final objective function value achieved by the algorithm.
- n_iterations¶
Number of iterations performed until convergence.
- converged¶
Whether the algorithm converged within
max_iteriterations.
- metadata¶
Optional dictionary containing algorithm-specific information.
- class geomexp.clustering.ClusterSpecificIndexStrategy(directions, geometry=None)[source]¶
Bases:
IndexStrategyFixed per-cluster index vectors.
Distinct fixed directions are prescribed for different clusters, each normalised to \(\|u_k\|_H = r\). Appropriate when prior information about cluster-specific asymmetry directions is available.
Zero-norm rows in the provided directions result in \(u_k = 0\).
- Parameters:
directions (np.ndarray)
geometry (HilbertGeometry | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.CustomIndexStrategy(update_fn, init_fn=None)[source]¶
Bases:
IndexStrategyUser-provided callable for custom index vector updates.
Allows full flexibility by accepting arbitrary callables for initialisation and update of index vectors.
The update callable should have the signature:
def update_fn( X: np.ndarray, # (n_samples, n_features) assignments: np.ndarray, # (n_samples,) centers: np.ndarray, # (n_clusters, n_features) indices: np.ndarray, # (n_clusters, n_features) index_radius: float, ) -> np.ndarray: # (n_clusters, n_features)
The optional init callable should have the signature:
def init_fn( n_clusters: int, n_features: int, index_radius: float, rng: np.random.RandomState, ) -> np.ndarray: # (n_clusters, n_features)
- Parameters:
update_fn (Callable[..., np.ndarray])
init_fn (Callable[..., np.ndarray] | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.EmptyClusterRule[source]¶
Bases:
ABCAbstract base class for handling empty clusters.
After the assignment step, some clusters may become empty. This policy determines how to reinitialize their centers and index vectors so that the subsequent reassignment can populate them.
- class geomexp.clustering.EuclideanGeometry[source]¶
Bases:
HilbertGeometryStandard Euclidean geometry: \(\langle a, b \rangle = \sum_i a_i b_i\).
Equivalent to identity weights. This is the default geometry when none is specified.
- class geomexp.clustering.FarthestPointRule(geometry=None)[source]¶
Bases:
EmptyClusterRuleReinitialize each empty cluster to the point farthest from its assigned center.
Distances are measured in the Hilbert geometry provided at construction (Euclidean by default). This tends to spread clusters apart and can help escape degenerate local optima more aggressively than random reinitialization.
- Parameters:
geometry (HilbertGeometry | None)
- class geomexp.clustering.GeometricExpectileClustering(n_clusters, index_radius=0.5, index_strategy=None, tie_break_rule=None, empty_cluster_rule=None, geometry=None, n_init=10, max_iter=100, tol=1e-07, center_lr=0.1, center_steps=50, random_state=None)[source]¶
Bases:
BaseClustererGeometric expectile clustering with directional asymmetry (Algorithm 1).
A Lloyd-type block coordinate descent algorithm that minimises the empirical risk based on the geometric expectile loss:
\[\hat{\mathcal{R}}(\Pi, C, U) = \frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n \ell_{u_k}(X_i - c_k)\,\mathbb{1}_{\{X_i \in C_k\}}, \quad \|u_k\| = r \;\forall k.\]Each iteration consists of:
Assignment – assign each point to the cluster minimising \(\ell_{u_k}(X_i - c_k)\).
Empty-cluster handling – reinitialise empty clusters, then re-assign.
Centroid update – for each cluster, minimise the cluster-wise risk over \(c\) via gradient descent.
Index update – update \(U\) via the chosen
IndexStrategy.
Setting \(r = 0\) recovers standard K-means.
- Parameters:
n_clusters (int)
index_radius (float)
index_strategy (IndexStrategy | None)
tie_break_rule (TieBreakRule | None)
empty_cluster_rule (EmptyClusterRule | None)
geometry (HilbertGeometry | None)
n_init (int)
max_iter (int)
tol (float)
center_lr (float)
center_steps (int)
random_state (int | None)
- n_clusters¶
Number of clusters.
- index_radius¶
Radius constraint \(r \in [0, 1)\).
- index_strategy¶
Strategy for index vector selection.
- tie_break_rule¶
Rule for breaking ties in assignment.
- empty_cluster_rule¶
Policy for handling empty clusters.
- n_init¶
Number of random restarts (best result is kept).
- max_iter¶
Maximum number of outer iterations.
- tol¶
Convergence tolerance on objective change.
- center_lr¶
Learning rate for gradient descent on centers.
- center_steps¶
Number of gradient steps per centroid update.
- random_state¶
Random seed.
Example
>>> import numpy as np >>> X = np.random.randn(100, 2) >>> model = GeometricExpectileClustering(n_clusters=3, index_radius=0.5) >>> result = model.fit(X) >>> print(result.metadata["indices"].shape) (3, 2)
- fit(X)[source]¶
Fit geometric expectile clustering with multiple random restarts.
Runs
n_initindependent trials with different random seeds and returns the result with the lowest objective.- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).- Return type:
- Returns:
ClusterResult from the best restart.
- class geomexp.clustering.GlobalIndexStrategy(direction, geometry=None)[source]¶
Bases:
IndexStrategyFixed single global index vector shared by all clusters.
Sets \(u_1 = \cdots = u_K = u\) for a given direction \(u\). Appropriate when a meaningful asymmetry direction is known a priori; one implicitly assumes a common anisotropy mechanism across clusters.
The provided direction is normalised to have Hilbert norm equal to \(r\).
- Parameters:
direction (np.ndarray)
geometry (HilbertGeometry | None)
- initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.GramGeometry(gram_matrix)[source]¶
Bases:
HilbertGeometryGeneral Gram-matrix geometry: \(\langle a, b \rangle_G = a^\top G\, b\).
Suitable for basis coefficient representations where \(G_{ij} = \langle \varphi_i, \varphi_j \rangle\).
- Parameters:
gram_matrix (
ndarray) – Symmetric positive-definite matrix of shape(d, d).
- class geomexp.clustering.HilbertGeometry[source]¶
Bases:
ABCAbstract Hilbert space geometry defining inner product and norm.
All clustering geometry (assignment costs, center gradients, index strategies) flows through this interface, ensuring that no part of the algorithm accidentally falls back to Euclidean operations.
- abstractmethod inner(a, b)[source]¶
Compute the inner product \(\langle a, b \rangle_H\) along the last axis.
- class geomexp.clustering.IndexStrategy[source]¶
Bases:
ABCAbstract base class for index vector selection strategies.
Index vectors \(u_k\) with \(\|u_k\|_H = r\) determine the directional asymmetry of the geometric expectile loss. Different strategies offer different trade-offs between flexibility and the need for prior knowledge.
- abstractmethod initialize(n_clusters, n_features, index_radius, rng)[source]¶
Create initial index vectors.
- Parameters:
n_clusters (
int) – Number of clusters.n_features (
int) – Dimensionality of the data.index_radius (
float) – Radius constraint \(r\) with \(\|u_k\|_H = r\).rng (
RandomState) – Random state for stochastic initialization.
- Return type:
- Returns:
Array of shape
(n_clusters, n_features).
- abstractmethod update(X, assignments, centers, indices, index_radius)[source]¶
Update index vectors given the current clustering state.
- Parameters:
X (
ndarray) – Data array of shape(n_samples, n_features).assignments (
ndarray) – Cluster assignments of shape(n_samples,).centers (
ndarray) – Cluster centers of shape(n_clusters, n_features).indices (
ndarray) – Current index vectors of shape(n_clusters, n_features).index_radius (
float) – Radius constraint \(r\).
- Return type:
- Returns:
Updated index vectors of shape
(n_clusters, n_features).
- class geomexp.clustering.IterativeClusterer(n_clusters, max_iter=100, tol=1e-07, random_state=None)[source]¶
Bases:
BaseClustererBase class for iterative clustering algorithms with assignment-update steps.
Provides a template for Lloyd-type algorithms that alternate between assigning points to clusters, handling empty clusters, and updating cluster parameters.
- class geomexp.clustering.KMeans(n_clusters, max_iter=100, tol=0.0001, random_state=None)[source]¶
Bases:
IterativeClustererK-means clustering algorithm.
Partitions data into \(K\) clusters by minimising the within-cluster sum of squared distances via Lloyd’s alternating minimisation:
\[J = \sum_{k=1}^K \sum_{i \in C_k} \|x_i - m_k\|^2\]where \(C_k\) denotes the set of points assigned to cluster \(k\), and \(m_k\) is the cluster mean.
- n_clusters¶
Number of clusters to form.
- max_iter¶
Maximum number of iterations.
- tol¶
Convergence tolerance on objective change.
- random_state¶
Random seed for reproducibility.
Example
>>> import numpy as np >>> X = np.random.randn(100, 2) >>> kmeans = KMeans(n_clusters=3, random_state=42) >>> result = kmeans.fit(X) >>> print(result.assignments.shape) (100,)
- class geomexp.clustering.LowestIndexTieBreak[source]¶
Bases:
TieBreakRuleBreak ties by choosing the lowest cluster index.
This is the default behavior of
np.argminand provides deterministic, reproducible assignment independent of the random state.
- class geomexp.clustering.RandomReinitRule[source]¶
Bases:
EmptyClusterRuleReinitialize each empty cluster center to a randomly chosen data point.
Index vectors for reinitialized clusters are reset to zero (symmetric). This is the simplest and most common approach.
- class geomexp.clustering.RandomTieBreak[source]¶
Bases:
TieBreakRuleBreak ties uniformly at random among clusters achieving minimal cost.
- class geomexp.clustering.TieBreakRule[source]¶
Bases:
ABCAbstract base class for tie-breaking rules in cluster assignment.
When a data point has equal cost to multiple clusters, the tie-break rule determines which cluster is chosen.
- class geomexp.clustering.WeightedEuclideanGeometry(weights)[source]¶
Bases:
HilbertGeometryDiagonal-weighted geometry: \(\langle a, b \rangle_w = \sum_i w_i\, a_i\, b_i\).
Suitable for functional data discretised on a grid with quadrature weights \(w_i\), approximating
\[\langle f, g \rangle_{L^2} = \int f(t)\, g(t)\, \mathrm{d}t \;\approx\; \sum_i w_i\, f(t_i)\, g(t_i).\]- Parameters:
weights (
ndarray) – Positive quadrature weight array of shape(d,).
- geomexp.clustering.expectile_loss(residuals, index_vectors, geometry=None)[source]¶
Compute the geometric expectile loss.
The loss function is:
\[\ell_u(t) = \frac{1}{2}\bigl(\|t\|_H^2 + \|t\|_H\,\langle u, t \rangle_H\bigr)\]for \((t, u) \in \mathcal{X} \times \mathcal{B}\), as introduced by Herrmann et al. (2018). When
geometryisNonethe standard Euclidean inner product is used.- Parameters:
residuals (
ndarray) – Residual vectors \(t = x - c\), shape(n, d)or(d,).index_vectors (
ndarray) – Index vector \(u\), broadcastable toresiduals. Shape(d,)for a single shared index, or(n, d)for per-sample indices.geometry (
HilbertGeometry|None) – Hilbert space geometry.None(default) uses Euclidean.
- Return type:
- Returns:
Loss values of shape
(n,)(or scalar if input was 1-D).