Labelling (Stage D)

HDBSCAN clustering and cluster quality evaluation (Stage D).

Takes UMAP embeddings produced by Stage C and clusters them using HDBSCAN over a parameter grid. For each (embedding, HDBSCAN params) combination, cluster labels and quality metrics are saved to HDF5 files. A PDF summary report ranks parameter combinations by a composite score.

Supported quality metrics

'silhouette': Silhouette coefficient (higher is better; range −1 to 1).
'dbi': Davies–Bouldin index (lower is better).
'ch': Calinski–Harabasz index (higher is better).
'dunn': Dunn index (higher is better).
'nmi': Normalised mutual information against manual labels, when available.
'aic', 'bic', 'log-likelihood': Gaussian mixture information criteria (AIC, BIC) and log-likelihood.

Public API

HDBSCANParams — HDBSCAN hyperparameter dataclass
DEFAULT_HDBSCAN_GRID — default parameter grid used by label_bird()
label_bird() — Stage D entry point
compute_scores() — evaluate cluster quality metrics
cluster_embeddings() — cluster a single embedding array

song_phenotyping.labelling.DEFAULT_HDBSCAN_GRID: List[HDBSCANParams]

Default HDBSCAN parameter grid explored by label_bird(). Combines min_cluster_size in [5, 20, 60] with min_samples in [5, 15] (6 combinations total):

HDBSCANParams(min_cluster_size=5, min_samples=5)
HDBSCANParams(min_cluster_size=5, min_samples=15)
HDBSCANParams(min_cluster_size=20, min_samples=5)
HDBSCANParams(min_cluster_size=20, min_samples=15)
HDBSCANParams(min_cluster_size=60, min_samples=5)
HDBSCANParams(min_cluster_size=60, min_samples=15)
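The grid described above is a Cartesian product of two value lists. A minimal sketch of building such a grid follows; the dataclass is a stand-in that mirrors the documented fields so the snippet runs without the package, and build_grid is a hypothetical helper, not part of song_phenotyping.

```python
from dataclasses import dataclass
from itertools import product


# Stand-in for song_phenotyping.labelling.HDBSCANParams (same fields and
# defaults as documented) so this sketch runs on its own.
@dataclass
class HDBSCANParams:
    min_cluster_size: int = 20
    min_samples: int = 5


def build_grid(min_cluster_sizes, min_samples_values):
    """Hypothetical helper: every combination of the two value lists."""
    return [
        HDBSCANParams(min_cluster_size=mcs, min_samples=ms)
        for mcs, ms in product(min_cluster_sizes, min_samples_values)
    ]


grid = build_grid([5, 20, 60], [5, 15])
print(len(grid))  # 6
```

Because product varies the last argument fastest, the entries come out in the same order as the documented default grid.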

class song_phenotyping.labelling.HDBSCANParams(min_cluster_size=20, min_samples=5)[source]

Bases: object

Hyperparameters for a single HDBSCAN clustering run.

Parameters:

min_cluster_size (int, optional) – Minimum number of points required to form a cluster. Larger values produce fewer, broader clusters. Default is 20.
min_samples (int, optional) – Number of samples in the neighbourhood for a point to be considered a core point. Higher values make the algorithm more conservative (more points classified as noise). Default is 5.

Examples

>>> p = HDBSCANParams(min_cluster_size=10, min_samples=3)
>>> p.to_dict()
{'min_cluster_size': 10, 'min_samples': 3}

classmethod from_dict(data)[source]

Construct a HDBSCANParams from a dictionary.

Parameters:

data (dict) – Must contain min_cluster_size and min_samples.

Return type:

HDBSCANParams

min_cluster_size: int = 20

min_samples: int = 5

to_dict()[source]

Return parameters as a plain dictionary.

Returns:

Dictionary with the keys min_cluster_size and min_samples.

Return type:

dict
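to_dict() and from_dict() are intended to be inverses, which is convenient when parameters are stored alongside results. The sketch below shows the round trip on a stand-in dataclass; the method bodies are assumptions about one straightforward way to implement the documented behaviour, not the package's source.

```python
from dataclasses import dataclass, asdict


@dataclass
class HDBSCANParams:  # stand-in mirroring the documented fields and defaults
    min_cluster_size: int = 20
    min_samples: int = 5

    def to_dict(self):
        # Plain dict with exactly the two documented keys.
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        # Assumption: both keys are required, as the from_dict() docs state,
        # so a missing key raises KeyError instead of using a default.
        return cls(
            min_cluster_size=data["min_cluster_size"],
            min_samples=data["min_samples"],
        )


original = HDBSCANParams(min_cluster_size=10, min_samples=3)
restored = HDBSCANParams.from_dict(original.to_dict())
print(restored == original)  # True
```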

song_phenotyping.labelling.aggregate_raw_scores_across_birds(save_path, birds=None, top_n=10)[source]

Aggregate raw clustering scores across multiple birds for cross-bird comparison.

Parameters:

save_path (str) – Root directory containing bird subdirectories.
birds (list) – Bird names to include. All birds are included when None.
top_n (int) – Number of top parameter combinations to include per bird.

Returns:

Aggregated raw scores with bird identifiers.

Return type:

pd.DataFrame

song_phenotyping.labelling.analyze_parameter_performance_by_sample_size(aggregated_df, sample_size_bins=None, metrics=None)[source]

Analyze how parameter combinations perform across different sample size ranges.

Parameters:

aggregated_df – DataFrame from aggregate_raw_scores_across_birds().
sample_size_bins – List of bin edges for sample size categorization.
metrics – List of metrics to analyze.

Returns:

Analysis results by sample size bins.

Return type:

pd.DataFrame
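To illustrate what categorizing by bin edges can look like, here is a small standard-library sketch. The half-open interval convention and the label format are assumptions made for illustration; the function's actual binning rules are not documented here.

```python
from bisect import bisect_right


def bin_label(n_samples, edges):
    """Hypothetical helper: map a sample count to a half-open bin label.

    Assumes `edges` is sorted ascending and bins are [low, high).
    """
    i = bisect_right(edges, n_samples)
    if i == 0:
        return f"<{edges[0]}"
    if i == len(edges):
        return f">={edges[-1]}"
    return f"[{edges[i - 1]}, {edges[i]})"


edges = [100, 1000]
for n in (50, 100, 999, 1000):
    print(n, bin_label(n, edges))
```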

song_phenotyping.labelling.clear_clustering_outputs(save_path, bird=None, confirm=True)[source]

Clear clustering label files and images to avoid clutter. Works with the current directory structure, which uses syllable_data instead of data.

Parameters:

save_path (str) – Root directory containing bird data.
bird (str) – Specific bird to clear. All birds are cleared when None.
confirm (bool) – Whether to ask for confirmation before deleting.

Returns:

True if successful, False otherwise.

Return type:

bool

song_phenotyping.labelling.cluster_embeddings(embeddings, hashes, method='hdbscan', cluster_params={}, return_scores=False, path_to_clusters=None, path_to_imgs=None, umap_id=None, true_labels=None, metrics=None)[source]

Cluster embeddings using a specified algorithm and hyperparameters.

Parameters:

embeddings (numpy.ndarray) – UMAP embedding coordinates, shape (n_samples, n_components).
hashes (array-like) – Per-sample hash identifiers used to cross-reference syllables across pipeline stages.
method ({'hdbscan', 'kmeans'}, optional) – Clustering algorithm. Default is 'hdbscan'.
cluster_params (dict, optional) – Keyword arguments forwarded directly to the chosen clustering algorithm (e.g. {'min_cluster_size': 10} for HDBSCAN).
return_scores (bool, optional) – When True, evaluate the resulting partition with metrics and include the scores in the return value. Default is False.
path_to_clusters (str or None, optional) – Directory where cluster label HDF5 files are saved. No files are written when None.
path_to_imgs (str or None, optional) – Directory where cluster scatter plots are saved. No images are written when None.
umap_id (str or None, optional) – Short identifier for the UMAP run (embedded in output filenames).
true_labels (numpy.ndarray or None, optional) – Ground-truth syllable labels used for NMI evaluation.
metrics (list of str or None, optional) – Metric names to compute when return_scores is True. See compute_scores() for supported values.

Returns:

labels (numpy.ndarray) – Cluster label per sample (-1 = noise for HDBSCAN).
hashes (numpy.ndarray) – Input hashes unchanged.
scores (dict or None) – Metric scores if return_scores is True, else None.
label_path (str or None) – Path of the saved label HDF5 file, or None if not saved.
plot_path (str or None) – Path of the saved scatter plot, or None if not saved.

song_phenotyping.labelling.compute_composite_score(summary_df, metrics, n_syls=None, weights=None, target_clusters=20, penalty_decay=0.1, use_cluster_penalty=False)[source]

Compute composite scores with normalization and optional cluster count penalty.

Parameters:

summary_df – DataFrame with raw metric scores.
metrics – List of metrics to include in the composite score.
n_syls (list) – Cluster counts. Extracted from summary_df when None.
weights – Weights for each metric. Equal weights are used when None.
target_clusters – Target number of clusters for the penalty function.
penalty_decay – Decay rate for the cluster penalty.
use_cluster_penalty – Whether to apply the cluster count penalty. Default is False.

Returns:

DataFrame with added normalized scores and composite score.

Return type:

pd.DataFrame
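The normalization and penalty formulas are not specified in this reference. The sketch below shows one plausible reading under loudly stated assumptions: min-max normalization per metric, inversion of lower-is-better metrics (here only 'dbi'), a weighted mean, and an exponential penalty on the distance from target_clusters. It works on plain dicts rather than a DataFrame to stay self-contained.

```python
import math

# Assumption: metrics for which a lower raw value is better.
LOWER_IS_BETTER = {"dbi"}


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(rows, metrics, weights=None, n_clusters=None,
                     target_clusters=20, penalty_decay=0.1,
                     use_cluster_penalty=False):
    """Weighted mean of normalized metrics, optionally penalized."""
    weights = weights or {m: 1.0 for m in metrics}
    total_weight = sum(weights[m] for m in metrics)
    normed = {}
    for m in metrics:
        col = min_max([row[m] for row in rows])
        # Flip lower-is-better metrics so that 1.0 is always "best".
        normed[m] = [1.0 - v for v in col] if m in LOWER_IS_BETTER else col
    scores = []
    for i in range(len(rows)):
        score = sum(weights[m] * normed[m][i] for m in metrics) / total_weight
        if use_cluster_penalty:
            # Assumed penalty form: exponential decay with distance from target.
            distance = abs(n_clusters[i] - target_clusters)
            score *= math.exp(-penalty_decay * distance)
        scores.append(score)
    return scores


rows = [
    {"silhouette": 0.2, "dbi": 1.5},
    {"silhouette": 0.6, "dbi": 0.5},
    {"silhouette": 0.4, "dbi": 1.0},
]
scores = composite_scores(rows, ["silhouette", "dbi"])
print([round(s, 3) for s in scores])  # [0.0, 1.0, 0.5]
```

Min-max scaling makes scores relative to the set of runs being compared, so a composite score from one bird is not directly comparable with one from another bird unless both were normalized together.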

song_phenotyping.labelling.compute_cross_bird_composite_scores(aggregated_df, metrics, weights=None)[source]

Compute new composite scores based on raw scores across all birds.

Parameters:

aggregated_df – DataFrame with raw scores from multiple birds.
metrics – List of metrics to include in the composite score.
weights – Metric weights. Equal weights are used when None.

Returns:

DataFrame with cross-bird composite scores.

Return type:

pd.DataFrame

song_phenotyping.labelling.compute_metric_ranking(summary_df, metrics)[source]

Compute individual metric rankings and aggregate rank.

Parameters:

summary_df – DataFrame with metric scores.
metrics – List of metrics to rank.

Returns:

DataFrame with added ranking columns.

Return type:

pd.DataFrame
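Rank aggregation is an alternative to a normalized composite score: each metric ranks the runs independently and the ranks are then combined. The sketch below uses the mean rank as the aggregate and treats 'dbi' as lower-is-better; both choices, and the helper names, are assumptions for illustration, not the package's documented behaviour.

```python
# Assumption: metrics for which a lower raw value is better.
LOWER_IS_BETTER = {"dbi"}


def rank_column(values, higher_is_better=True):
    """Return 1-based ranks (1 = best); ties keep their input order."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def aggregate_rank(rows, metrics):
    """Rank each metric separately, then average the ranks per row."""
    per_metric = {
        m: rank_column([row[m] for row in rows], m not in LOWER_IS_BETTER)
        for m in metrics
    }
    mean_rank = [
        sum(per_metric[m][i] for m in metrics) / len(metrics)
        for i in range(len(rows))
    ]
    return per_metric, mean_rank


rows = [
    {"silhouette": 0.2, "dbi": 1.5},
    {"silhouette": 0.6, "dbi": 0.5},
    {"silhouette": 0.4, "dbi": 0.4},
]
per_metric, mean_rank = aggregate_rank(rows, ["silhouette", "dbi"])
print(per_metric)  # {'silhouette': [3, 1, 2], 'dbi': [3, 2, 1]}
print(mean_rank)   # [3.0, 1.5, 1.5]
```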

song_phenotyping.labelling.compute_scores(embeddings, labels, metrics, true_labels=None)[source]

Compute clustering evaluation metrics for given embeddings and labels.

Parameters:

embeddings (numpy.ndarray) – UMAP embedding coordinates, shape (n_samples, n_components).
labels (list) – Cluster label assigned to each sample.
metrics (list of str) – Names of metrics to compute. Supported values: 'silhouette', 'dbi', 'ch', 'dunn', 'nmi', 'aic', 'bic', 'log-likelihood'.
true_labels (list or None, optional) – Ground-truth syllable labels used for NMI calculation. When None, 'nmi' is set to nan.

Returns:

Mapping of metric name → score. Metrics that cannot be computed (e.g. only one cluster present) are set to numpy.nan.

Return type:

dict

song_phenotyping.labelling.create_cluster_summary_pdf(master_summary_df, bird, save_path, top_n=None)[source]

Create comprehensive PDF report with cluster visualizations and scores.

Parameters:

master_summary_df – DataFrame with all clustering results.
bird (str) – Bird identifier.
save_path (str) – Directory in which to save the PDF.
top_n (int) – Number of top results to include. All results are included when None.

Returns:

True if successful, False otherwise.

Return type:

bool

song_phenotyping.labelling.dunn_index(embeddings, labels)[source]

Compute Dunn index for clustering evaluation. Higher values indicate better clustering (compact clusters, well-separated).

Parameters:

embeddings – UMAP embeddings array.
labels – Cluster labels.

Returns:

Dunn index value.

Return type:

float
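For reference, a common definition of the Dunn index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The pure-Python sketch below implements that variant; skipping the HDBSCAN noise label (-1) and returning nan for fewer than two clusters are assumptions, and the package's implementation may differ.

```python
import math
from itertools import combinations


def dunn_index_sketch(points, labels, noise_label=-1):
    """Min inter-cluster distance divided by max intra-cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != noise_label:  # assumption: noise points are ignored
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        return float("nan")
    # Largest pairwise distance inside any single cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest pairwise distance between points of two different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


points = [(0, 0), (0, 1), (5, 0), (5, 1), (9, 9)]
labels = [0, 0, 1, 1, -1]
print(dunn_index_sketch(points, labels))  # 5.0
```

This brute-force version is quadratic in the number of points, so it is only meant to make the definition concrete.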

song_phenotyping.labelling.identify_optimal_parameters_by_sample_size(analysis_by_size_df)[source]

Identify which parameter combinations work best for different sample size ranges.

Parameters:

analysis_by_size_df – DataFrame from analyze_parameter_performance_by_sample_size().

Returns:

Recommendations by sample size bin.

Return type:

dict

song_phenotyping.labelling.information_criterion(embeddings, labels, type=['aic', 'bic', 'log-likelihood'])[source]

Compute information criteria (AIC, BIC) and log-likelihood for clustering results.

Parameters:

embeddings – UMAP embeddings array.
labels – Cluster labels.
type – List of criteria to compute.

Returns:

(log_likelihood, aic, bic). Entries for metrics that were not computed are None.

Return type:

tuple
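AIC and BIC follow the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters, n the number of samples and ln L the log-likelihood. The sketch below applies those formulas. The parameter count assumes a full-covariance Gaussian mixture with one component per cluster, which is an assumption about the model rather than something this reference states.

```python
import math


def gmm_param_count(n_clusters, n_dims):
    """Free parameters of a full-covariance Gaussian mixture (assumed model)."""
    means = n_clusters * n_dims
    covariances = n_clusters * n_dims * (n_dims + 1) // 2
    mixing_weights = n_clusters - 1  # weights sum to 1
    return means + covariances + mixing_weights


def aic_bic(log_likelihood, n_params, n_samples):
    """Standard AIC and BIC; lower values indicate a better trade-off."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_samples) - 2 * log_likelihood
    return aic, bic


k = gmm_param_count(n_clusters=3, n_dims=2)
aic, bic = aic_bic(log_likelihood=-100.0, n_params=k, n_samples=100)
print(k)    # 17
print(aic)  # 234.0
```

BIC penalizes each parameter by ln n instead of 2, so once n exceeds about 7 it favours fewer clusters than AIC does.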

song_phenotyping.labelling.label_bird(save_path, bird, metrics, replace_labels=False, hdbscan_params=None, top_n_for_pdf=20, generate_cluster_pdf=False, metric_weights=None, max_workers=None, run_name='default')[source]

Run the complete Stage D labelling pipeline for one bird.

Iterates over all UMAP embedding files produced by Stage C, searches a grid of HDBSCAN hyperparameters for each embedding, evaluates the resulting clusterings with metrics, and writes a master CSV summary. When generate_cluster_pdf is True, it also writes a PDF report of the top_n_for_pdf highest-scoring parameter combinations.

Parameters:

save_path (str) – Project root directory containing the bird subdirectory.
bird (str) – Bird identifier (e.g. 'or18or24').
metrics (list of str) – Evaluation metrics to compute for each clustering. Supported values: 'silhouette', 'dbi', 'ch', 'dunn', 'nmi', 'aic', 'bic', 'log-likelihood'.
replace_labels (bool, optional) – When True, delete any existing labelling/ directory and re-cluster from scratch. Default is False.
hdbscan_params (list of dict or None, optional) – Custom HDBSCAN parameter grid. Each element is a dict accepted by cluster_embeddings(). Uses DEFAULT_HDBSCAN_GRID when None.
top_n_for_pdf (int, optional) – Number of highest-scoring parameter combinations to include in the PDF report. Default is 20.
generate_cluster_pdf (bool, optional) – When True, write a PDF summary of the top clusterings to results/plots/. Default is False (opt-in, avoids matplotlib PDF overhead on every run).
metric_weights (dict or None, optional) – Per-metric weights for the composite score, e.g. {'silhouette': 2.0, 'dbi': 1.0}. None uses equal weights.
max_workers (int)
run_name (str)

Returns:

True on success, False if a fatal error occurred.

Return type:

bool
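A sketch of preparing the arguments for a custom run follows. The parameter names come from the signature above. The project path is a placeholder, and the metric check is a hypothetical convenience rather than part of the package. The call itself is left commented out so the snippet runs without song_phenotyping installed.

```python
# Metric names listed as supported in the `metrics` parameter description.
SUPPORTED_METRICS = {
    "silhouette", "dbi", "ch", "dunn", "nmi", "aic", "bic", "log-likelihood",
}

# Custom grid as a list of dicts, the form documented for hdbscan_params.
hdbscan_params = [
    {"min_cluster_size": mcs, "min_samples": ms}
    for mcs in (10, 30)
    for ms in (5,)
]

metrics = ["silhouette", "dbi", "dunn"]
metric_weights = {"silhouette": 2.0, "dbi": 1.0, "dunn": 1.0}

# Hypothetical sanity check: catch typos before a long clustering run.
unknown = set(metrics) - SUPPORTED_METRICS
if unknown:
    raise ValueError(f"Unsupported metrics: {sorted(unknown)}")

kwargs = dict(
    save_path="/path/to/project",  # placeholder project root
    bird="or18or24",
    metrics=metrics,
    hdbscan_params=hdbscan_params,
    metric_weights=metric_weights,
    generate_cluster_pdf=True,  # opt in to the PDF report
    top_n_for_pdf=10,
)

# With the package installed, the run would then be:
# from song_phenotyping.labelling import label_bird
# ok = label_bird(**kwargs)
print(len(hdbscan_params))  # 2
```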