Labelling (Stage D)

HDBSCAN clustering and cluster quality evaluation (Stage D).

Takes UMAP embeddings produced by Stage C and clusters them using HDBSCAN over a parameter grid. For each (embedding, HDBSCAN params) combination, cluster labels and quality metrics are saved to HDF5 files. A PDF summary report ranks parameter combinations by a composite score.

Supported quality metrics

'silhouette'

Silhouette coefficient (higher is better; range −1 to 1).

'dbi'

Davies–Bouldin index (lower is better).

'ch'

Calinski–Harabasz index (higher is better).

'dunn'

Dunn index (higher is better).

'nmi'

Normalised mutual information against manual labels, when available.

'aic', 'bic', 'log-likelihood'

Gaussian mixture information criteria.

Public API

song_phenotyping.labelling.DEFAULT_HDBSCAN_GRID: List[HDBSCANParams] = [HDBSCANParams(min_cluster_size=5, min_samples=5), HDBSCANParams(min_cluster_size=5, min_samples=15), HDBSCANParams(min_cluster_size=20, min_samples=5), HDBSCANParams(min_cluster_size=20, min_samples=15), HDBSCANParams(min_cluster_size=60, min_samples=5), HDBSCANParams(min_cluster_size=60, min_samples=15)]

Default HDBSCAN parameter grid explored by label_bird(). Combines min_cluster_size in [5, 20, 60] with min_samples in [5, 15] (6 combinations total).
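The default grid is the Cartesian product of the two value lists above. A minimal sketch of how such a grid could be assembled is shown here; plain dictionaries stand in for HDBSCANParams so the snippet runs without the package, and the variable names are illustrative only.

```python
from itertools import product

# Values taken from the description of DEFAULT_HDBSCAN_GRID above.
min_cluster_sizes = [5, 20, 60]
min_samples_values = [5, 15]

# Plain dicts stand in for HDBSCANParams instances (assumption for illustration).
grid = [
    {"min_cluster_size": mcs, "min_samples": ms}
    for mcs, ms in product(min_cluster_sizes, min_samples_values)
]

for params in grid:
    print(params)
```

The same pattern can be used to build a custom grid for the hdbscan_params argument of label_bird().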

class song_phenotyping.labelling.HDBSCANParams(min_cluster_size=20, min_samples=5)[source]

Bases: object

Hyperparameters for a single HDBSCAN clustering run.

Parameters:
  • min_cluster_size (int, optional) – Minimum number of points required to form a cluster. Larger values produce fewer, broader clusters. Default is 20.

  • min_samples (int, optional) – Number of samples in the neighbourhood for a point to be considered a core point. Higher values make the algorithm more conservative (more points classified as noise). Default is 5.

Examples

>>> p = HDBSCANParams(min_cluster_size=10, min_samples=3)
>>> p.to_dict()
{'min_cluster_size': 10, 'min_samples': 3}
classmethod from_dict(data)[source]

Construct an HDBSCANParams from a dictionary.

Parameters:

data (dict) – Must contain min_cluster_size and min_samples.

Return type:

HDBSCANParams

min_cluster_size: int = 20
min_samples: int = 5
to_dict()[source]

Return parameters as a plain dictionary.

Returns:

Keys: min_cluster_size, min_samples.

Return type:

dict
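The to_dict() / from_dict() pair describes a simple round trip. The sketch below reproduces that behaviour with a stand-in dataclass named ParamsSketch (a hypothetical name); it is an illustration of the documented interface, not the library's source.

```python
from dataclasses import asdict, dataclass


@dataclass
class ParamsSketch:
    """Stand-in for HDBSCANParams with the documented fields and defaults."""

    min_cluster_size: int = 20
    min_samples: int = 5

    def to_dict(self) -> dict:
        # Return the parameters as a plain dictionary.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ParamsSketch":
        # Both keys are required, as stated in the from_dict() description.
        return cls(
            min_cluster_size=data["min_cluster_size"],
            min_samples=data["min_samples"],
        )


p = ParamsSketch(min_cluster_size=10, min_samples=3)
restored = ParamsSketch.from_dict(p.to_dict())
print(restored == p)
```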

song_phenotyping.labelling.aggregate_raw_scores_across_birds(save_path, birds=None, top_n=10)[source]

Aggregate raw clustering scores across multiple birds for cross-bird comparison.

Args:

  • save_path – Root directory containing bird subdirectories

  • birds – List of bird names to include (all birds if None)

  • top_n – Number of top parameter combinations to include per bird

Returns:

pd.DataFrame: Aggregated raw scores with bird identifiers

song_phenotyping.labelling.analyze_parameter_performance_by_sample_size(aggregated_df, sample_size_bins=None, metrics=None)[source]

Analyze how parameter combinations perform across different sample size ranges.

Args:

  • aggregated_df – DataFrame from aggregate_raw_scores_across_birds

  • sample_size_bins – List of bin edges for sample size categorization

  • metrics – List of metrics to analyze

Returns:

pd.DataFrame: Analysis results by sample size bins

song_phenotyping.labelling.clear_clustering_outputs(save_path, bird=None, confirm=True)[source]

Clear clustering label files and images to avoid clutter. Updated to work with new directory structure (syllable_data instead of data).

Args:

  • save_path – Root directory containing bird data

  • bird – Specific bird to clear (all birds if None)

  • confirm – Whether to ask for confirmation before deleting

Returns:

bool: True if successful, False otherwise

song_phenotyping.labelling.cluster_embeddings(embeddings, hashes, method='hdbscan', cluster_params={}, return_scores=False, path_to_clusters=None, path_to_imgs=None, umap_id=None, true_labels=None, metrics=None)[source]

Cluster embeddings using a specified algorithm and hyperparameters.

Parameters:
  • embeddings (numpy.ndarray) – UMAP embedding coordinates, shape (n_samples, n_components).

  • hashes (array-like) – Per-sample hash identifiers used to cross-reference syllables across pipeline stages.

  • method ({'hdbscan', 'kmeans'}, optional) – Clustering algorithm. Default is 'hdbscan'.

  • cluster_params (dict, optional) – Keyword arguments forwarded directly to the chosen clustering algorithm (e.g. {'min_cluster_size': 10} for HDBSCAN).

  • return_scores (bool, optional) – When True, evaluate the resulting partition with metrics and include the scores in the return value. Default is False.

  • path_to_clusters (str or None, optional) – Directory where cluster label HDF5 files are saved. No files are written when None.

  • path_to_imgs (str or None, optional) – Directory where cluster scatter plots are saved. No images are written when None.

  • umap_id (str or None, optional) – Short identifier for the UMAP run (embedded in output filenames).

  • true_labels (numpy.ndarray or None, optional) – Ground-truth syllable labels used for NMI evaluation.

  • metrics (list of str or None, optional) – Metric names to compute when return_scores is True. See compute_scores() for supported values.

Returns:

  • labels (numpy.ndarray) – Cluster label per sample (-1 = noise for HDBSCAN).

  • hashes (numpy.ndarray) – Input hashes unchanged.

  • scores (dict or None) – Metric scores if return_scores is True, else None.

  • label_path (str or None) – Path of the saved label HDF5 file, or None if not saved.

  • plot_path (str or None) – Path of the saved scatter plot, or None if not saved.
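Because HDBSCAN marks unclustered points with -1, the returned labels usually need a small amount of post-processing before cluster counts are compared. The helper below is a hypothetical sketch (summarise_labels is not part of the package) that counts clusters and noise points in a label sequence.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN marks unclustered points with -1


def summarise_labels(labels):
    """Count clusters and noise points in a sequence of cluster labels."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(NOISE_LABEL, 0)
    n_total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / n_total if n_total else 0.0,
        "cluster_sizes": dict(counts),
    }


summary = summarise_labels([0, 0, 1, -1, 1, 2, -1, 0])
print(summary)
```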

song_phenotyping.labelling.compute_composite_score(summary_df, metrics, n_syls=None, weights=None, target_clusters=20, penalty_decay=0.1, use_cluster_penalty=False)[source]

Compute composite scores with normalization and optional cluster count penalty.

Args:

  • summary_df – DataFrame with raw metric scores

  • metrics – List of metrics to include in composite score

  • n_syls – List of cluster counts (extracted from summary_df if None)

  • weights – Weights for each metric (equal weights if None)

  • target_clusters – Target number of clusters for penalty function

  • penalty_decay – Decay rate for cluster penalty

  • use_cluster_penalty – Whether to apply cluster count penalty (default: False)

Returns:

pd.DataFrame: DataFrame with added normalized scores and composite score

Parameters:

n_syls (list)
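The normalization scheme is not spelled out here. As a rough sketch under stated assumptions, the code below min-max scales each metric across parameter combinations, flips metrics where lower is better (the LOWER_IS_BETTER set is an assumption based on the standard definitions of 'dbi', 'aic' and 'bic'), and takes a weighted mean. The function names are hypothetical and the library may normalize differently.

```python
# Assumption: these metrics improve as they decrease.
LOWER_IS_BETTER = {"dbi", "aic", "bic"}


def min_max(values):
    """Scale values to [0, 1]; a constant metric carries no information."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # arbitrary choice for the degenerate case
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(rows, metrics, weights=None):
    """Weighted mean of normalized metrics, one score per row."""
    weights = weights or {m: 1.0 for m in metrics}
    total_weight = sum(weights[m] for m in metrics)
    normalized = {}
    for m in metrics:
        scaled = min_max([row[m] for row in rows])
        if m in LOWER_IS_BETTER:
            scaled = [1.0 - s for s in scaled]  # flip so higher is always better
        normalized[m] = scaled
    return [
        sum(weights[m] * normalized[m][i] for m in metrics) / total_weight
        for i in range(len(rows))
    ]


rows = [
    {"silhouette": 0.25, "dbi": 1.0},
    {"silhouette": 0.75, "dbi": 2.0},
    {"silhouette": 0.50, "dbi": 1.5},
]
equal = composite_scores(rows, ["silhouette", "dbi"])
weighted = composite_scores(
    rows, ["silhouette", "dbi"], weights={"silhouette": 2.0, "dbi": 1.0}
)
print(equal, weighted)
```

With equal weights the three rows tie, because the best silhouette comes with the worst Davies–Bouldin index; doubling the silhouette weight breaks the tie in favour of the second row.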

song_phenotyping.labelling.compute_cross_bird_composite_scores(aggregated_df, metrics, weights=None)[source]

Compute new composite scores based on raw scores across all birds.

Args:

  • aggregated_df – DataFrame with raw scores from multiple birds

  • metrics – List of metrics to include in composite score

  • weights – Metric weights (equal if None)

Returns:

pd.DataFrame: DataFrame with cross-bird composite scores

song_phenotyping.labelling.compute_metric_ranking(summary_df, metrics)[source]

Compute individual metric rankings and aggregate rank.

Args:

  • summary_df – DataFrame with metric scores

  • metrics – List of metrics to rank

Returns:

pd.DataFrame: DataFrame with added ranking columns
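Rank aggregation can be illustrated without pandas. The sketch below assumes that rank 1 is best, that ties are broken by original order, and that the aggregate is the mean of the per-metric ranks; none of these choices is confirmed by the description above, and the function names are hypothetical.

```python
def rank_values(values, higher_is_better=True):
    """Return 1-based ranks (1 = best); ties keep their original order."""
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=higher_is_better
    )
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def aggregate_rank(rows, metrics, lower_is_better=("dbi",)):
    """Mean of the per-metric ranks for each row."""
    per_metric = {
        m: rank_values(
            [row[m] for row in rows], higher_is_better=m not in lower_is_better
        )
        for m in metrics
    }
    return [
        sum(per_metric[m][i] for m in metrics) / len(metrics)
        for i in range(len(rows))
    ]


rows = [
    {"silhouette": 0.25, "dbi": 2.0},
    {"silhouette": 0.75, "dbi": 1.0},
    {"silhouette": 0.50, "dbi": 1.5},
]
mean_ranks = aggregate_rank(rows, ["silhouette", "dbi"])
print(mean_ranks)
```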

song_phenotyping.labelling.compute_scores(embeddings, labels, metrics, true_labels=None)[source]

Compute clustering evaluation metrics for given embeddings and labels.

Parameters:
  • embeddings (numpy.ndarray) – UMAP embedding coordinates, shape (n_samples, n_components).

  • labels (list) – Cluster label assigned to each sample.

  • metrics (list of str) – Names of metrics to compute. Supported values: 'silhouette', 'dbi', 'ch', 'dunn', 'nmi', 'aic', 'bic', 'log-likelihood'.

  • true_labels (list or None, optional) – Ground-truth syllable labels used for NMI calculation. When None, 'nmi' is set to nan.

Returns:

Mapping of metric name → score. Metrics that cannot be computed (e.g. only one cluster present) are set to numpy.nan.

Return type:

dict
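The nan fallback can be pictured as a thin dispatch layer around the individual metric functions. The following is a sketch only: compute_scores_sketch, the scorers mapping, and toy_metric are hypothetical, math.nan stands in for numpy.nan, and treating noise (label -1) as "not a cluster" is an assumption.

```python
import math


def compute_scores_sketch(points, labels, metrics, scorers):
    """Evaluate each requested metric, falling back to nan when it cannot be computed."""
    # Assumption: noise points (label -1) do not count as a cluster.
    n_clusters = len(set(labels) - {-1})
    scores = {}
    for name in metrics:
        scorer = scorers.get(name)
        if scorer is None or n_clusters < 2:
            scores[name] = math.nan
            continue
        try:
            scores[name] = float(scorer(points, labels))
        except ValueError:
            scores[name] = math.nan
    return scores


def toy_metric(points, labels):
    """Placeholder metric used only to exercise the dispatch logic."""
    return len(set(labels)) / len(points)


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
two_clusters = compute_scores_sketch(points, [0, 0, 1, 1], ["toy"], {"toy": toy_metric})
one_cluster = compute_scores_sketch(points, [0, 0, 0, 0], ["toy"], {"toy": toy_metric})
print(two_clusters, one_cluster)
```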

song_phenotyping.labelling.create_cluster_summary_pdf(master_summary_df, bird, save_path, top_n=None)[source]

Create comprehensive PDF report with cluster visualizations and scores.

Args:

  • master_summary_df – DataFrame with all clustering results

  • bird – Bird identifier

  • save_path – Directory to save PDF

  • top_n – Number of top results to include (all if None)

Returns:

bool: True if successful, False otherwise

song_phenotyping.labelling.dunn_index(embeddings, labels)[source]

Compute Dunn index for clustering evaluation. Higher values indicate better clustering (compact clusters, well-separated).

Args:

  • embeddings – UMAP embeddings array

  • labels – Cluster labels

Returns:

float: Dunn index value
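For reference, one common definition of the Dunn index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The pure-Python sketch below follows that definition; excluding noise points (label -1) and the nan/inf edge cases are assumptions, and the library's variant may differ.

```python
from itertools import combinations
from math import dist


def dunn_index_sketch(points, labels):
    """Smallest inter-cluster distance divided by largest intra-cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        if label == -1:  # assumption: HDBSCAN noise is excluded
            continue
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        return float("nan")  # undefined with fewer than two clusters

    # Largest pairwise distance within any single cluster.
    max_diameter = max(
        (dist(a, b) for members in clusters.values() for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest pairwise distance between points of two different clusters.
    min_separation = min(
        dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point or exact duplicates
    return min_separation / max_diameter


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (2.5, 0.5)]
labels = [0, 0, 1, 1, -1]
print(dunn_index_sketch(points, labels))
```

This brute-force version is quadratic in the number of points, so it is only suitable for small inputs.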

song_phenotyping.labelling.identify_optimal_parameters_by_sample_size(analysis_by_size_df)[source]

Identify which parameter combinations work best for different sample size ranges.

Args:

analysis_by_size_df: DataFrame from analyze_parameter_performance_by_sample_size

Returns:

dict: Recommendations by sample size bin

song_phenotyping.labelling.information_criterion(embeddings, labels, type=['aic', 'bic', 'log-likelihood'])[source]

Compute information criteria (AIC, BIC) and log-likelihood for clustering results.

Args:

  • embeddings – UMAP embeddings array

  • labels – Cluster labels

  • type – List of criteria to compute

Returns:

tuple: (log_likelihood, aic, bic) - None for uncomputed metrics
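Given a log-likelihood, both criteria follow from their standard definitions: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters and n the number of samples. The sketch below applies these formulas; the parameter count assumes a full-covariance Gaussian mixture, which may not match the model the library actually fits, and the function names are hypothetical.

```python
import math


def gmm_n_params(n_components, n_dims):
    """Free parameters of a full-covariance Gaussian mixture (assumed model)."""
    means = n_components * n_dims
    covariances = n_components * n_dims * (n_dims + 1) // 2
    mixing_weights = n_components - 1
    return means + covariances + mixing_weights


def aic_bic(log_likelihood, n_params, n_samples):
    """Standard AIC and BIC from a total log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_samples) - 2 * log_likelihood
    return aic, bic


k = gmm_n_params(n_components=3, n_dims=2)
aic, bic = aic_bic(log_likelihood=-100.0, n_params=k, n_samples=200)
print(k, aic, bic)
```

Both criteria are lower-is-better; BIC penalizes extra parameters more heavily than AIC once n exceeds about 7 samples (ln n > 2).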

song_phenotyping.labelling.label_bird(save_path, bird, metrics, replace_labels=False, hdbscan_params=None, top_n_for_pdf=20, generate_cluster_pdf=False, metric_weights=None, max_workers=None, run_name='default')[source]

Run the complete Stage D labelling pipeline for one bird.

Iterates over all UMAP embedding files produced by Stage C, searches a grid of HDBSCAN hyperparameters for each embedding, evaluates the resulting clusterings with metrics, and writes a master CSV summary. When generate_cluster_pdf is True, it also writes a PDF report of the top_n_for_pdf highest-scoring parameter combinations.

Parameters:
  • save_path (str) – Project root directory containing the bird subdirectory.

  • bird (str) – Bird identifier (e.g. 'or18or24').

  • metrics (list of str) – Evaluation metrics to compute for each clustering. Supported values: 'silhouette', 'dbi', 'ch', 'dunn', 'nmi', 'aic', 'bic', 'log-likelihood'.

  • replace_labels (bool, optional) – When True, delete any existing labelling/ directory and re-cluster from scratch. Default is False.

  • hdbscan_params (list of dict or None, optional) – Custom HDBSCAN parameter grid. Each element is a dict accepted by cluster_embeddings(). Uses DEFAULT_HDBSCAN_GRID when None.

  • top_n_for_pdf (int, optional) – Number of highest-scoring parameter combinations to include in the PDF report. Default is 20.

  • generate_cluster_pdf (bool, optional) – When True, write a PDF summary of the top clusterings to results/plots/. Default is False (opt-in, avoids matplotlib PDF overhead on every run).

  • metric_weights (dict or None, optional) – Per-metric weights for the composite score, e.g. {'silhouette': 2.0, 'dbi': 1.0}. None uses equal weights.

  • max_workers (int or None, optional) – Maximum number of parallel workers. Default is None.

  • run_name (str)

Returns:

True on success, False if a fatal error occurred.

Return type:

bool
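A call might look like the sketch below. The project path is a placeholder, the grid values are arbitrary, and the import is deferred into a function so the snippet can be read and checked without the package installed; only the keyword names come from the signature above.

```python
# Keyword arguments mirroring the documented signature of label_bird().
stage_d_args = {
    "save_path": "/path/to/project",  # placeholder project root
    "bird": "or18or24",
    "metrics": ["silhouette", "dbi", "ch"],
    # Custom grid as a list of dicts, as described for hdbscan_params.
    "hdbscan_params": [
        {"min_cluster_size": 10, "min_samples": 3},
        {"min_cluster_size": 30, "min_samples": 10},
    ],
    "generate_cluster_pdf": True,  # opt in to the PDF report
    "top_n_for_pdf": 10,
    "metric_weights": {"silhouette": 2.0, "dbi": 1.0, "ch": 1.0},
}


def run_stage_d(args):
    """Run Stage D for one bird; returns True on success, False on a fatal error."""
    from song_phenotyping.labelling import label_bird  # requires the package

    return label_bird(**args)
```

Calling run_stage_d(stage_d_args) would then run the pipeline for the bird named in the arguments.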

song_phenotyping.labelling.load_labels(label_save_path)[source]

Load cluster labels, hashes, and evaluation scores from HDF5 file.

Args:

label_save_path: Path to HDF5 file

Returns:

tuple: (labels, hashes, scores) or (None, None, None) on error

Parameters:

label_save_path (str)

song_phenotyping.labelling.load_master_summary(bird_path)[source]

Load master summary DataFrame from CSV.

Args:

bird_path: Bird root directory

Returns:

pd.DataFrame: Loaded DataFrame or empty DataFrame if not found

Parameters:

bird_path (str)

song_phenotyping.labelling.load_umap_embeddings(embedding_path)[source]

Load embeddings, hashes, and labels from HDF5 file using PyTables.

Args:

embedding_path: Path to HDF5 embedding file

Returns:

tuple: (embeddings, hashes, labels) or (None, None, None) on error

Parameters:

embedding_path (str)

song_phenotyping.labelling.main(save_path)[source]

Main function to run the clustering pipeline.

Parameters:

save_path (str)

Return type:

None

song_phenotyping.labelling.parse_embedding_filename(filename)[source]

Parse UMAP parameters from embedding filename. Updated to handle new filename format with sample info.

Args:

filename: Embedding filename (e.g., 'euclidean_10neighbors_0.1dist_full3301.h5')

Returns:

tuple: (metric, n_neighbors, min_dist, umap_id) or None on error

Parameters:

filename (str)
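The example filename suggests a metric_Nneighbors_Ddist_tag.h5 pattern. The regular expression below is a sketch inferred from that single example; the real parser may accept other layouts, and because the construction of umap_id is not described here, the sketch returns the trailing sample tag instead. The names parse_sketch and EMBEDDING_PATTERN are hypothetical.

```python
import re

# Pattern inferred from 'euclidean_10neighbors_0.1dist_full3301.h5' (assumption).
EMBEDDING_PATTERN = re.compile(
    r"^(?P<metric>[A-Za-z]+)"
    r"_(?P<n_neighbors>\d+)neighbors"
    r"_(?P<min_dist>\d*\.?\d+)dist"
    r"(?:_(?P<sample_tag>.+))?"
    r"\.h5$"
)


def parse_sketch(filename):
    """Return (metric, n_neighbors, min_dist, sample_tag), or None if no match."""
    match = EMBEDDING_PATTERN.match(filename)
    if match is None:
        return None
    return (
        match.group("metric"),
        int(match.group("n_neighbors")),
        float(match.group("min_dist")),
        match.group("sample_tag"),
    )


print(parse_sketch("euclidean_10neighbors_0.1dist_full3301.h5"))
print(parse_sketch("not_an_embedding.h5"))
```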

song_phenotyping.labelling.plot_summary_matrix(summary_df, save_path, metrics, figsize=(12, 8))[source]

Create heatmap visualization of normalized metric scores.

Args:

  • summary_df – DataFrame with normalized scores

  • save_path – Path to save the plot

  • metrics – List of metrics to include

  • figsize – Figure size tuple

Returns:

None

song_phenotyping.labelling.plot_umap(embeddings, labels, save=None, figsize=(12, 10), title=None)[source]

Create scatter plot of UMAP embeddings colored by cluster labels.

Args:

  • embeddings – UMAP embeddings array (N x 2)

  • labels – Cluster labels

  • save – Path to save plot (optional)

  • figsize – Figure size tuple

  • title – Plot title (optional)

Returns:

None

Parameters:

save (str)

song_phenotyping.labelling.remove_directory(path)[source]

Safely remove directory and all contents.

Args:

path: Directory path to remove

Returns:

bool: True if successful, False otherwise

Parameters:

path (str)

song_phenotyping.labelling.reorder_columns(master_summary_df, metrics)[source]

Reorder DataFrame columns for better readability and analysis.

Args:

  • master_summary_df – DataFrame to reorder

  • metrics – List of metrics used

Returns:

pd.DataFrame: Reordered DataFrame

Parameters:
  • master_summary_df (DataFrame)

  • metrics (list)

song_phenotyping.labelling.save_cross_bird_analysis(aggregated_df, analysis_by_size, save_path, filename_prefix='cross_bird')[source]

Save cross-bird analysis results to CSV files.

Args:

  • aggregated_df – Aggregated raw scores DataFrame

  • analysis_by_size – Sample size analysis DataFrame

  • save_path – Directory to save files

  • filename_prefix – Prefix for output filenames

Returns:

dict: Paths to saved files

Parameters:
  • save_path (str)

  • filename_prefix (str)

song_phenotyping.labelling.save_labels(label_save_path, labels, hashes, scores)[source]

Save cluster labels, hashes, and evaluation scores to HDF5 file.

Args:

  • label_save_path – Path where to save the HDF5 file

  • labels – Cluster labels array

  • hashes – Sample hash identifiers

  • scores – Dictionary of evaluation scores

Returns:

bool: True if successful, False otherwise

Parameters:

label_save_path (str)

song_phenotyping.labelling.save_master_summary(master_summary_df, bird_path)[source]

Save master summary DataFrame to CSV under the results/ directory.

Args:

  • master_summary_df – DataFrame to save

  • bird_path – Bird root directory

Returns:

bool: True if successful, False otherwise

Parameters:

bird_path (str)

song_phenotyping.labelling.score_cluster_penalty(n, target=20, decay_rate=0.1)[source]

Compute penalty for cluster count deviation from target.

Args:

  • n – Number of clusters

  • target – Target number of clusters

  • decay_rate – Exponential decay rate for penalty

Returns:

float: Penalty multiplier (1.0 = no penalty, <1.0 = penalty applied)
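The exact functional form of the penalty is not given here. One form consistent with the description (a multiplier of 1.0 at the target that decays exponentially with the deviation) is exp(−decay_rate · |n − target|); the sketch below assumes that form, and the function name is hypothetical.

```python
import math


def cluster_penalty_sketch(n, target=20, decay_rate=0.1):
    """Assumed form: 1.0 at the target, decaying exponentially with |n - target|."""
    return math.exp(-decay_rate * abs(n - target))


for n_clusters in (20, 25, 30, 10):
    print(n_clusters, round(cluster_penalty_sketch(n_clusters), 3))
```

Under this assumption the penalty is symmetric, so 10 and 30 clusters are penalized equally for a target of 20.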

song_phenotyping.labelling.search_cluster_params(embeddings, hashes, algorithm, umap_id, directory_path, figure_path, candidate_params=None, true_labels=None, metrics=None, sample_size=None, use_parallel=True, max_workers=None)[source]

Search through candidate clustering parameters and evaluate performance.

Args:

  • embeddings – UMAP embeddings array

  • hashes – Sample hash identifiers

  • algorithm – Clustering algorithm name

  • umap_id – UMAP parameter identifier

  • directory_path – Base directory for saving results

  • figure_path – Directory for saving plots

  • candidate_params – List of parameter dictionaries to test

  • true_labels – Ground truth labels for evaluation

  • metrics – List of evaluation metrics to compute

  • sample_size – Total number of samples (for efficiency, pass if known)

  • use_parallel – Whether to use parallel processing (default: True)

  • max_workers – Maximum number of parallel workers (default: CPU count)

Returns:

pd.DataFrame: Summary of results for all parameter combinations

song_phenotyping.labelling.select_best_params(summary_df, selection_method='composite', top_n=1)[source]

Select best parameter combinations based on specified method.

Args:

  • summary_df – DataFrame with scores and composite scores

  • selection_method – Method for selection ('composite' or specific metric name)

  • top_n – Number of top parameter combinations to return

Returns:

pd.DataFrame: Top N parameter combinations sorted by selection criteria