Labelling (Stage D)
HDBSCAN clustering and cluster quality evaluation (Stage D).
Takes UMAP embeddings produced by Stage C and clusters them using HDBSCAN
over a parameter grid. For each (embedding, HDBSCAN params) combination,
cluster labels and quality metrics are saved to HDF5 files. A PDF summary
report ranks parameter combinations by a composite score.
Supported quality metrics
'silhouette': Silhouette coefficient (higher is better; range −1 to 1).
'dbi': Davies–Bouldin index (lower is better).
'ch': Calinski–Harabasz index (higher is better).
'dunn': Dunn index (higher is better).
'nmi': Normalised mutual information against manual labels, when available.
'aic', 'bic', 'log-likelihood': Gaussian mixture information criteria.
Public API
- HDBSCANParams: HDBSCAN hyperparameter dataclass
- DEFAULT_HDBSCAN_GRID: default parameter grid used by label_bird()
- label_bird(): Stage D entry point
- compute_scores(): evaluate cluster quality metrics
- cluster_embeddings(): cluster a single embedding array
- song_phenotyping.labelling.DEFAULT_HDBSCAN_GRID: List[HDBSCANParams] = [HDBSCANParams(min_cluster_size=5, min_samples=5), HDBSCANParams(min_cluster_size=5, min_samples=15), HDBSCANParams(min_cluster_size=20, min_samples=5), HDBSCANParams(min_cluster_size=20, min_samples=15), HDBSCANParams(min_cluster_size=60, min_samples=5), HDBSCANParams(min_cluster_size=60, min_samples=15)]
Default HDBSCAN parameter grid explored by label_bird(). Combines min_cluster_size in [5, 20, 60] with min_samples in [5, 15] (6 combinations total).
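The default grid is the Cartesian product of the two value lists. A minimal sketch of how such a grid could be built is shown here; `GridParams` is a hypothetical stand-in that only mirrors the two documented fields, and the package may construct its grid differently.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GridParams:
    """Hypothetical stand-in mirroring the two documented HDBSCANParams fields."""

    min_cluster_size: int = 20
    min_samples: int = 5


# Cartesian product of the documented value lists: 3 x 2 = 6 combinations.
grid = [
    GridParams(min_cluster_size=mcs, min_samples=ms)
    for mcs, ms in product([5, 20, 60], [5, 15])
]

for params in grid:
    print(params)
```

Because `product` varies the last list fastest, the combinations come out in the same order as the documented default grid.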
- class song_phenotyping.labelling.HDBSCANParams(min_cluster_size=20, min_samples=5)[source]
Bases: object
Hyperparameters for a single HDBSCAN clustering run.
- Parameters:
min_cluster_size (int, optional) – Minimum number of points required to form a cluster. Larger values produce fewer, broader clusters. Default is 20.
min_samples (int, optional) – Number of samples in the neighbourhood for a point to be considered a core point. Higher values make the algorithm more conservative (more points classified as noise). Default is 5.
Examples
>>> p = HDBSCANParams(min_cluster_size=10, min_samples=3)
>>> p.to_dict()
{'min_cluster_size': 10, 'min_samples': 3}
- classmethod from_dict(data)[source]
Construct a HDBSCANParams from a dictionary.
- Parameters:
data (dict) – Must contain min_cluster_size and min_samples.
- Return type:
HDBSCANParams
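The dictionary round trip can be sketched with a plain dataclass. `ParamsSketch` is a hypothetical stand-in with the documented fields and defaults; it is not the package's implementation, and the strict key lookup is an assumption based on the "must contain" wording.

```python
from dataclasses import asdict, dataclass


@dataclass
class ParamsSketch:
    """Hypothetical stand-in for HDBSCANParams (same fields and defaults)."""

    min_cluster_size: int = 20
    min_samples: int = 5

    def to_dict(self) -> dict:
        # Field names become dictionary keys.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ParamsSketch":
        # Assumption: both keys are required, so a missing key raises KeyError.
        return cls(
            min_cluster_size=int(data["min_cluster_size"]),
            min_samples=int(data["min_samples"]),
        )


original = ParamsSketch(min_cluster_size=10, min_samples=3)
restored = ParamsSketch.from_dict(original.to_dict())
print(restored == original)
```

A round trip like this is convenient when parameter sets have to be stored alongside results, for example as columns in a summary table.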
- song_phenotyping.labelling.aggregate_raw_scores_across_birds(save_path, birds=None, top_n=10)[source]
Aggregate raw clustering scores across multiple birds for cross-bird comparison.
- Args:
save_path: Root directory containing bird subdirectories
birds: List of bird names to include (all birds if None)
top_n: Number of top parameter combinations to include per bird
- Returns:
pd.DataFrame: Aggregated raw scores with bird identifiers
- song_phenotyping.labelling.analyze_parameter_performance_by_sample_size(aggregated_df, sample_size_bins=None, metrics=None)[source]
Analyze how parameter combinations perform across different sample size ranges.
- Args:
aggregated_df: DataFrame from aggregate_raw_scores_across_birds
sample_size_bins: List of bin edges for sample size categorization
metrics: List of metrics to analyze
- Returns:
pd.DataFrame: Analysis results by sample size bins
- song_phenotyping.labelling.clear_clustering_outputs(save_path, bird=None, confirm=True)[source]
Clear clustering label files and images to avoid clutter. Updated to work with new directory structure (syllable_data instead of data).
- Args:
save_path: Root directory containing bird data
bird: Specific bird to clear (all birds if None)
confirm: Whether to ask for confirmation before deleting
- Returns:
bool: True if successful, False otherwise
- song_phenotyping.labelling.cluster_embeddings(embeddings, hashes, method='hdbscan', cluster_params={}, return_scores=False, path_to_clusters=None, path_to_imgs=None, umap_id=None, true_labels=None, metrics=None)[source]
Cluster embeddings using a specified algorithm and hyperparameters.
- Parameters:
embeddings (numpy.ndarray) – UMAP embedding coordinates, shape (n_samples, n_components).
hashes (array-like) – Per-sample hash identifiers used to cross-reference syllables across pipeline stages.
method ({'hdbscan', 'kmeans'}, optional) – Clustering algorithm. Default is 'hdbscan'.
cluster_params (dict, optional) – Keyword arguments forwarded directly to the chosen clustering algorithm (e.g. {'min_cluster_size': 10} for HDBSCAN).
return_scores (bool, optional) – When True, evaluate the resulting partition with metrics and include the scores in the return value. Default is False.
path_to_clusters (str or None, optional) – Directory where cluster label HDF5 files are saved. No files are written when None.
path_to_imgs (str or None, optional) – Directory where cluster scatter plots are saved. No images are written when None.
umap_id (str or None, optional) – Short identifier for the UMAP run (embedded in output filenames).
true_labels (numpy.ndarray or None, optional) – Ground-truth syllable labels used for NMI evaluation.
metrics (list of str or None, optional) – Metric names to compute when return_scores is True. See compute_scores() for supported values.
- Returns:
labels (numpy.ndarray) – Cluster label per sample (-1 = noise for HDBSCAN).
hashes (numpy.ndarray) – Input hashes unchanged.
scores (dict or None) – Metric scores if return_scores is True, else None.
label_path (str or None) – Path of the saved label HDF5 file, or None if not saved.
plot_path (str or None) – Path of the saved scatter plot, or None if not saved.
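Since HDBSCAN marks noise with the label -1, downstream code usually has to separate noise from real clusters before counting them. The helper below is a small sketch of that bookkeeping; `summarize_labels` is a hypothetical name and the label values are made up for illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (n_clusters, noise_fraction, sizes) for a sequence of cluster labels.

    Assumes the HDBSCAN convention documented above: -1 marks noise.
    """
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # remove noise so it is not counted as a cluster
    total = n_noise + sum(counts.values())
    noise_fraction = n_noise / total if total else 0.0
    return len(counts), noise_fraction, dict(counts)


labels = [0, 0, 1, -1, 1, 1, -1, 2]  # made-up labels for illustration
n_clusters, noise_fraction, sizes = summarize_labels(labels)
print(n_clusters, noise_fraction, sizes)
```

The same function works on the `labels` array returned by cluster_embeddings(), because it only iterates over the values.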
- song_phenotyping.labelling.compute_composite_score(summary_df, metrics, n_syls=None, weights=None, target_clusters=20, penalty_decay=0.1, use_cluster_penalty=False)[source]
Compute composite scores with normalization and optional cluster count penalty.
- Args:
summary_df: DataFrame with raw metric scores
metrics: List of metrics to include in composite score
n_syls: List of cluster counts (extracted from summary_df if None)
weights: Weights for each metric (equal weights if None)
target_clusters: Target number of clusters for penalty function
penalty_decay: Decay rate for cluster penalty
use_cluster_penalty: Whether to apply cluster count penalty (default: False)
- Returns:
pd.DataFrame: DataFrame with added normalized scores and composite score
- Parameters:
n_syls (list)
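The documentation does not state which normalization is used. The sketch below assumes min-max scaling per metric, inverts metrics where lower raw values are better, and takes a weighted mean. All names (`min_max`, `composite_scores`, `LOWER_IS_BETTER`) are hypothetical, and the real function operates on a DataFrame rather than a list of dicts.

```python
import math

# Assumption: metrics where lower raw values are better are inverted after scaling.
LOWER_IS_BETTER = {"dbi", "aic", "bic"}


def min_max(values):
    """Scale finite values to [0, 1]; NaNs stay NaN, constant columns map to 0.5."""
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return [math.nan] * len(values)
    lo, hi = min(finite), max(finite)
    if hi == lo:
        return [math.nan if math.isnan(v) else 0.5 for v in values]
    return [math.nan if math.isnan(v) else (v - lo) / (hi - lo) for v in values]


def composite_scores(rows, metrics, weights=None):
    """rows: list of dicts of raw scores. Returns one composite score per row."""
    weights = weights or {}
    normalized = {}
    for m in metrics:
        scaled = min_max([row[m] for row in rows])
        if m in LOWER_IS_BETTER:
            scaled = [1.0 - s for s in scaled]  # NaN stays NaN
        normalized[m] = scaled
    # Missing weights default to 1.0, which gives equal weighting.
    total_weight = sum(weights.get(m, 1.0) for m in metrics)
    return [
        sum(weights.get(m, 1.0) * normalized[m][i] for m in metrics) / total_weight
        for i in range(len(rows))
    ]


rows = [
    {"silhouette": 0.2, "dbi": 1.5},
    {"silhouette": 0.6, "dbi": 0.5},
    {"silhouette": 0.4, "dbi": 1.0},
]
scores = composite_scores(rows, ["silhouette", "dbi"])
print(scores)
```

In this toy input the second row has both the highest silhouette and the lowest Davies–Bouldin index, so it receives the highest composite score.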
- song_phenotyping.labelling.compute_cross_bird_composite_scores(aggregated_df, metrics, weights=None)[source]
Compute new composite scores based on raw scores across all birds.
- Args:
aggregated_df: DataFrame with raw scores from multiple birds
metrics: List of metrics to include in composite score
weights: Metric weights (equal if None)
- Returns:
pd.DataFrame: DataFrame with cross-bird composite scores
- song_phenotyping.labelling.compute_metric_ranking(summary_df, metrics)[source]
Compute individual metric rankings and aggregate rank.
- Args:
summary_df: DataFrame with metric scores
metrics: List of metrics to rank
- Returns:
pd.DataFrame: DataFrame with added ranking columns
- song_phenotyping.labelling.compute_scores(embeddings, labels, metrics, true_labels=None)[source]
Compute clustering evaluation metrics for given embeddings and labels.
- Parameters:
embeddings (numpy.ndarray) – UMAP embedding coordinates, shape (n_samples, n_components).
labels (list) – Cluster label assigned to each sample.
metrics (list of str) – Names of metrics to compute. Supported values: 'silhouette', 'dbi', 'ch', 'dunn', 'nmi', 'aic', 'bic', 'log-likelihood'.
true_labels (list or None, optional) – Ground-truth syllable labels used for NMI calculation. When None, 'nmi' is set to nan.
- Returns:
Mapping of metric name → score. Metrics that cannot be computed (e.g. only one cluster present) are set to numpy.nan.
- Return type:
dict
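The NaN-on-failure contract can be sketched as a small dispatcher. This is an illustration only: `safe_scores`, the `metric_fns` registry, and the toy metric are hypothetical and not part of the package, which presumably calls real metric implementations.

```python
import math


def safe_scores(embeddings, labels, metrics, metric_fns, true_labels=None):
    """Evaluate each requested metric, falling back to NaN when it cannot be computed.

    metric_fns maps a metric name to a callable (hypothetical registry).
    """
    scores = {}
    n_clusters = len(set(labels))
    for name in metrics:
        try:
            if name == "nmi":
                if true_labels is None:
                    raise ValueError("no ground-truth labels")
                value = metric_fns[name](true_labels, labels)
            else:
                if n_clusters < 2:
                    raise ValueError("need at least two clusters")
                value = metric_fns[name](embeddings, labels)
            scores[name] = float(value)
        except (KeyError, ValueError, ZeroDivisionError):
            scores[name] = math.nan  # metric unavailable for this partition
    return scores


def largest_cluster_share(embeddings, labels):
    """Toy metric for this demo only: fraction of samples in the largest cluster."""
    return max(labels.count(label) for label in set(labels)) / len(labels)


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
registry = {"toy": largest_cluster_share}
two_clusters = safe_scores(points, [0, 0, 1], ["toy", "nmi"], registry)
one_cluster = safe_scores(points, [0, 0, 0], ["toy"], registry)
print(two_clusters, one_cluster)
```

Returning NaN instead of raising keeps a long parameter sweep running when a single partition is degenerate.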
- song_phenotyping.labelling.create_cluster_summary_pdf(master_summary_df, bird, save_path, top_n=None)[source]
Create comprehensive PDF report with cluster visualizations and scores.
- Args:
master_summary_df: DataFrame with all clustering results
bird: Bird identifier
save_path: Directory to save PDF
top_n: Number of top results to include (all if None)
- Returns:
bool: True if successful, False otherwise
- song_phenotyping.labelling.dunn_index(embeddings, labels)[source]
Compute Dunn index for clustering evaluation. Higher values indicate better clustering (compact clusters, well-separated).
- Args:
embeddings: UMAP embeddings array
labels: Cluster labels
- Returns:
float: Dunn index value
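The Dunn index has several variants. The pure-Python sketch below uses a common one: the smallest distance between points in different clusters divided by the largest within-cluster diameter. The Euclidean distance, the handling of the noise label, and the name `dunn_index_sketch` are assumptions of this sketch, not a description of the package's implementation.

```python
import math
from itertools import combinations


def dunn_index_sketch(points, labels, noise_label=-1):
    """Dunn index: smallest between-cluster distance / largest within-cluster diameter.

    Assumptions: Euclidean distance, single-linkage separation, and samples
    labelled `noise_label` are ignored. Returns NaN when the index is undefined.
    """
    clusters = {}
    for point, label in zip(points, labels):
        if label != noise_label:
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        return math.nan

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (
            math.dist(a, b)
            for members in clusters.values()
            for a, b in combinations(members, 2)
        ),
        default=0.0,
    )
    # Smallest distance between points of two different clusters.
    min_separation = min(
        math.dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )
    return min_separation / max_diameter if max_diameter > 0 else math.nan


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(dunn_index_sketch(points, [0, 0, 1, 1]))
```

This version is quadratic in the number of samples, so it suits small embeddings or spot checks rather than large sweeps.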
- song_phenotyping.labelling.identify_optimal_parameters_by_sample_size(analysis_by_size_df)[source]
Identify which parameter combinations work best for different sample size ranges.
- Args:
analysis_by_size_df: DataFrame from analyze_parameter_performance_by_sample_size
- Returns:
dict: Recommendations by sample size bin
- song_phenotyping.labelling.information_criterion(embeddings, labels, type=['aic', 'bic', 'log-likelihood'])[source]
Compute information criteria (AIC, BIC) and log-likelihood for clustering results.
- Args:
embeddings: UMAP embeddings array
labels: Cluster labels
type: List of criteria to compute
- Returns:
tuple: (log_likelihood, aic, bic) - None for uncomputed metrics
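For reference, the standard definitions are AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters, n the number of samples, and L the maximised likelihood. The sketch below applies them; how the package counts k is not documented here, so the full-covariance Gaussian mixture count is an assumption, as are all function names.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def gaussian_mixture_param_count(n_clusters, n_dims):
    """Free parameters of a full-covariance Gaussian mixture (assumed model form)."""
    means = n_clusters * n_dims
    covariances = n_clusters * n_dims * (n_dims + 1) // 2
    mixing_weights = n_clusters - 1
    return means + covariances + mixing_weights


k = gaussian_mixture_param_count(n_clusters=3, n_dims=2)
print(k, aic(-100.0, k), bic(-100.0, k, n_samples=500))
```

BIC penalises extra clusters more strongly than AIC once n exceeds about 7 samples, because ln n is then greater than 2.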
- song_phenotyping.labelling.label_bird(save_path, bird, metrics, replace_labels=False, hdbscan_params=None, top_n_for_pdf=20, generate_cluster_pdf=False, metric_weights=None, max_workers=None, run_name='default')[source]
Run the complete Stage D labelling pipeline for one bird.
Iterates over all UMAP embedding files produced by Stage C, searches a grid of HDBSCAN hyperparameters for each embedding, evaluates the resulting clusterings with metrics, and writes a master CSV summary. When generate_cluster_pdf is True, it also writes a PDF report of the top top_n_for_pdf parameter combinations.
- Parameters:
save_path (str) – Project root directory containing the bird subdirectory.
bird (str) – Bird identifier (e.g. 'or18or24').
metrics (list of str) – Evaluation metrics to compute for each clustering. Supported values: 'silhouette', 'dbi', 'ch', 'dunn', 'nmi', 'aic', 'bic', 'log-likelihood'.
replace_labels (bool, optional) – When True, delete any existing labelling/ directory and re-cluster from scratch. Default is False.
hdbscan_params (list of dict or None, optional) – Custom HDBSCAN parameter grid. Each element is a dict accepted by cluster_embeddings(). Uses DEFAULT_HDBSCAN_GRID when None.
top_n_for_pdf (int, optional) – Number of highest-scoring parameter combinations to include in the PDF report. Default is 20.
generate_cluster_pdf (bool, optional) – When True, write a PDF summary of the top clusterings to results/plots/. Default is False (opt-in; avoids matplotlib PDF overhead on every run).
metric_weights (dict or None, optional) – Per-metric weights for the composite score, e.g. {'silhouette': 2.0, 'dbi': 1.0}. None uses equal weights.
max_workers (int)
run_name (str)
- Returns:
True on success, False if a fatal error occurred.
- Return type:
bool
See also
song_phenotyping.embedding.explore_embedding_parameters_robust: Stage C (produces input).
song_phenotyping.phenotyping.phenotype_bird: Stage E (consumes output).
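A call to label_bird() might look like the sketch below. The keyword names come from the signature documented above; the helper names `stage_d_kwargs` and `run_stage_d`, the grid values, and the weights are illustrative assumptions. The import is deferred so the argument builder can be inspected without the package installed.

```python
def stage_d_kwargs(save_path, bird):
    """Collect keyword arguments for label_bird(); all values are illustrative."""
    return {
        "save_path": save_path,
        "bird": bird,
        "metrics": ["silhouette", "dbi", "ch"],
        # Custom grid as a list of dicts, per the hdbscan_params description.
        "hdbscan_params": [
            {"min_cluster_size": 10, "min_samples": 3},
            {"min_cluster_size": 30, "min_samples": 10},
        ],
        "generate_cluster_pdf": True,  # opt in to the PDF report
        "top_n_for_pdf": 10,
        "metric_weights": {"silhouette": 2.0, "dbi": 1.0, "ch": 1.0},
    }


def run_stage_d(save_path, bird):
    """Call label_bird(); requires the song_phenotyping package to be installed."""
    from song_phenotyping.labelling import label_bird

    ok = label_bird(**stage_d_kwargs(save_path, bird))
    if not ok:
        raise RuntimeError(f"Stage D failed for {bird}")
    return ok
```

With the package installed and Stage C outputs in place, `run_stage_d("/path/to/project", "or18or24")` would run the pipeline for one bird; the project path here is a placeholder.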
- song_phenotyping.labelling.load_labels(label_save_path)[source]
Load cluster labels, hashes, and evaluation scores from HDF5 file.
- Args:
label_save_path: Path to HDF5 file
- Returns:
tuple: (labels, hashes, scores) or (None, None, None) on error
- Parameters:
label_save_path (str)
- song_phenotyping.labelling.load_master_summary(bird_path)[source]
Load master summary DataFrame from CSV.
- Args:
bird_path: Bird root directory
- Returns:
pd.DataFrame: Loaded DataFrame or empty DataFrame if not found
- Parameters:
bird_path (str)
- song_phenotyping.labelling.load_umap_embeddings(embedding_path)[source]
Load embeddings, hashes, and labels from HDF5 file using PyTables.
- Args:
embedding_path: Path to HDF5 embedding file
- Returns:
tuple: (embeddings, hashes, labels) or (None, None, None) on error
- Parameters:
embedding_path (str)
- song_phenotyping.labelling.main(save_path)[source]
Main function to run the clustering pipeline.
- Parameters:
save_path (str)
- Return type:
None
- song_phenotyping.labelling.parse_embedding_filename(filename)[source]
Parse UMAP parameters from embedding filename. Updated to handle new filename format with sample info.
- Args:
filename: Embedding filename (e.g., 'euclidean_10neighbors_0.1dist_full3301.h5')
- Returns:
tuple: (metric, n_neighbors, min_dist, umap_id) or None on error
- Parameters:
filename (str)
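Based only on the example filename above, the parsing could be sketched with a regular expression. The pattern, the treatment of the trailing sample-info segment, and the choice of the filename stem as `umap_id` are all assumptions; the real function may derive these values differently.

```python
import re
from pathlib import Path

# Assumed layout: <metric>_<n>neighbors_<d>dist[_<sample info>].h5
_PATTERN = re.compile(
    r"^(?P<metric>[A-Za-z]+)_(?P<n_neighbors>\d+)neighbors_"
    r"(?P<min_dist>\d*\.?\d+)dist(?:_(?P<sample>[A-Za-z0-9]+))?\.h5$"
)


def parse_embedding_filename_sketch(filename):
    """Return (metric, n_neighbors, min_dist, umap_id), or None if the name does not match."""
    name = Path(filename).name
    match = _PATTERN.match(name)
    if match is None:
        return None
    # Assumption: the identifier is the filename without its extension.
    umap_id = Path(name).stem
    return (
        match["metric"],
        int(match["n_neighbors"]),
        float(match["min_dist"]),
        umap_id,
    )


print(parse_embedding_filename_sketch("euclidean_10neighbors_0.1dist_full3301.h5"))
print(parse_embedding_filename_sketch("notes.txt"))
```

Returning None for unrecognised names mirrors the documented error behaviour and lets callers skip unrelated files in the embedding directory.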
- song_phenotyping.labelling.plot_summary_matrix(summary_df, save_path, metrics, figsize=(12, 8))[source]
Create heatmap visualization of normalized metric scores.
- Args:
summary_df: DataFrame with normalized scores
save_path: Path to save the plot
metrics: List of metrics to include
figsize: Figure size tuple
- Returns:
None
- song_phenotyping.labelling.plot_umap(embeddings, labels, save=None, figsize=(12, 10), title=None)[source]
Create scatter plot of UMAP embeddings colored by cluster labels.
- Args:
embeddings: UMAP embeddings array (N x 2)
labels: Cluster labels
save: Path to save plot (optional)
figsize: Figure size tuple
title: Plot title (optional)
- Returns:
None
- Parameters:
save (str)
- song_phenotyping.labelling.remove_directory(path)[source]
Safely remove directory and all contents.
- Args:
path: Directory path to remove
- Returns:
bool: True if successful, False otherwise
- Parameters:
path (str)
- song_phenotyping.labelling.reorder_columns(master_summary_df, metrics)[source]
Reorder DataFrame columns for better readability and analysis.
- Args:
master_summary_df: DataFrame to reorder
metrics: List of metrics used
- Returns:
pd.DataFrame: Reordered DataFrame
- Parameters:
master_summary_df (DataFrame)
metrics (list)
- song_phenotyping.labelling.save_cross_bird_analysis(aggregated_df, analysis_by_size, save_path, filename_prefix='cross_bird')[source]
Save cross-bird analysis results to CSV files.
- Args:
aggregated_df: Aggregated raw scores DataFrame
analysis_by_size: Sample size analysis DataFrame
save_path: Directory to save files
filename_prefix: Prefix for output filenames
- Returns:
dict: Paths to saved files
- song_phenotyping.labelling.save_labels(label_save_path, labels, hashes, scores)[source]
Save cluster labels, hashes, and evaluation scores to HDF5 file.
- Args:
label_save_path: Path where to save the HDF5 file
labels: Cluster labels array
hashes: Sample hash identifiers
scores: Dictionary of evaluation scores
- Returns:
bool: True if successful, False otherwise
- Parameters:
label_save_path (str)
- song_phenotyping.labelling.save_master_summary(master_summary_df, bird_path)[source]
Save master summary DataFrame to CSV under the results/ directory.
- Args:
master_summary_df: DataFrame to save
bird_path: Bird root directory
- Returns:
bool: True if successful, False otherwise
- Parameters:
bird_path (str)
- song_phenotyping.labelling.score_cluster_penalty(n, target=20, decay_rate=0.1)[source]
Compute penalty for cluster count deviation from target.
- Args:
n: Number of clusters
target: Target number of clusters
decay_rate: Exponential decay rate for penalty
- Returns:
float: Penalty multiplier (1.0 = no penalty, <1.0 = penalty applied)
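The exact penalty formula is not given. One form consistent with the description (an exponential decay rate, a multiplier of 1.0 at the target, and values below 1.0 elsewhere) is exp(−decay_rate · |n − target|). The sketch below implements that assumed form under a hypothetical name.

```python
import math


def cluster_penalty_sketch(n, target=20, decay_rate=0.1):
    """Assumed form: exponential decay in the absolute distance from the target."""
    return math.exp(-decay_rate * abs(n - target))


for n in (20, 25, 30, 10):
    print(n, round(cluster_penalty_sketch(n), 4))
```

Under this assumption the penalty is symmetric, so 10 and 30 clusters are penalised equally for a target of 20.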
- song_phenotyping.labelling.search_cluster_params(embeddings, hashes, algorithm, umap_id, directory_path, figure_path, candidate_params=None, true_labels=None, metrics=None, sample_size=None, use_parallel=True, max_workers=None)[source]
Search through candidate clustering parameters and evaluate performance.
- Args:
embeddings: UMAP embeddings array
hashes: Sample hash identifiers
algorithm: Clustering algorithm name
umap_id: UMAP parameter identifier
directory_path: Base directory for saving results
figure_path: Directory for saving plots
candidate_params: List of parameter dictionaries to test
true_labels: Ground truth labels for evaluation
metrics: List of evaluation metrics to compute
sample_size: Total number of samples (for efficiency, pass if known)
use_parallel: Whether to use parallel processing (default: True)
max_workers: Maximum number of parallel workers (default: CPU count)
- Returns:
pd.DataFrame: Summary of results for all parameter combinations
- song_phenotyping.labelling.select_best_params(summary_df, selection_method='composite', top_n=1)[source]
Select best parameter combinations based on specified method.
- Args:
summary_df: DataFrame with scores and composite scores
selection_method: Method for selection ('composite' or specific metric name)
top_n: Number of top parameter combinations to return
- Returns:
pd.DataFrame: Top N parameter combinations sorted by selection criteria