Embedding (Stage C)
UMAP dimensionality reduction for syllable spectrograms (Stage C).
Takes the flattened spectrogram feature matrices produced by Stage B and
reduces them to 2-D embeddings using UMAP. A grid of (min_dist,
n_neighbors) combinations is explored; each embedding is saved as an HDF5
file and the corresponding UMAP model is pickled for later projection of
new data.
Parallel processing is used by default, with adaptive worker-count and memory-limit logic to avoid OOM failures on large datasets.
Public API
UMAPParams – UMAP hyperparameter dataclass
explore_embedding_parameters_robust() – Stage C entry point
load_flattened_specs() – load Stage B output for downstream use
- class song_phenotyping.embedding.UMAPParams(n_neighbors=20, min_dist=0.5, metric='euclidean', n_components=2, n_epochs=10)[source]
Bases: object
Hyperparameters for a single UMAP embedding run.
- Parameters:
n_neighbors (int, optional) – Number of neighbouring points used in the local approximation of the manifold structure. Larger values produce a more global view; smaller values preserve finer local structure. Default is 20.
min_dist (float, optional) – Minimum distance between points in the 2-D embedding. Controls how tightly UMAP packs points together. Default is 0.5.
metric (str, optional) – Distance metric used in the high-dimensional input space. Default is 'euclidean'.
n_components (int, optional) – Dimensionality of the embedding. Default is 2.
n_epochs (int, optional) – Number of training epochs. Increase for higher-quality embeddings at the cost of compute time. Default is 10.
- Raises:
ValueError – If any parameter is outside its valid range.
Examples
>>> params = UMAPParams(n_neighbors=10, min_dist=0.1)
>>> params.to_dict()
{'n_neighbors': 10, 'min_dist': 0.1, 'metric': 'euclidean', 'n_components': 2, 'n_epochs': 10}
- classmethod from_dict(data)[source]
Construct a UMAPParams from a dictionary.
- Parameters:
- Return type:
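A minimal sketch of how a hyperparameter dataclass of this kind could validate its fields and round-trip through a dictionary. The class name UMAPParamsSketch and the specific validity checks are assumptions for illustration; the ranges actually enforced by UMAPParams are not stated in this page.

```python
from dataclasses import asdict, dataclass


@dataclass
class UMAPParamsSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    n_neighbors: int = 20
    min_dist: float = 0.5
    metric: str = "euclidean"
    n_components: int = 2
    n_epochs: int = 10

    def __post_init__(self):
        # Assumed validity checks; the real class may use different ranges.
        if self.n_neighbors < 2:
            raise ValueError("n_neighbors must be at least 2")
        if not 0.0 <= self.min_dist <= 1.0:
            raise ValueError("min_dist must lie in [0, 1]")
        if self.n_components < 1 or self.n_epochs < 1:
            raise ValueError("n_components and n_epochs must be positive")

    def to_dict(self):
        # Field name -> value, in declaration order.
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        # Missing keys fall back to the dataclass defaults.
        return cls(**data)


params = UMAPParamsSketch(n_neighbors=10, min_dist=0.1)
restored = UMAPParamsSketch.from_dict(params.to_dict())
```

Validating in __post_init__ means that objects built through from_dict() pass the same checks as objects built directly.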
- song_phenotyping.embedding.calculate_adaptive_workers_improved(n_samples, n_features, max_n_neighbors, memory_per_worker_gb=None, max_workers=None)[source]
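Calculate how many parallel workers to use from the dataset size (n_samples, n_features), the largest n_neighbors value in the grid, and the optional memory_per_worker_gb and max_workers limits.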
- song_phenotyping.embedding.calculate_safe_batch_size(available_memory_gb, n_features, max_n_neighbors, safety_factor=0.7)[source]
Calculate maximum safe batch size for UMAP given memory constraints.
- song_phenotyping.embedding.check_embedding_compatibility(embedding_path, current_n_samples, current_hashes, overwrite=False)[source]
Check if existing embedding is compatible with current data.
- Returns:
True if compatible or should proceed, False if should skip
- song_phenotyping.embedding.compare_umap_embeddings_plot(successful_params, min_dists, n_neighbors, paths, save_path, bird='', processing_metadata=None)[source]
Create a comparison plot of the embedding grid, loading each embedding from disk on demand. Handles embedding filenames that include sample-size information.
- song_phenotyping.embedding.complex_spectrogram_distance(spec1, spec2)[source]
L2 distance between two complex spectrograms.
- Parameters:
spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
- Returns:
Frobenius norm of the element-wise difference.
- Return type:
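The Frobenius norm described above can be sketched in a few lines of NumPy. The function name complex_l2_distance is a hypothetical stand-in, not the package's implementation.

```python
import numpy as np


def complex_l2_distance(spec1, spec2):
    """Frobenius norm of the element-wise difference (illustrative stand-in)."""
    diff = np.asarray(spec1) - np.asarray(spec2)
    # Square the complex magnitudes, sum them, and take the square root.
    return float(np.sqrt(np.sum(np.abs(diff) ** 2)))


a = np.array([[1 + 1j, 0], [0, 2j]])
b = np.zeros((2, 2), dtype=complex)
distance = complex_l2_distance(a, b)  # sqrt(|1+1j|^2 + |2j|^2) = sqrt(6)
```

Because the difference is taken on the complex values, both magnitude and phase mismatches contribute to the distance.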
- song_phenotyping.embedding.compute_and_save_umap_memory_aware(samples, labels, hashes, params, model_path, embedding_path, save_model=False, overwrite=False, processing_metadata=None)[source]
Memory-aware UMAP computation with fallback strategies.
- song_phenotyping.embedding.compute_embedding_grid_parallel_robust(samples, labels, hashes, min_dists, n_neighbors, paths, plot=True, bird='', max_workers=None, overwrite=False, processing_metadata=None)[source]
Compute the embedding grid in parallel, with memory monitoring and graceful degradation.
- song_phenotyping.embedding.compute_single_umap_worker_safe(args)[source]
UMAP worker for a single parameter combination, with memory monitoring and fallback.
- song_phenotyping.embedding.estimate_umap_memory_usage(n_samples, n_features, n_neighbors)[source]
Estimate UMAP memory usage from n_samples, n_features, and n_neighbors, based on algorithm internals.
Returns estimated peak memory usage in GB.
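For orientation only, a back-of-envelope estimate of this kind might look like the sketch below. The memory model (float32 input matrix plus a k-nearest-neighbour graph, scaled by an overhead factor of 2) and the function name are assumptions, not the formula used by estimate_umap_memory_usage().

```python
def rough_umap_memory_gb(n_samples, n_features, n_neighbors, overhead=2.0):
    """Back-of-envelope peak memory in GB (assumed model, not the library's)."""
    bytes_data = n_samples * n_features * 4        # float32 input matrix
    bytes_knn = n_samples * n_neighbors * (4 + 4)  # int32 indices + float32 distances
    return overhead * (bytes_data + bytes_knn) / 1024**3


# 100,000 syllables with 4,096 features each and n_neighbors=100
# comes to roughly 3.2 GB under these assumptions.
estimate = rough_umap_memory_gb(n_samples=100_000, n_features=4096, n_neighbors=100)
```

Under this model the input matrix dominates, so halving the number of samples roughly halves the estimate.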
- song_phenotyping.embedding.explore_embedding_parameters_robust(save_path, bird, min_dists=None, n_neighbors_list=None, use_parallel=True, overwrite=False, max_samples=None, memory_per_worker_gb=None, auto_memory_management=True, subsample_seed=42, run_name='default', max_workers=None)[source]
Compute UMAP embeddings over a parameter grid for one bird (Stage C entry point).
Loads Stage B flattened spectrograms, optionally subsamples them to fit available memory, and runs UMAP for every (min_dist, n_neighbors) combination. Each embedding is saved to <save_path>/<bird>/syllable_data/embeddings/<params>.h5 and the corresponding model is pickled to <save_path>/<bird>/syllable_data/models/.
- Parameters:
save_path (str) – Project root directory containing bird subdirectories.
bird (str) – Bird identifier (e.g. 'or18or24').
min_dists (list of float, optional) – min_dist values to explore. Defaults to [0.01, 0.05, 0.1, 0.3, 0.5].
n_neighbors_list (list of int, optional) – n_neighbors values to explore. Defaults to [5, 10, 20, 50, 100].
use_parallel (bool, optional) – Run parameter combinations in parallel using ProcessPoolExecutor. Automatically falls back to sequential if only one worker is safe. Default is True.
overwrite (bool, optional) – Recompute embeddings that already exist on disk. Default is False.
max_samples (int, optional) – Hard cap on syllables passed to UMAP. None uses the memory-aware safe batch size.
memory_per_worker_gb (float, optional) – Reserved memory per parallel worker in GB. None uses automatic estimation.
auto_memory_management (bool, optional) – Dynamically cap max_samples based on available RAM. Default is True.
subsample_seed (int, optional) – Random seed for reproducible subsampling. Default is 42.
run_name (str)
max_workers (int | None)
- Returns:
True if at least one embedding was computed successfully; False otherwise.
- Return type:
See also
song_phenotyping.flattening.flatten_bird_spectrograms – Stage B (produces input).
song_phenotyping.labelling.label_bird – Stage D (consumes output).
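To make the size of the default parameter grid concrete, the sketch below enumerates the documented default min_dists and n_neighbors_list values. It only illustrates the grid; it does not call the package.

```python
from itertools import product

# Default grids as documented for explore_embedding_parameters_robust().
min_dists = [0.01, 0.05, 0.1, 0.3, 0.5]
n_neighbors_list = [5, 10, 20, 50, 100]

# One UMAP run per (min_dist, n_neighbors) pair.
grid = list(product(min_dists, n_neighbors_list))
print(len(grid))  # 25
```

With the defaults, a single call for one bird therefore covers 25 parameter combinations. Passing shorter lists for min_dists or n_neighbors_list reduces the run time proportionally.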
- song_phenotyping.embedding.generate_embedding_paths(paths, params, n_samples, was_subsampled, subsample_seed=None)[source]
Generate output paths that include sample-size information.
- song_phenotyping.embedding.group_delay_distance(spec1, spec2)[source]
L2 distance between the group-delay representations of two spectrograms.
Group delay is computed as -d(phase)/d(frequency) along the frequency axis (axis 0) after phase unwrapping.
- Parameters:
spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
- Returns:
L2 norm of the group-delay difference.
- Return type:
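A minimal sketch of the group-delay computation described above. Using a first difference (np.diff) for the derivative is an assumption, since the package may use a different finite-difference scheme, and the function names are hypothetical.

```python
import numpy as np


def group_delay(spec):
    """-d(phase)/d(frequency) along axis 0, via a first difference (assumed)."""
    phase = np.unwrap(np.angle(spec), axis=0)
    return -np.diff(phase, axis=0)


def group_delay_l2(spec1, spec2):
    """L2 norm of the difference between the two group-delay arrays."""
    return float(np.linalg.norm(group_delay(spec1) - group_delay(spec2)))


# Two spectrograms whose phase falls linearly with frequency bin k:
# phase = -delay * k, so the group delay equals `delay` everywhere.
k = np.arange(8)[:, None] * np.ones((1, 5))
spec_a = np.exp(-1j * 0.5 * k)
spec_b = np.exp(-1j * 0.2 * k)
d = group_delay_l2(spec_a, spec_b)  # 0.3 * sqrt(7 * 5)
```

Phase unwrapping matters here: without it, 2π jumps in the wrapped phase would show up as large spurious delays.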
- song_phenotyping.embedding.inspect_existing_embeddings(bird_path)[source]
Utility to inspect which embeddings already exist and their metadata.
- Parameters:
bird_path (str)
- Return type:
None
- song_phenotyping.embedding.instantaneous_freq_distance(spec1, spec2)[source]
L2 distance between the instantaneous-frequency representations of two spectrograms.
Instantaneous frequency is computed as d(phase)/d(time) along the time axis (axis 1) after phase unwrapping.
- Parameters:
spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
- Returns:
L2 norm of the instantaneous-frequency difference.
- Return type:
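One property that follows from the definition above is that a constant phase offset does not change the instantaneous frequency, so it contributes nothing to this distance. The sketch below checks that numerically; the first-difference derivative and the function name are assumptions.

```python
import numpy as np


def instantaneous_frequency(spec):
    """d(phase)/d(time) along axis 1, via a first difference (assumed)."""
    phase = np.unwrap(np.angle(spec), axis=1)
    return np.diff(phase, axis=1)


rng = np.random.default_rng(0)
spec = rng.normal(size=(6, 10)) + 1j * rng.normal(size=(6, 10))
shifted = spec * np.exp(1j * 0.7)  # same signal with a constant phase offset

# The offset cancels in the time derivative, so the distance is numerically zero.
dist = np.linalg.norm(instantaneous_frequency(spec) - instantaneous_frequency(shifted))
```

This makes the metric insensitive to where in the phase cycle a syllable happens to start, in contrast to a plain complex L2 distance.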
- song_phenotyping.embedding.load_embedding_from_file(embedding_path)[source]
Load embeddings from an HDF5 file.
- song_phenotyping.embedding.load_flattened_specs(paths_to_specs)[source]
Load all Stage B flattened spectrogram files from a directory.
Concatenates all flattened_*.h5 files found in paths_to_specs into a single set of arrays. Files that cannot be read are skipped with a warning.
- Parameters:
paths_to_specs (str) – Path to the syllable_data/flattened/ directory produced by Stage B.
- Returns:
flattened_specs (numpy.ndarray, shape (n_features, n_syllables)) – Concatenated feature matrix.
labels (numpy.ndarray, shape (n_syllables,)) – Syllable labels decoded to str.
position_idxs (numpy.ndarray, shape (n_syllables,)) – Position indices within the original recording.
hashes (numpy.ndarray, shape (n_syllables,)) – Per-syllable hash strings for cross-stage tracking.
song_file_ids (numpy.ndarray, shape (n_syllables,)) – Integer index of the source HDF5 file for each syllable. All syllables from the same song file share the same value, enabling song-level subsampling.
- Raises:
ValueError – If no flattened files are found or no valid syllables can be loaded.
- Return type:
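The returned feature matrix stores syllables as columns, which is worth keeping in mind downstream. The sketch below uses small in-memory arrays as hypothetical stand-ins for the contents of two flattened_*.h5 files to show how the concatenation and song_file_ids fit together; it does not read HDF5.

```python
import numpy as np

# Hypothetical per-file arrays standing in for two flattened_*.h5 files:
# 4 features each, with 3 and 2 syllables respectively.
per_file_specs = [np.zeros((4, 3)), np.ones((4, 2))]

# Columns are syllables, so files are joined along axis 1.
flattened_specs = np.concatenate(per_file_specs, axis=1)

# One integer per syllable recording which file it came from.
song_file_ids = np.concatenate(
    [np.full(s.shape[1], i) for i, s in enumerate(per_file_specs)]
)

# scikit-learn-style estimators such as umap.UMAP expect rows = samples.
samples = flattened_specs.T
```

Here flattened_specs has shape (4, 5), song_file_ids is [0, 0, 0, 1, 1], and samples has shape (5, 4).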
- song_phenotyping.embedding.phase_aware_spectrogram_distance(spec1, spec2)[source]
Distance metric combining magnitude cosine similarity and phase coherence.
Computes 1 - (0.8 * mag_sim + 0.2 * phase_coherence) so that more similar spectrograms yield smaller distances.
- Parameters:
spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
- Returns:
Distance in [0, 1]; lower = more similar.
- Return type:
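A sketch of the weighted combination above. The 0.8 and 0.2 weights come from the formula; the definitions of mag_sim (cosine similarity of flattened magnitudes) and phase_coherence (length of the mean unit phasor of the phase difference) are assumptions, since this page does not define them, and the function name is hypothetical.

```python
import numpy as np


def phase_aware_distance(spec1, spec2, w_mag=0.8, w_phase=0.2):
    mag1, mag2 = np.abs(spec1).ravel(), np.abs(spec2).ravel()
    # Cosine similarity of magnitudes; non-negative inputs keep it in [0, 1].
    denom = np.linalg.norm(mag1) * np.linalg.norm(mag2)
    mag_sim = float(mag1 @ mag2 / denom) if denom > 0 else 0.0
    # Assumed coherence: length of the mean unit phasor of the phase difference.
    phase_diff = np.angle(spec1) - np.angle(spec2)
    phase_coherence = float(np.abs(np.mean(np.exp(1j * phase_diff))))
    return 1.0 - (w_mag * mag_sim + w_phase * phase_coherence)


rng = np.random.default_rng(1)
spec = rng.normal(size=(5, 7)) + 1j * rng.normal(size=(5, 7))
other = rng.normal(size=(5, 7)) + 1j * rng.normal(size=(5, 7))

d_same = phase_aware_distance(spec, spec)    # numerically zero
d_other = phase_aware_distance(spec, other)  # strictly larger
```

Both similarity terms lie in [0, 1] under these definitions, which is what keeps the distance in the documented [0, 1] range.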
- song_phenotyping.embedding.save_umap_embeddings(embedding_path, embeddings, hashes, labels=None, metadata=None)[source]
Write a UMAP embedding array and associated metadata to HDF5.
- Parameters:
embedding_path (str) – Destination .h5 file path.
embeddings (numpy.ndarray, shape (n_syllables, n_components)) – UMAP coordinates.
hashes (list of str) – Per-syllable hash strings (same order as rows of embeddings).
labels (list, optional) – Syllable labels. Stored under the labels node if provided.
metadata (dict, optional) – Arbitrary processing metadata stored as HDF5 attributes on the root node.
- Returns:
True on success, False if an error occurred.
- Return type:
- song_phenotyping.embedding.save_umap_model(model_path, umap_model, params)[source]
Pickle a fitted UMAP model and its parameters to disk.
- Parameters:
model_path (str) – Destination .pkl file path.
umap_model (umap.UMAP) – A fitted UMAP model instance.
params (UMAPParams) – The hyperparameters used to fit the model.
- Returns:
True on success, False if an exception was raised.
- Return type:
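The pickling round trip can be sketched with the standard library alone. The payload layout (a dictionary with "model" and "params" keys) is an assumption about the on-disk format, and a plain dictionary stands in for the fitted umap.UMAP instance.

```python
import os
import pickle
import tempfile

# Stand-ins: any picklable object works in place of a fitted umap.UMAP model.
fitted_model = {"note": "placeholder for a fitted umap.UMAP instance"}
params = {"n_neighbors": 10, "min_dist": 0.1}

# Assumed payload layout: model and parameters bundled in one dictionary.
payload = {"model": fitted_model, "params": params}

model_path = os.path.join(tempfile.mkdtemp(), "umap_model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(payload, fh)

# Later: reload the bundle to project new data with the stored model.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)
```

Storing the hyperparameters next to the model lets a later session confirm which grid point a pickled model belongs to before projecting new data with it. As with any pickle file, only load files from a trusted source.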
- song_phenotyping.embedding.subsample_by_song(specs, labels, position_idxs, hashes, song_file_ids, max_samples, random_seed=42)[source]
Subsample by selecting whole songs (files) until max_samples is reached.
Sampling whole songs preserves within-song syllable-type distributions far better than uniform random syllable sampling. Songs are shuffled with random_seed, then greedily added in that order until the next song would push the total over max_samples. If max_samples is already satisfied by the data, the arrays are returned unchanged.
- Parameters:
specs (numpy.ndarray, shape (n_features, n_syllables)) – Feature matrix (columns = syllables).
labels (numpy.ndarray, shape (n_syllables,)) – Per-syllable annotation arrays.
position_idxs (numpy.ndarray, shape (n_syllables,)) – Per-syllable annotation arrays.
hashes (numpy.ndarray, shape (n_syllables,)) – Per-syllable annotation arrays.
song_file_ids (numpy.ndarray, shape (n_syllables,)) – Integer file-index for each syllable (from load_flattened_specs()).
max_samples (int) – Target syllable budget.
random_seed (int, optional) – Seed for reproducible song shuffling. Default 42.
- Returns:
specs, labels, position_idxs, hashes – Subsampled arrays containing only syllables from selected songs, sorted in their original chronological order.
- Return type:
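The greedy whole-song selection described above can be sketched as follows. The helper name is hypothetical, and the use of np.random.default_rng for shuffling is an assumption, so the songs selected for a given seed may differ from those chosen by subsample_by_song().

```python
import numpy as np


def greedy_song_subsample(song_file_ids, max_samples, random_seed=42):
    """Return a boolean mask selecting whole songs up to max_samples syllables."""
    song_file_ids = np.asarray(song_file_ids)
    if song_file_ids.size <= max_samples:
        return np.ones(song_file_ids.size, dtype=bool)  # nothing to drop

    songs, counts = np.unique(song_file_ids, return_counts=True)
    order = np.random.default_rng(random_seed).permutation(len(songs))

    chosen, total = [], 0
    for idx in order:
        if total + counts[idx] > max_samples:
            break  # the next song would exceed the budget
        chosen.append(songs[idx])
        total += counts[idx]
    return np.isin(song_file_ids, chosen)


ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])
mask = greedy_song_subsample(ids, max_samples=6)
specs = np.arange(40).reshape(4, 10)  # (n_features, n_syllables)
specs_sub = specs[:, mask]            # columns keep their original order
```

Applying one boolean mask to every per-syllable array keeps specs, labels, position_idxs, and hashes aligned and preserves the original ordering.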
- song_phenotyping.embedding.subsample_data(specs, labels, position_idxs, hashes, max_samples, random_seed=42)[source]
Randomly subsample the data if it exceeds max_samples.
- Parameters:
specs, labels, position_idxs, hashes – Original data arrays.
max_samples – Maximum number of samples to keep.
random_seed – Random seed for reproducibility.
- Returns:
Subsampled versions of the input arrays.
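A minimal sketch of uniform random subsampling without replacement. The helper name, the (n_features, n_syllables) layout of specs, and the choice to sort the selected indices are assumptions; subsample_data() may order or draw its samples differently.

```python
import numpy as np


def uniform_subsample_indices(n_syllables, max_samples, random_seed=42):
    """Indices of at most max_samples syllables, drawn uniformly at random."""
    if n_syllables <= max_samples:
        return np.arange(n_syllables)
    rng = np.random.default_rng(random_seed)
    # Draw without replacement, then sort so the original order is preserved.
    return np.sort(rng.choice(n_syllables, size=max_samples, replace=False))


specs = np.arange(30).reshape(3, 10)  # (n_features, n_syllables), assumed layout
labels = np.array(list("abcdefghij"))
keep = uniform_subsample_indices(specs.shape[1], max_samples=4)
specs_sub, labels_sub = specs[:, keep], labels[keep]
```

Unlike whole-song selection, this draws syllables independently, so a single song may contribute only some of its syllables to the subsample.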