Embedding (Stage C)

UMAP dimensionality reduction for syllable spectrograms (Stage C).

Takes the flattened spectrogram feature matrices produced by Stage B and reduces them to 2-D embeddings using UMAP. A grid of (min_dist, n_neighbors) combinations is explored; each embedding is saved as an HDF5 file and the corresponding UMAP model is pickled for later projection of new data.

Parallel processing is used by default, with adaptive worker-count and memory-limit logic to avoid OOM failures on large datasets.
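
As a rough illustration of the grid described above, the following sketch enumerates every (min_dist, n_neighbors) pair from the default lists documented for explore_embedding_parameters_robust. The variable names are illustrative only and are not part of the module.

```python
from itertools import product

# Default grids as documented for explore_embedding_parameters_robust.
min_dists = [0.01, 0.05, 0.1, 0.3, 0.5]
n_neighbors_list = [5, 10, 20, 50, 100]

# One UMAP run is performed per (min_dist, n_neighbors) combination.
grid = list(product(min_dists, n_neighbors_list))

for min_dist, n_neighbors in grid[:3]:
    print(f"min_dist={min_dist}, n_neighbors={n_neighbors}")

print(f"{len(grid)} embeddings in the full grid")
```

With the defaults, the full grid contains 25 combinations, which is why the worker-count and memory logic matters on large datasets.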

Public API

class song_phenotyping.embedding.UMAPParams(n_neighbors=20, min_dist=0.5, metric='euclidean', n_components=2, n_epochs=10)[source]

Bases: object

Hyperparameters for a single UMAP embedding run.

Parameters:
  • n_neighbors (int, optional) – Number of neighbouring points used in the local approximation of the manifold structure. Larger values produce a more global view; smaller values preserve finer local structure. Default is 20.

  • min_dist (float, optional) – Minimum distance between points in the 2-D embedding. Controls how tightly UMAP packs points together. Default is 0.5.

  • metric (str, optional) – Distance metric used in the high-dimensional input space. Default is 'euclidean'.

  • n_components (int, optional) – Dimensionality of the embedding. Default is 2.

  • n_epochs (int, optional) – Number of training epochs. Increase for higher-quality embeddings at the cost of compute time. Default is 10.

Raises:

ValueError – If any parameter is outside its valid range.

Examples

>>> params = UMAPParams(n_neighbors=10, min_dist=0.1)
>>> params.to_dict()
{'n_neighbors': 10, 'min_dist': 0.1, 'metric': 'euclidean', 'n_components': 2, 'n_epochs': 10}
classmethod from_dict(data)[source]

Construct a UMAPParams from a dictionary.

Parameters:

data (dict) – Must contain the keys produced by to_dict().

Return type:

UMAPParams

metric: str = 'euclidean'
min_dist: float = 0.5
n_components: int = 2
n_epochs: int = 10
n_neighbors: int = 20
to_dict()[source]

Return all parameters as a plain dictionary.

Returns:

Keys: n_neighbors, min_dist, metric, n_components, n_epochs.

Return type:

dict

validate_params()[source]

Raise ValueError if any parameter is out of range.
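
The behaviour documented for this class can be approximated with a small dataclass. This is a minimal sketch, not the library's implementation: the class name UMAPParamsSketch is hypothetical, and the valid ranges checked in validate_params are assumptions, since the actual bounds are not listed here.

```python
from dataclasses import asdict, dataclass


@dataclass
class UMAPParamsSketch:
    """Illustrative stand-in for UMAPParams; not the library class."""

    n_neighbors: int = 20
    min_dist: float = 0.5
    metric: str = "euclidean"
    n_components: int = 2
    n_epochs: int = 10

    def __post_init__(self):
        # Validate on construction so bad values fail early.
        self.validate_params()

    def validate_params(self):
        # Assumed ranges; the real checks may differ.
        if self.n_neighbors < 2:
            raise ValueError("n_neighbors must be >= 2")
        if self.min_dist < 0.0:
            raise ValueError("min_dist must be >= 0")
        if self.n_components < 1:
            raise ValueError("n_components must be >= 1")
        if self.n_epochs < 1:
            raise ValueError("n_epochs must be >= 1")

    def to_dict(self):
        # Field order follows the declaration order above.
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)


params = UMAPParamsSketch(n_neighbors=10, min_dist=0.1)
restored = UMAPParamsSketch.from_dict(params.to_dict())
```

A to_dict()/from_dict() round trip of this kind is convenient for storing the hyperparameters next to a saved model or embedding.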

song_phenotyping.embedding.calculate_adaptive_workers_improved(n_samples, n_features, max_n_neighbors, memory_per_worker_gb=None, max_workers=None)[source]

Calculate how many parallel workers to use, based on an estimate of the memory each UMAP worker needs.

Parameters:
  • n_samples (int)

  • n_features (int)

  • max_n_neighbors (int)

  • memory_per_worker_gb (float | None)

  • max_workers (int | None)

Return type:

int
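
The general idea of capping the worker count by both CPU count and memory can be sketched as follows. This is a hypothetical rule, not the module's actual logic: the function name and the available_gb argument are assumptions made for illustration.

```python
import os


def adaptive_workers_sketch(available_gb, memory_per_worker_gb, max_workers=None):
    """Hypothetical worker-count rule; not the module's actual logic."""
    cpu_limit = os.cpu_count() or 1
    # How many workers fit in memory if each needs memory_per_worker_gb.
    memory_limit = int(available_gb // memory_per_worker_gb)
    workers = min(cpu_limit, memory_limit)
    if max_workers is not None:
        workers = min(workers, max_workers)
    # Keep at least one worker so the run can proceed sequentially.
    return max(1, workers)


print(adaptive_workers_sketch(available_gb=16.0, memory_per_worker_gb=4.0, max_workers=2))
```

Returning at least one worker means a memory-constrained machine degrades to sequential processing rather than failing outright.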

song_phenotyping.embedding.calculate_safe_batch_size(available_memory_gb, n_features, max_n_neighbors, safety_factor=0.7)[source]

Calculate maximum safe batch size for UMAP given memory constraints.

Parameters:
  • available_memory_gb (float)

  • n_features (int)

  • max_n_neighbors (int)

  • safety_factor (float)

Return type:

int
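
To show how a safety factor can turn available memory into a sample budget, here is a minimal sketch. The per-sample cost model (float32 features plus one index and one distance per neighbour) is an assumption made for illustration; the real function may account for memory differently.

```python
def safe_batch_size_sketch(available_memory_gb, n_features, max_n_neighbors,
                           safety_factor=0.7):
    """Hypothetical estimate of how many samples fit in memory."""
    bytes_per_value = 4  # assumes float32 features and int32 indices
    # Assumed per-sample cost: the feature row plus an index and a
    # distance entry for each neighbour in the k-NN graph.
    bytes_per_sample = bytes_per_value * (n_features + 2 * max_n_neighbors)
    # Only a fraction of the available memory is treated as usable.
    usable_bytes = available_memory_gb * safety_factor * 1024 ** 3
    return int(usable_bytes // bytes_per_sample)


print(safe_batch_size_sketch(8.0, n_features=4096, max_n_neighbors=100))
```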

song_phenotyping.embedding.check_embedding_compatibility(embedding_path, current_n_samples, current_hashes, overwrite=False)[source]

Check whether an existing embedding is compatible with the current data.

Parameters:
  • embedding_path (str)

  • current_n_samples (int)

  • current_hashes (list)

  • overwrite (bool)

Returns:

True if compatible or should proceed, False if should skip.

Return type:

bool

song_phenotyping.embedding.compare_umap_embeddings_plot(successful_params, min_dists, n_neighbors, paths, save_path, bird='', processing_metadata=None)[source]

Create a comparison plot, loading each embedding from disk on demand. Supports the filename structure that includes sample information.

song_phenotyping.embedding.complex_spectrogram_distance(spec1, spec2)[source]

L2 distance between two complex spectrograms.

Parameters:
  • spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

  • spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

Returns:

Frobenius norm of the element-wise difference.

Return type:

float
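
The Frobenius norm of the element-wise difference can be written out in a few lines. The sketch below uses plain nested lists of complex numbers to stay dependency-free; the function name is hypothetical, and with NumPy arrays the same quantity is np.linalg.norm(spec1 - spec2).

```python
import math


def complex_l2_distance_sketch(spec1, spec2):
    """Frobenius norm of spec1 - spec2 for nested lists of complex numbers."""
    if len(spec1) != len(spec2):
        raise ValueError("spectrograms must have the same shape")
    total = 0.0
    for row1, row2 in zip(spec1, spec2):
        if len(row1) != len(row2):
            raise ValueError("spectrograms must have the same shape")
        for a, b in zip(row1, row2):
            # |a - b|^2 accounts for both real and imaginary parts.
            total += abs(a - b) ** 2
    return math.sqrt(total)


a = [[1 + 1j, 0j], [0j, 2 + 0j]]
b = [[1 + 0j, 0j], [0j, 0j]]
distance = complex_l2_distance_sketch(a, b)
print(distance)
```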

song_phenotyping.embedding.compute_and_save_umap_memory_aware(samples, labels, hashes, params, model_path, embedding_path, save_model=False, overwrite=False, processing_metadata=None)[source]

Memory-aware UMAP computation with fallback strategies.

Return type:

Tuple[ndarray | None, umap.UMAP | None, bool]

song_phenotyping.embedding.compute_embedding_grid_parallel_robust(samples, labels, hashes, min_dists, n_neighbors, paths, plot=True, bird='', max_workers=None, overwrite=False, processing_metadata=None)[source]

Compute the embedding grid in parallel, with memory monitoring and graceful degradation.

song_phenotyping.embedding.compute_single_umap_worker_safe(args)[source]

Worker function that computes a single UMAP embedding, with memory monitoring and fallback.

song_phenotyping.embedding.estimate_umap_memory_usage(n_samples, n_features, n_neighbors)[source]

Estimate peak UMAP memory usage from the algorithm's internal data structures.

Returns the estimated peak memory usage in GB.

Parameters:
  • n_samples (int)

  • n_features (int)

  • n_neighbors (int)

Return type:

float

song_phenotyping.embedding.explore_embedding_parameters_robust(save_path, bird, min_dists=None, n_neighbors_list=None, use_parallel=True, overwrite=False, max_samples=None, memory_per_worker_gb=None, auto_memory_management=True, subsample_seed=42, run_name='default', max_workers=None)[source]

Compute UMAP embeddings over a parameter grid for one bird (Stage C entry point).

Loads Stage B flattened spectrograms, optionally subsamples them to fit available memory, and runs UMAP for every (min_dist, n_neighbors) combination. Each embedding is saved to <save_path>/<bird>/syllable_data/embeddings/<params>.h5 and the corresponding model is pickled to <save_path>/<bird>/syllable_data/models/.

Parameters:
  • save_path (str) – Project root directory containing bird subdirectories.

  • bird (str) – Bird identifier (e.g. 'or18or24').

  • min_dists (list of float, optional) – min_dist values to explore. Defaults to [0.01, 0.05, 0.1, 0.3, 0.5].

  • n_neighbors_list (list of int, optional) – n_neighbors values to explore. Defaults to [5, 10, 20, 50, 100].

  • use_parallel (bool, optional) – Run parameter combinations in parallel using ProcessPoolExecutor. Automatically falls back to sequential if only one worker is safe. Default is True.

  • overwrite (bool, optional) – Recompute embeddings that already exist on disk. Default is False.

  • max_samples (int, optional) – Hard cap on syllables passed to UMAP. None uses the memory-aware safe batch size.

  • memory_per_worker_gb (float, optional) – Reserved memory per parallel worker in GB. None uses automatic estimation.

  • auto_memory_management (bool, optional) – Dynamically cap max_samples based on available RAM. Default is True.

  • subsample_seed (int, optional) – Random seed for reproducible subsampling. Default is 42.

  • run_name (str)

  • max_workers (int | None)

Returns:

True if at least one embedding was computed successfully; False otherwise.

Return type:

bool

See also

song_phenotyping.flattening.flatten_bird_spectrograms

Stage B (produces input).

song_phenotyping.labelling.label_bird

Stage D (consumes output).
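
A possible way to call the entry point and locate its outputs is sketched below. The call uses only parameters from the documented signature; the helper names run_stage_c and stage_c_output_dirs, the project path, and the reduced grid are hypothetical. The import is placed inside the function so the path helper can be used even where the package is not installed.

```python
from pathlib import Path


def stage_c_output_dirs(save_path, bird):
    """Return the embeddings and models directories described above."""
    syllable_dir = Path(save_path) / bird / "syllable_data"
    return syllable_dir / "embeddings", syllable_dir / "models"


def run_stage_c(save_path, bird):
    """Run a small grid for one bird; requires song_phenotyping."""
    from song_phenotyping.embedding import explore_embedding_parameters_robust

    return explore_embedding_parameters_robust(
        save_path,
        bird,
        min_dists=[0.1, 0.5],       # smaller grid than the default
        n_neighbors_list=[10, 20],
        max_workers=2,
        overwrite=False,
    )


# Hypothetical project root; replace with a real path.
embeddings_dir, models_dir = stage_c_output_dirs("/data/project", "or18or24")
print(embeddings_dir)
print(models_dir)
```

run_stage_c returns the documented boolean, so a caller can stop before Stage D when no embedding was produced.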

song_phenotyping.embedding.generate_embedding_paths(paths, params, n_samples, was_subsampled, subsample_seed=None)[source]

Generate output paths that include sample-size information.

Return type:

Tuple[str, str]

song_phenotyping.embedding.group_delay_distance(spec1, spec2)[source]

L2 distance between the group-delay representations of two spectrograms.

Group delay is computed as -d(phase)/d(frequency) along the frequency axis (axis 0) after phase unwrapping.

Parameters:
  • spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

  • spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

Returns:

L2 norm of the group-delay difference.

Return type:

float
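
The group-delay computation described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's code: the helper names are hypothetical, the derivative is approximated by a first-order finite difference between adjacent frequency bins, and spectrograms are nested lists indexed as spec[frequency][time].

```python
import cmath
import math


def unwrap(phases):
    """Remove 2*pi jumps from a 1-D phase sequence."""
    out = [phases[0]]
    for prev, cur in zip(phases, phases[1:]):
        # Wrap each step into [-pi, pi) before accumulating.
        step = (cur - prev + math.pi) % (2 * math.pi) - math.pi
        out.append(out[-1] + step)
    return out


def group_delay(spec):
    """-d(phase)/d(frequency) along axis 0, one list per time frame."""
    n_freq, n_time = len(spec), len(spec[0])
    delays = []
    for t in range(n_time):
        column = unwrap([cmath.phase(spec[f][t]) for f in range(n_freq)])
        delays.append([-(column[f + 1] - column[f]) for f in range(n_freq - 1)])
    return delays


def group_delay_distance_sketch(spec1, spec2):
    """L2 norm of the difference between two group-delay maps."""
    gd1, gd2 = group_delay(spec1), group_delay(spec2)
    return math.sqrt(sum((a - b) ** 2
                         for row1, row2 in zip(gd1, gd2)
                         for a, b in zip(row1, row2)))


# Two spectrograms with linear phase along frequency (4 bins, 2 frames).
spec1 = [[cmath.exp(-0.5j * f)] * 2 for f in range(4)]  # group delay 0.5
spec2 = [[cmath.exp(-0.2j * f)] * 2 for f in range(4)]  # group delay 0.2
dist = group_delay_distance_sketch(spec1, spec2)
print(dist)
```

Each of the six group-delay values differs by 0.3 in this example, so the distance is 0.3 times the square root of 6.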

song_phenotyping.embedding.inspect_existing_embeddings(bird_path)[source]

Utility to inspect which embeddings already exist and their metadata.

Parameters:

bird_path (str)

Return type:

None

song_phenotyping.embedding.instantaneous_freq_distance(spec1, spec2)[source]

L2 distance between the instantaneous-frequency representations of two spectrograms.

Instantaneous frequency is computed as d(phase)/d(time) along the time axis (axis 1) after phase unwrapping.

Parameters:
  • spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

  • spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

Returns:

L2 norm of the instantaneous-frequency difference.

Return type:

float

song_phenotyping.embedding.load_embedding_from_file(embedding_path)[source]

Load embeddings from an HDF5 file.

Parameters:

embedding_path (str)

Return type:

ndarray | None

song_phenotyping.embedding.load_flattened_specs(paths_to_specs)[source]

Load all Stage B flattened spectrogram files from a directory.

Concatenates all flattened_*.h5 files found in paths_to_specs into a single set of arrays. Files that cannot be read are skipped with a warning.

Parameters:

paths_to_specs (str) – Path to the syllable_data/flattened/ directory produced by Stage B.

Returns:

  • flattened_specs (numpy.ndarray, shape (n_features, n_syllables)) – Concatenated feature matrix.

  • labels (numpy.ndarray, shape (n_syllables,)) – Syllable labels decoded to str.

  • position_idxs (numpy.ndarray, shape (n_syllables,)) – Position indices within the original recording.

  • hashes (numpy.ndarray, shape (n_syllables,)) – Per-syllable hash strings for cross-stage tracking.

  • song_file_ids (numpy.ndarray, shape (n_syllables,)) – Integer index of the source HDF5 file for each syllable. All syllables from the same song file share the same value, enabling song-level subsampling.

Raises:

ValueError – If no flattened files are found or no valid syllables can be loaded.

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray, ndarray]
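
Two pieces of the loading step can be illustrated without reading any HDF5 content: discovering the flattened_*.h5 files and building per-syllable file indices of the kind described for song_file_ids. Both helpers are hypothetical, and processing the files in sorted order is an assumption.

```python
import tempfile
from pathlib import Path


def find_flattened_files(paths_to_specs):
    """List flattened_*.h5 files, raising ValueError if none exist."""
    files = sorted(Path(paths_to_specs).glob("flattened_*.h5"))
    if not files:
        raise ValueError(f"no flattened_*.h5 files found in {paths_to_specs}")
    return files


def song_file_ids_sketch(syllables_per_file):
    """One integer per syllable: the index of the file it came from."""
    ids = []
    for file_idx, count in enumerate(syllables_per_file):
        ids.extend([file_idx] * count)
    return ids


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files stand in for real Stage B output.
    for name in ("flattened_b.h5", "flattened_a.h5", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_flattened_files(tmp)]

ids = song_file_ids_sketch([2, 3])
print(found, ids)
```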

song_phenotyping.embedding.main()[source]
song_phenotyping.embedding.monitor_memory_usage()[source]

Return current memory usage info.

song_phenotyping.embedding.phase_aware_spectrogram_distance(spec1, spec2)[source]

Distance metric combining magnitude cosine similarity and phase coherence.

Computes 1 - (0.8 * mag_sim + 0.2 * phase_coherence) so that more similar spectrograms yield smaller distances.

Parameters:
  • spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

  • spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

Returns:

Distance in [0, 1]; lower = more similar.

Return type:

float
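
The weighted combination above can be worked through on a small example. The 0.8 and 0.2 weights come from the formula given here; the definition of phase coherence used below (magnitude of the mean unit phasor of the phase difference) is an assumption, as is the use of flat lists of complex values instead of 2-D arrays.

```python
import cmath
import math


def phase_aware_distance_sketch(spec1, spec2):
    """1 - (0.8 * mag_sim + 0.2 * phase_coherence) on flat complex lists."""
    mags1 = [abs(z) for z in spec1]
    mags2 = [abs(z) for z in spec2]
    dot = sum(a * b for a, b in zip(mags1, mags2))
    norm1 = math.sqrt(sum(a * a for a in mags1))
    norm2 = math.sqrt(sum(b * b for b in mags2))
    # Cosine similarity of the magnitude vectors.
    mag_sim = dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    # Assumed definition: magnitude of the mean unit phasor of the
    # element-wise phase difference.
    phasors = [cmath.exp(1j * (cmath.phase(a) - cmath.phase(b)))
               for a, b in zip(spec1, spec2)]
    phase_coherence = abs(sum(phasors) / len(phasors))

    return 1.0 - (0.8 * mag_sim + 0.2 * phase_coherence)


x = [1 + 0j, 0 + 2j, -1 + 1j]
y = [1 + 0j, 0 - 2j, -1 + 1j]  # same magnitudes, one phase flipped
same = phase_aware_distance_sketch(x, x)
flipped = phase_aware_distance_sketch(x, y)
print(same, flipped)
```

Because x and y have identical magnitudes, the whole distance between them comes from the phase term: coherence drops to 1/3, giving 1 - (0.8 + 0.2/3).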

song_phenotyping.embedding.save_umap_embeddings(embedding_path, embeddings, hashes, labels=None, metadata=None)[source]

Write a UMAP embedding array and associated metadata to HDF5.

Parameters:
  • embedding_path (str) – Destination .h5 file path.

  • embeddings (numpy.ndarray, shape (n_syllables, n_components)) – UMAP coordinates.

  • hashes (list of str) – Per-syllable hash strings (same order as rows of embeddings).

  • labels (list, optional) – Syllable labels. Stored under the labels node if provided.

  • metadata (dict, optional) – Arbitrary processing metadata stored as HDF5 attributes on the root node.

Returns:

True on success, False if an error occurred.

Return type:

bool

song_phenotyping.embedding.save_umap_model(model_path, umap_model, params)[source]

Pickle a fitted UMAP model and its parameters to disk.

Parameters:
  • model_path (str) – Destination .pkl file path.

  • umap_model (umap.UMAP) – A fitted UMAP model instance.

  • params (UMAPParams) – The hyperparameters used to fit the model.

Returns:

True on success, False if an exception was raised.

Return type:

bool
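
Saving a model together with its parameters and reporting success as a boolean might look like the sketch below. The payload layout (a dict with "model" and "params" keys) and the function name are assumptions, and a plain dictionary stands in for a fitted umap.UMAP instance.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_sketch(model_path, model, params_dict):
    """Pickle a model with its parameters; the payload layout is assumed."""
    try:
        with open(model_path, "wb") as fh:
            pickle.dump({"model": model, "params": params_dict}, fh)
        return True
    except (OSError, pickle.PicklingError):
        # Report failure instead of raising, mirroring the documented bool.
        return False


# Any picklable object stands in for a fitted umap.UMAP here.
stand_in_model = {"fitted": True}
params = {"n_neighbors": 10, "min_dist": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    ok = save_model_sketch(path, stand_in_model, params)
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

print(ok, loaded["params"])
```

Pickle files can execute arbitrary code when loaded, so saved models should only be unpickled from trusted sources.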

song_phenotyping.embedding.subsample_by_song(specs, labels, position_idxs, hashes, song_file_ids, max_samples, random_seed=42)[source]

Subsample by selecting whole songs (files) until max_samples is reached.

Sampling whole songs preserves within-song syllable-type distributions far better than uniform random syllable sampling. Songs are shuffled with random_seed, then greedily added in that order until the next song would push the total over max_samples. If max_samples is already satisfied by the data, the arrays are returned unchanged.

Parameters:
  • specs (numpy.ndarray, shape (n_features, n_syllables)) – Feature matrix (columns = syllables).

  • labels (numpy.ndarray, shape (n_syllables,)) – Per-syllable annotation arrays.

  • position_idxs (numpy.ndarray, shape (n_syllables,)) – Per-syllable annotation arrays.

  • hashes (numpy.ndarray, shape (n_syllables,)) – Per-syllable annotation arrays.

  • song_file_ids (numpy.ndarray, shape (n_syllables,)) – Integer file-index for each syllable (from load_flattened_specs()).

  • max_samples (int) – Target syllable budget.

  • random_seed (int, optional) – Seed for reproducible song shuffling. Default 42.

Returns:

specs, labels, position_idxs, hashes – Subsampled arrays containing only syllables from selected songs, sorted in their original chronological order.

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray]
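
The song-level selection described above can be sketched on the song_file_ids alone. This is an illustrative version, not the module's code: it returns the indices of the syllables to keep instead of slicing arrays, and it uses Python's random module, so the exact songs chosen for a given seed will differ from the real function.

```python
import random
from collections import Counter


def subsample_by_song_sketch(song_file_ids, max_samples, random_seed=42):
    """Return sorted syllable indices for whole songs within max_samples."""
    n_total = len(song_file_ids)
    if n_total <= max_samples:
        # Budget already satisfied: keep everything unchanged.
        return list(range(n_total))

    counts = Counter(song_file_ids)
    songs = sorted(counts)
    random.Random(random_seed).shuffle(songs)

    selected, total = set(), 0
    for song in songs:
        if total + counts[song] > max_samples:
            break  # the next song would exceed the budget
        selected.add(song)
        total += counts[song]

    # Scanning indices in sequence preserves the original order.
    return [i for i, song in enumerate(song_file_ids) if song in selected]


song_file_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
keep = subsample_by_song_sketch(song_file_ids, max_samples=6)
print(keep)
```

The returned indices can then be applied to the syllable axis of each array, which is the column axis for the (n_features, n_syllables) feature matrix.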

song_phenotyping.embedding.subsample_data(specs, labels, position_idxs, hashes, max_samples, random_seed=42)[source]

Randomly subsample the data if it exceeds max_samples.

Parameters:
  • specs, labels, position_idxs, hashes – Original data arrays.

  • max_samples – Maximum number of samples to keep.

  • random_seed – Random seed for reproducibility. Default is 42.

Returns:

Subsampled versions of the input arrays.

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray]
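
Uniform random subsampling with a fixed seed can be sketched as index selection. The function name is hypothetical, and returning the indices in sorted order (so the original ordering is kept) is an assumption; the real function may order its output differently.

```python
import random


def subsample_indices_sketch(n_samples, max_samples, random_seed=42):
    """Pick at most max_samples distinct indices, reproducibly."""
    if n_samples <= max_samples:
        # Nothing to drop when the data already fits the budget.
        return list(range(n_samples))
    rng = random.Random(random_seed)
    return sorted(rng.sample(range(n_samples), max_samples))


idx = subsample_indices_sketch(1000, 100)
print(len(idx), idx[:5])
```

The same index list would be applied to specs, labels, position_idxs, and hashes so the arrays stay aligned.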