Embedding (Stage C)

UMAP dimensionality reduction for syllable spectrograms (Stage C).

Takes the flattened spectrogram feature matrices produced by Stage B and reduces them to 2-D embeddings using UMAP. A grid of (min_dist, n_neighbors) combinations is explored; each embedding is saved as an HDF5 file and the corresponding UMAP model is pickled for later projection of new data.

Parallel processing is used by default, with adaptive worker-count and memory-limit logic to avoid OOM failures on large datasets.
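The parameter grid described above can be enumerated as in the following minimal sketch. It uses the default value lists documented for explore_embedding_parameters_robust(); the variable names are illustrative and not part of the module.

```python
from itertools import product

# Default grids as documented for explore_embedding_parameters_robust().
min_dists = [0.01, 0.05, 0.1, 0.3, 0.5]
n_neighbors_list = [5, 10, 20, 50, 100]

# One (min_dist, n_neighbors) pair per embedding run.
grid = list(product(min_dists, n_neighbors_list))

for min_dist, n_neighbors in grid[:3]:
    print(f"min_dist={min_dist}, n_neighbors={n_neighbors}")
```

With the defaults, the grid contains 25 combinations, so runtime and disk usage scale with the product of the two list lengths.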

Public API

UMAPParams — UMAP hyperparameter dataclass
explore_embedding_parameters_robust() — Stage C entry point
load_flattened_specs() — load Stage B output for downstream use

class song_phenotyping.embedding.UMAPParams(n_neighbors=20, min_dist=0.5, metric='euclidean', n_components=2, n_epochs=10)[source]

Bases: object

Hyperparameters for a single UMAP embedding run.

Parameters:

n_neighbors (int, optional) – Number of neighbouring points used in the local approximation of the manifold structure. Larger values produce a more global view; smaller values preserve finer local structure. Default is 20.
min_dist (float, optional) – Minimum distance between points in the low-dimensional embedding. Controls how tightly UMAP packs points together. Default is 0.5.
metric (str, optional) – Distance metric used in the high-dimensional input space. Default is 'euclidean'.
n_components (int, optional) – Dimensionality of the embedding. Default is 2.
n_epochs (int, optional) – Number of training epochs. Increase for higher-quality embeddings at the cost of compute time. Default is 10.

Raises:

ValueError – If any parameter is outside its valid range.

Examples

>>> params = UMAPParams(n_neighbors=10, min_dist=0.1)
>>> params.to_dict()
{'n_neighbors': 10, 'min_dist': 0.1, 'metric': 'euclidean', 'n_components': 2, 'n_epochs': 10}

classmethod from_dict(data)[source]

Construct a UMAPParams from a dictionary.

Parameters: data (dict) – Must contain the keys produced by to_dict().
Return type: UMAPParams

metric: str = 'euclidean'

min_dist: float = 0.5

n_components: int = 2

n_epochs: int = 10

n_neighbors: int = 20

to_dict()[source]

Return all parameters as a plain dictionary.

Returns: Keys: n_neighbors, min_dist, metric, n_components, n_epochs.
Return type: dict

validate_params()[source]: Raise ValueError if any parameter is out of range.
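The to_dict() / from_dict() round trip and the validation behaviour can be illustrated with a stand-in dataclass. This is a sketch, not the module's implementation: the class name UMAPParamsSketch is hypothetical, and the validity ranges checked below are assumptions, since the documentation does not state the actual ranges.

```python
from dataclasses import asdict, dataclass


@dataclass
class UMAPParamsSketch:
    """Hypothetical stand-in for UMAPParams; not the library's code."""

    n_neighbors: int = 20
    min_dist: float = 0.5
    metric: str = "euclidean"
    n_components: int = 2
    n_epochs: int = 10

    def __post_init__(self):
        # Validate on construction so invalid objects never exist.
        self.validate_params()

    def validate_params(self):
        # Assumed ranges, chosen for illustration only.
        if self.n_neighbors < 2:
            raise ValueError("n_neighbors must be >= 2")
        if self.min_dist < 0.0:
            raise ValueError("min_dist must be >= 0")
        if self.n_components < 1:
            raise ValueError("n_components must be >= 1")
        if self.n_epochs < 1:
            raise ValueError("n_epochs must be >= 1")

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)


params = UMAPParamsSketch(n_neighbors=10, min_dist=0.1)
restored = UMAPParamsSketch.from_dict(params.to_dict())
```

A plain-dictionary form like this is convenient for storing the hyperparameters alongside each saved embedding.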

song_phenotyping.embedding.calculate_adaptive_workers_improved(n_samples, n_features, max_n_neighbors, memory_per_worker_gb=None, max_workers=None)[source]

Calculate how many parallel UMAP workers can run safely, using a memory estimate derived from the dataset size.

Parameters:

n_samples (int)
n_features (int)
max_n_neighbors (int)
memory_per_worker_gb (float | None)
max_workers (int | None)

Return type:

int
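One plausible way to choose a worker count is to take the minimum of several independent limits. The sketch below is an assumption about the general approach, not the module's actual logic; the function adaptive_workers_sketch and its parameters available_gb and n_tasks are hypothetical and differ from the documented signature.

```python
import os


def adaptive_workers_sketch(available_gb, memory_per_worker_gb, n_tasks, max_workers=None):
    """Hypothetical heuristic: bound workers by RAM, CPU count, tasks, and a user cap."""
    # Assumes memory_per_worker_gb > 0.
    by_memory = int(available_gb // memory_per_worker_gb)
    by_cpu = os.cpu_count() or 1
    limit = min(by_memory, by_cpu, n_tasks)
    if max_workers is not None:
        limit = min(limit, max_workers)
    # Always allow at least one worker so the run can proceed sequentially.
    return max(1, limit)
```

Returning at least 1 means a caller can fall back to sequential execution when memory is too tight for parallelism.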

song_phenotyping.embedding.calculate_safe_batch_size(available_memory_gb, n_features, max_n_neighbors, safety_factor=0.7)[source]

Calculate maximum safe batch size for UMAP given memory constraints.

Parameters:

available_memory_gb (float)
n_features (int)
max_n_neighbors (int)
safety_factor (float)

Return type:

int
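The idea behind a safe batch size can be sketched by dividing the usable memory by an estimated per-sample cost. The cost model below (one float32 feature row plus neighbour indices and distances) is an assumption made for illustration; the function safe_batch_size_sketch and its bytes_per_value parameter are hypothetical, and the module's own estimate may differ.

```python
def safe_batch_size_sketch(available_memory_gb, n_features, max_n_neighbors,
                           safety_factor=0.7, bytes_per_value=4):
    """Hypothetical estimate; the per-sample cost model is an assumption."""
    # Assumed per-sample cost: one feature row plus neighbour
    # indices and distances for the k-NN graph.
    bytes_per_sample = bytes_per_value * (n_features + 2 * max_n_neighbors)
    # Only a fraction of the available memory is treated as usable.
    usable_bytes = available_memory_gb * 1024**3 * safety_factor
    return int(usable_bytes // bytes_per_sample)
```

The safety factor leaves headroom for temporary buffers that a simple per-sample model does not capture.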

song_phenotyping.embedding.check_embedding_compatibility(embedding_path, current_n_samples, current_hashes, overwrite=False)[source]

Check if existing embedding is compatible with current data.

Returns: True if the existing embedding is compatible or computation should proceed; False if it should be skipped.

Parameters:

embedding_path (str)
current_n_samples (int)
current_hashes (list)
overwrite (bool)

Return type:

bool

song_phenotyping.embedding.compare_umap_embeddings_plot(successful_params, min_dists, n_neighbors, paths, save_path, bird='', processing_metadata=None)[source]

Create a comparison plot of the embedding grid, loading each embedding from disk on demand. Handles the filename structure that includes sample info.

Parameters:

successful_params (List[Tuple[int, float]])
paths (dict)
save_path (str)
bird (str)
processing_metadata (dict)

song_phenotyping.embedding.complex_spectrogram_distance(spec1, spec2)[source]

L2 distance between two complex spectrograms.

Parameters:

spec1 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.
spec2 (np.ndarray) – Complex-valued spectrogram arrays of the same shape.

Returns:

Frobenius norm of the element-wise difference.

Return type:

float
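The Frobenius norm of the element-wise difference can be computed with NumPy as in the sketch below. The function name complex_l2_distance is illustrative; this shows the documented quantity, not necessarily the module's exact code.

```python
import numpy as np


def complex_l2_distance(spec1, spec2):
    """Frobenius norm of the element-wise difference of two complex arrays."""
    diff = np.asarray(spec1) - np.asarray(spec2)
    # For complex input, the norm uses the modulus of each entry.
    return float(np.linalg.norm(diff))


a = np.array([[1 + 1j, 0], [0, 2j]])
b = np.array([[1 + 0j, 0], [0, 0]])
print(complex_l2_distance(a, b))
```

Here the difference has entries 1j and 2j, so the distance is the square root of 1 + 4.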

song_phenotyping.embedding.compute_and_save_umap_memory_aware(samples, labels, hashes, params, model_path, embedding_path, save_model=False, overwrite=False, processing_metadata=None)[source]

Memory-aware UMAP computation with fallback strategies.

Parameters:

samples (ndarray)
params (UMAPParams)
model_path (str)
embedding_path (str)
save_model (bool)
overwrite (bool)
processing_metadata (dict)

Return type:

Tuple[ndarray | None, umap.UMAP | None, bool]

song_phenotyping.embedding.compute_embedding_grid_parallel_robust(samples, labels, hashes, min_dists, n_neighbors, paths, plot=True, bird='', max_workers=None, overwrite=False, processing_metadata=None)[source]

Compute the embedding grid in parallel, with memory monitoring and graceful degradation.

Parameters:

plot (bool)
bird (str)
max_workers (int | None)
overwrite (bool)
processing_metadata (dict)

song_phenotyping.embedding.compute_single_umap_worker_safe(args)[source]: Worker function for a single UMAP run, with memory monitoring and fallback.

song_phenotyping.embedding.estimate_umap_memory_usage(n_samples, n_features, n_neighbors)[source]

Estimate UMAP memory usage from the sizes of the algorithm's internal data structures.

Returns estimated peak memory usage in GB.

Parameters:

n_samples (int)
n_features (int)
n_neighbors (int)

Return type:

float
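A rough estimate of this kind can be built by summing the sizes of the main arrays UMAP holds in memory. The sketch below is a hypothetical lower bound under stated assumptions (float32-sized values, no temporary buffers or library overhead); the function umap_memory_gb_sketch is not part of the module and its terms are not taken from the actual implementation.

```python
def umap_memory_gb_sketch(n_samples, n_features, n_neighbors,
                          n_components=2, bytes_per_value=4):
    """Hypothetical lower-bound estimate; ignores temporary buffers and overhead."""
    data = n_samples * n_features          # input feature matrix
    knn = 2 * n_samples * n_neighbors      # neighbour indices + distances
    embedding = n_samples * n_components   # output coordinates
    return bytes_per_value * (data + knn + embedding) / 1024**3
```

In this model the input matrix dominates when n_features is much larger than n_neighbors, which is typical for flattened spectrograms.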

song_phenotyping.embedding.explore_embedding_parameters_robust(save_path, bird, min_dists=None, n_neighbors_list=None, use_parallel=True, overwrite=False, max_samples=None, memory_per_worker_gb=None, auto_memory_management=True, subsample_seed=42, run_name='default', max_workers=None)[source]

Compute UMAP embeddings over a parameter grid for one bird (Stage C entry point).

Loads Stage B flattened spectrograms, optionally subsamples them to fit available memory, and runs UMAP for every (min_dist, n_neighbors) combination. Each embedding is saved to <save_path>/<bird>/syllable_data/embeddings/<params>.h5 and the corresponding model is pickled to <save_path>/<bird>/syllable_data/models/.

Parameters:

save_path (str) – Project root directory containing bird subdirectories.
bird (str) – Bird identifier (e.g. 'or18or24').
min_dists (list of float, optional) – min_dist values to explore. Defaults to [0.01, 0.05, 0.1, 0.3, 0.5].
n_neighbors_list (list of int, optional) – n_neighbors values to explore. Defaults to [5, 10, 20, 50, 100].
use_parallel (bool, optional) – Run parameter combinations in parallel using ProcessPoolExecutor. Automatically falls back to sequential execution if only one worker is safe. Default is True.
overwrite (bool, optional) – Recompute embeddings that already exist on disk. Default is False.
max_samples (int, optional) – Hard cap on syllables passed to UMAP. None uses the memory-aware safe batch size.
memory_per_worker_gb (float, optional) – Reserved memory per parallel worker in GB. None uses automatic estimation.
auto_memory_management (bool, optional) – Dynamically cap max_samples based on available RAM. Default is True.
subsample_seed (int, optional) – Random seed for reproducible subsampling. Default is 42.
run_name (str)
max_workers (int | None)

Returns:

True if at least one embedding was computed successfully; False otherwise.

Return type:

bool
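Reproducible subsampling, as controlled by max_samples and subsample_seed, can be sketched with a seeded NumPy generator. The helper subsample_indices is hypothetical and shows one common approach; the module's own sampling method is not documented here.

```python
import numpy as np


def subsample_indices(n_total, max_samples, seed=42):
    """Pick at most max_samples row indices, reproducibly, without replacement."""
    if max_samples is None or n_total <= max_samples:
        return np.arange(n_total)
    rng = np.random.default_rng(seed)
    # Sort so the subsample preserves the original row order.
    return np.sort(rng.choice(n_total, size=max_samples, replace=False))


idx = subsample_indices(n_total=10_000, max_samples=500)
```

Applying the same index array to the samples, labels, and hashes keeps the three aligned after subsampling.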