Phenotyping (Stage E)

Compute song phenotypes from clustering results (Stage E).

For each bird, this module ingests the top-ranked HDBSCAN clusterings produced by Stage D, reconstructs syllable sequences from raw spec files, and computes a standardised set of phenotypic measures:

  • Vocabulary — repertoire size and syllable proportions.

  • Entropy — song-level and transition-level information content, computed both including and excluding introductory notes (entropy_excl_intro / entropy_scaled_excl_intro).

  • Transitions — first-order Markov transition matrices, with syllable types ordered by their mean position_in_song from syllable_features.csv (so the matrix follows typical song order).

  • Repeats — detection and characterisation of stereotyped syllable repetitions (dyads, longer motifs).

  • Intro notes — auto-detection of introductory notes (syllables that disproportionately appear at position 0), with recurrence and count stats.

Results are written to phenotype_results.csv in the run’s results/ directory, and detailed data structures (for catalog generation) are pickled to stages/05_phenotype/.

Public API

class song_phenotyping.phenotyping.PhenotypingConfig(min_syllable_proportion=0.02, repeat_significance_threshold=0.4, repeat_candidate_range=None, dyad_threshold=0.95, adaptive_repeat_factor=0.25, intro_note_position_threshold=0.5, use_top_n_clusterings=5, generate_plots=True, figure_dpi=300, heatmap_annotation_size=8)[source]

Bases: object

Configurable parameters for Stage E phenotyping analysis.

Parameters:
  • min_syllable_proportion (float, optional) – Minimum fraction of all syllables a type must represent to be counted toward repertoire size. Default is 0.02.

  • repeat_significance_threshold (float, optional) – Minimum fraction of a syllable’s instances that must participate in a repeat for that repeat to be considered significant. Default is 0.4.

  • repeat_candidate_range (range or None, optional) – Repeat lengths (number of consecutive syllables) to search. Defaults to range(2, 20) when None.

  • dyad_threshold (float, optional) – Fraction of length-2 repeats above which a syllable is classified as a dyad. Default is 0.95.

  • adaptive_repeat_factor (float, optional) – Scales the per-syllable minimum prevalence used to filter spurious repeats. Default is 0.25.

  • use_top_n_clusterings (int, optional) – Number of top-ranked clustering results (from master_summary.csv) to process. Default is 5.

  • generate_plots (bool, optional) – Whether to save scatter plots and heatmaps. Default is True.

  • figure_dpi (int, optional) – Resolution of saved figures in dots per inch. Default is 300.

  • intro_note_position_threshold (float, optional) – Minimum fraction of songs that must open with a given syllable type for it to be classified as an introductory note. For example, a value of 0.5 means the label must be the first syllable in at least half of all songs. This is independent of how many times the intro note repeats per song. Default is 0.5.

  • heatmap_annotation_size (int, optional) – Font size for transition-matrix heatmap annotations. Default is 8.

Examples

>>> cfg = PhenotypingConfig(min_syllable_proportion=0.05, generate_plots=False)
>>> cfg.repeat_candidate_range
range(2, 20)

adaptive_repeat_factor: float = 0.25
dyad_threshold: float = 0.95
figure_dpi: int = 300
generate_plots: bool = True
heatmap_annotation_size: int = 8
intro_note_position_threshold: float = 0.5
min_syllable_proportion: float = 0.02
repeat_candidate_range: range = None
repeat_significance_threshold: float = 0.4
use_top_n_clusterings: int = 5

song_phenotyping.phenotyping.analyze_repeats(syllables, handler, config)[source]

Analyze syllable repeat patterns and statistics.

Args:

  • syllables – Complete syllable sequence, including start/stop tokens.

  • handler – LabelHandler for the specific label type.

  • config – Configuration object with repeat analysis parameters.

Returns:

Dictionary with repeat analysis results

Return type:

Dict[str, Any]
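
Illustrative sketch only, not the module's implementation: the kind of run-length bookkeeping that repeat analysis relies on can be written with itertools.groupby. The helper names count_repeat_runs and repeat_fraction, and the token values "s" and "e", are hypothetical. repeat_fraction mirrors the idea behind repeat_significance_threshold (the share of a syllable's instances that sit inside a repeat), under the assumption that a repeat is any run of length 2 or more.

```python
from collections import Counter, defaultdict
from itertools import groupby


def count_repeat_runs(syllables, start_token="s", end_token="e"):
    """Return {label: Counter({run_length: n_runs})} for consecutive runs.

    Start/end tokens (hypothetical values here) are skipped, so runs never
    span a song boundary.
    """
    runs = defaultdict(Counter)
    for label, group in groupby(syllables):
        if label in (start_token, end_token):
            continue
        runs[label][len(list(group))] += 1
    return dict(runs)


def repeat_fraction(run_counts, min_length=2):
    """Fraction of a label's instances that lie in runs of >= min_length."""
    total = sum(length * n for length, n in run_counts.items())
    in_repeats = sum(
        length * n for length, n in run_counts.items() if length >= min_length
    )
    return in_repeats / total if total else 0.0


sequence = ["s", "i", "i", "i", "a", "b", "b", "c", "e",
            "s", "i", "i", "a", "b", "c", "e"]
runs = count_repeat_runs(sequence)
for label in sorted(runs):
    print(label, dict(runs[label]), round(repeat_fraction(runs[label]), 3))
```

With a threshold of 0.4, label "b" in this toy sequence (two of its three instances are in a repeat) would pass, while "a" and "c" would not.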

song_phenotyping.phenotyping.analyze_transitions(syllables, handler, label_position_order=None)[source]

Analyze syllable transition patterns and matrices.

Args:

  • syllables – Complete syllable sequence, including start/stop tokens.

  • handler – LabelHandler for the specific label type.

  • label_position_order – Optional {label: mean_position_in_song} mapping used to sort syllable types by their typical song position rather than alphabetically.

Returns:

Dictionary with transition analysis results

Return type:

Dict[str, Any]
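
A minimal sketch of a first-order transition matrix with position-based ordering, assuming row-normalised counts; the function name transition_matrix and the example values are hypothetical, and the module's own output structure may differ. Labels missing from label_position_order are placed last, in alphabetical order, which is a choice made for this sketch.

```python
from collections import Counter


def transition_matrix(syllables, label_position_order=None):
    """Return (labels, matrix) with matrix[i][j] = P(labels[j] | labels[i]).

    Labels are sorted by their mean position when a mapping is supplied,
    otherwise alphabetically.
    """
    order = label_position_order or {}
    labels = sorted(
        set(syllables),
        key=lambda lab: (order.get(lab, float("inf")), str(lab)),
    )
    # Count each (current, next) pair once over the whole sequence.
    pair_counts = Counter(zip(syllables[:-1], syllables[1:]))
    matrix = []
    for src in labels:
        row_total = sum(pair_counts[(src, dst)] for dst in labels)
        matrix.append([
            pair_counts[(src, dst)] / row_total if row_total else 0.0
            for dst in labels
        ])
    return labels, matrix


song = ["a", "b", "c", "a", "b", "b"]
order = {"b": 0.0, "a": 1.5, "c": 3.0}  # hypothetical mean positions
labels, matrix = transition_matrix(song, order)
for label, row in zip(labels, matrix):
    print(label, row)
```

Calling transition_matrix(song) without a mapping falls back to alphabetical ordering, which is the behaviour the label_position_order argument is described as replacing.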

song_phenotyping.phenotyping.analyze_vocabulary_and_entropy(syllables, handler, config, label_position_order=None)[source]

Analyze vocabulary size, syllable proportions, and sequence entropy.

Args:

  • syllables – Complete syllable sequence, including start/stop tokens.

  • handler – LabelHandler for the specific label type.

  • config – Configuration object.

  • label_position_order – Optional {label: mean_position_in_song} mapping for position-ordered transition matrices.

Returns:

Dictionary with vocabulary and entropy metrics

Return type:

Dict[str, Any]
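
The following sketch shows one way repertoire size and entropy could be computed from a label sequence. It assumes Shannon entropy in bits over syllable proportions and a scaling by log2 of the number of types; both are assumptions, since the exact definitions used by the module are not stated here. The function name vocabulary_and_entropy and its exclude argument are hypothetical.

```python
import math
from collections import Counter


def vocabulary_and_entropy(syllables, min_syllable_proportion=0.02, exclude=()):
    """Repertoire size, Shannon entropy (bits) and scaled entropy.

    Labels listed in `exclude` (e.g. start/end tokens or intro notes) are
    dropped before counting.
    """
    counts = Counter(s for s in syllables if s not in exclude)
    total = sum(counts.values())
    if total == 0:
        return {"repertoire_size": 0, "entropy": 0.0, "entropy_scaled": 0.0}
    proportions = {lab: n / total for lab, n in counts.items()}
    repertoire_size = sum(
        p >= min_syllable_proportion for p in proportions.values()
    )
    entropy = -sum(p * math.log2(p) for p in proportions.values())
    max_entropy = math.log2(len(proportions)) if len(proportions) > 1 else 0.0
    return {
        "repertoire_size": repertoire_size,
        "entropy": entropy,
        "entropy_scaled": entropy / max_entropy if max_entropy else 0.0,
    }


sequence = ["s", "i", "a", "a", "b", "b", "e"]  # "s"/"e" are stand-in tokens
incl_intro = vocabulary_and_entropy(sequence, 0.25, exclude=("s", "e"))
excl_intro = vocabulary_and_entropy(sequence, 0.25, exclude=("s", "e", "i"))
print(incl_intro)
print(excl_intro)
```

Computing the metrics twice, once with and once without the intro label, corresponds to the including/excluding-intro-notes distinction described for the entropy measures.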

song_phenotyping.phenotyping.calculate_phenotypes_for_label_type(syllables, label_type, bird_name, config, clustering_metadata=None, label_position_order=None)[source]

Calculate all phenotype metrics for one label type.

Args:

  • syllables – List of syllable labels, including start/end tokens.

  • label_type – ‘manual’ or ‘hdbscan’.

  • bird_name – Bird identifier.

  • config – Configuration object.

  • clustering_metadata – Optional dict of clustering metadata to merge into results.

  • label_position_order – Optional {label: mean_position_in_song} mapping. When provided, syllable types in transition matrices are ordered by their typical position in the song rather than alphabetically.

Returns:

Dictionary of phenotype metrics

Return type:

Dict[str, Any]

song_phenotyping.phenotyping.create_unified_phenotype_row(bird_name, manual_results, auto_results, clustering_results, config, tempo_stats=None)[source]

Create unified phenotype DataFrame with manual labels as first row (if available), followed by ranked automated results.

Args:

  • bird_name – Bird identifier.

  • manual_results – Dictionary with manual phenotype results.

  • auto_results – List of dictionaries with automated phenotype results.

  • clustering_results – List of dictionaries with clustering metadata.

  • config – Configuration object.

Returns:

DataFrame with phenotype data (manual first, then ranked automated)

Return type:

DataFrame

song_phenotyping.phenotyping.detect_intro_notes(syllables, handler, config)[source]

Identify introductory note syllable types from a syllable sequence.

A syllable type is classified as an introductory note when it is the first syllable in at least config.intro_note_position_threshold of all songs. This is robust to intro notes that repeat (e.g. 10 consecutive intro notes at song onset) — only the song-level question matters: “did this song open with this label?”, not “what fraction of this label’s instances were first?”.

For each identified intro note type the function computes:

  • Whether the intro note also recurs at non-first positions within songs (intro_recurs_in_song).

  • Mean and std of intro note counts per song.

  • The full repeat-length distribution of consecutive intro notes that open each song.

Parameters:
  • syllables (list) – Full syllable sequence including start/end tokens.

  • handler (LabelHandler) – Provides start/end token values.

  • config (PhenotypingConfig) – intro_note_position_threshold controls detection sensitivity. Interpreted as the minimum fraction of songs that must open with a given label for it to be classified as an intro note.

Returns:

Keys:

has_intro_notes (bool)

Whether any intro note types were detected.

intro_note_labels (list)

Syllable type labels classified as intro notes.

intro_recurs_in_song (bool)

True if any intro note type appears at non-opening positions in some songs (i.e. after a non-intro syllable has been sung).

mean_intro_count_per_song (float)

Mean number of intro note syllables per song (across all songs).

std_intro_count_per_song (float)

Standard deviation of intro note count per song.

intro_repeat_distribution (dict)

Mapping {run_length: count} of consecutive opening intro notes.

intro_note_labels_set (set)

Fast-lookup set of intro label values (for filtering).

Return type:

dict
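
The song-level detection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper names, the token values "s" and "e", and the way songs are split are all hypothetical.

```python
from collections import Counter


def split_songs(syllables, start_token="s", end_token="e"):
    """Split a flat token sequence into per-song label lists."""
    songs, current = [], None
    for label in syllables:
        if label == start_token:
            current = []
        elif label == end_token:
            if current:
                songs.append(current)
            current = None
        elif current is not None:
            current.append(label)
    return songs


def find_intro_labels(songs, threshold=0.5):
    """Labels that open at least `threshold` of all songs."""
    openers = Counter(song[0] for song in songs)
    return [lab for lab, n in openers.items() if n / len(songs) >= threshold]


def opening_run_lengths(songs, intro_labels):
    """Number of consecutive intro notes at the start of each song."""
    lengths = []
    for song in songs:
        run = 0
        for label in song:
            if label not in intro_labels:
                break
            run += 1
        lengths.append(run)
    return lengths


sequence = ["s", "i", "i", "i", "a", "b", "e",
            "s", "i", "a", "b", "i", "e",
            "s", "a", "b", "e"]
songs = split_songs(sequence)
intro = find_intro_labels(songs, threshold=0.5)
run_lengths = opening_run_lengths(songs, set(intro))
# An intro label recurs if it appears again after the opening run has ended.
recurs = any(
    label in intro
    for song, run in zip(songs, run_lengths)
    for label in song[run:]
)
print(intro, run_lengths, dict(Counter(run_lengths)), recurs)
```

Because the rule counts songs rather than label instances, the first song's three consecutive "i" notes contribute a single opening, which is why repeated intro notes do not distort the threshold.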

song_phenotyping.phenotyping.generate_manual_umap_plot(bird_path, syllable_data, manual_results, clustering_result, config, run_name='default')[source]

Generate UMAP plot colored by manual labels to match automated clustering plots.

Return type:

str

song_phenotyping.phenotyping.load_bird_syllable_data(bird_path, run_name='default')[source]

Load manual syllable labels from a bird’s Stage A spec files.

Parameters:
  • bird_path (str) – Absolute path to the bird’s root directory (contains syllable_data/specs/).

  • run_name (str)

Returns:

Keys: 'manual_syllables' (list), 'song_paths' (list of filenames), 'bird_name' (str), 'syllables_dir' (str). 'manual_syllables' is empty when no manual labels are found.

Return type:

dict

song_phenotyping.phenotyping.load_clustering_labels_for_syllables(clustering_result, syllable_data)[source]

Load clustering labels and map them to syllable sequences. Works with spec files in syllable_data/specs/.

Return type:

List[str | int]

song_phenotyping.phenotyping.load_clustering_results(bird_path, top_n=5, run_name='default')[source]

Load the top_n highest-ranked clustering results from master_summary.csv.

Parameters:
  • bird_path (str) – Bird root directory containing master_summary.csv.

  • top_n (int, optional) – Number of rows to return (rows are already ranked by composite score in the CSV). Default is 5.

  • run_name (str)

Returns:

One dict per row with keys: 'rank', 'label_path', 'composite_score', 'nmi', 'silhouette', 'dbi', 'n_clusters', 'clustering_method', 'n_neighbors', 'min_dist', 'metric', 'min_cluster_size', 'min_samples'. Returns an empty list if the CSV is absent or cannot be read.

Return type:

list of dict

song_phenotyping.phenotyping.load_tempo_stats(syllables_dir)[source]

Compute per-bird tempo summary statistics from Stage-A HDF5 tempo fields.

Iterates over every HDF5 file in syllables_dir and collects:

  • per-song syllable rate, derived from inter-onset intervals

  • dominant high-band tempo peak (first non-NaN entry of tempo_high_band_freqs)

  • low-band tempo peak (tempo_low_band_freq)

Returns a flat dict of mean/std for each quantity (NaN when no data).

Parameters:

syllables_dir (str)

Return type:

Dict[str, float]
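
As a hedged illustration of the syllable-rate part of this summary, the sketch below derives a per-song rate from inter-onset intervals and reduces the per-song values to mean/std with NaN when no data is available. It assumes onsets are given in seconds and uses the sample standard deviation; neither is confirmed by the text above, and the helper names are hypothetical. Reading the HDF5 files themselves is left out.

```python
import math
from statistics import mean, stdev


def syllable_rate(onsets_s):
    """Syllables per second from inter-onset intervals.

    Returns NaN when a song has fewer than two onsets.
    """
    if len(onsets_s) < 2:
        return math.nan
    intervals = [b - a for a, b in zip(onsets_s[:-1], onsets_s[1:])]
    return 1.0 / mean(intervals)


def summarise(values):
    """Mean and sample std of the non-NaN values (NaN when undefined)."""
    clean = [v for v in values if not math.isnan(v)]
    return {
        "mean": mean(clean) if clean else math.nan,
        "std": stdev(clean) if len(clean) > 1 else math.nan,
    }


# Hypothetical onset times (seconds) for three songs; the last has one onset.
songs_onsets = [[0.0, 0.125, 0.25, 0.375], [0.0, 0.25, 0.5], [0.4]]
rates = [syllable_rate(onsets) for onsets in songs_onsets]
stats = summarise(rates)
print(rates, stats)
```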

song_phenotyping.phenotyping.main(save_path)[source]

Main function to run the unified phenotyping pipeline.

Args:

save_path: Root directory containing bird data

Parameters:

save_path (str)

Return type:

None

song_phenotyping.phenotyping.phenotype_bird(bird_path, config=None, run_name='default')[source]

Run the complete Stage E phenotyping pipeline for one bird.

Processes manual labels (when available) and the top automated clustering results from Stage D. Writes:

  • <bird_path>/phenotype_results.csv — one row per clustering rank.

  • <bird_path>/syllable_data/phenotype_detailed/automated_phenotype_data_rank*.pkl — detailed data structures for PDF generation.

Returns:

True on success, False if a fatal error occurred.

Return type:

bool

Examples

>>> from song_phenotyping.phenotyping import phenotype_bird, PhenotypingConfig
>>> cfg = PhenotypingConfig(generate_plots=False)
>>> phenotype_bird("/Volumes/Extreme SSD/pipeline_runs/or18or24", cfg)
True

See also

song_phenotyping.labelling.label_bird

Stage D (produces input).

song_phenotyping.phenotyping.plot_repeat_patterns(plots_dir, repeat_counts, rank_str, bird_name, config)[source]

Generate and save repeat pattern heatmap.

Args:

  • plots_dir – Directory to save the plot.

  • repeat_counts – DataFrame with repeat counts by syllable and repeat length.

  • rank_str – Rank identifier (e.g., “rank0”).

  • bird_name – Bird identifier.

  • config – Configuration object.

Returns:

Path to generated image file

Return type:

str

song_phenotyping.phenotyping.plot_transition_matrices(plots_dir, auto_result, rank_str, bird_name, config)[source]

Generate and save transition matrix heatmaps.

Args:

  • plots_dir – Directory to save plots.

  • auto_result – Automated phenotype results containing transition matrices.

  • rank_str – Rank identifier (e.g., “rank0”).

  • bird_name – Bird identifier.

  • config – Configuration object.

Returns:

List of paths to generated image files

Return type:

List[str]

song_phenotyping.phenotyping.plot_vocabulary_comparison(plots_dir, manual_results, auto_results, bird_name, config)[source]

Generate and save vocabulary size comparison plot between manual and automated labels.

Args:

  • plots_dir – Directory to save the plot.

  • manual_results – Manual phenotype results.

  • auto_results – List of automated phenotype results.

  • bird_name – Bird identifier.

  • config – Configuration object.

Returns:

Path to generated image file

Return type:

str

song_phenotyping.phenotyping.save_detailed_phenotype_data(bird_path, manual_results, auto_results, clustering_results, run_name='default')[source]

Pickle detailed phenotype data structures for downstream PDF generation.

Writes one .pkl file per clustering rank to <bird_path>/syllable_data/phenotype_detailed/.

Parameters:
  • bird_path (str) – Bird root directory.

  • manual_results (dict) – Phenotype results computed from manual labels.

  • auto_results (list of dict) – Phenotype results for each automated clustering rank.

  • clustering_results (list of dict) – Clustering metadata dicts (as returned by load_clustering_results()).

  • run_name (str)

Returns:

Success status.

Return type:

bool