Phenotyping (Stage E)
Compute song phenotypes from clustering results (Stage E).
For each bird, this module ingests the top-ranked HDBSCAN clusterings produced by Stage D, reconstructs syllable sequences from raw spec files, and computes a standardised set of phenotypic measures:
Vocabulary: repertoire size and syllable proportions.
Entropy: song-level and transition-level information content, computed both including and excluding introductory notes (entropy_excl_intro/entropy_scaled_excl_intro).
Transitions: first-order Markov transition matrices, with syllable types ordered by their mean position_in_song from syllable_features.csv (so the matrix follows typical song order).
Repeats: detection and characterisation of stereotyped syllable repetitions (dyads, longer motifs).
Intro notes: auto-detection of introductory notes (syllables that disproportionately appear at position 0), with recurrence and count stats.
Results are written to phenotype_results.csv in the run’s results/
directory, and detailed data structures (for catalog generation) are
pickled to stages/05_phenotype/.
Public API
phenotype_bird(): run Stage E for a single bird.
detect_intro_notes(): identify intro note syllable types from a sequence.
PhenotypingConfig: configurable analysis parameters.
- class song_phenotyping.phenotyping.PhenotypingConfig(min_syllable_proportion=0.02, repeat_significance_threshold=0.4, repeat_candidate_range=None, dyad_threshold=0.95, adaptive_repeat_factor=0.25, intro_note_position_threshold=0.5, use_top_n_clusterings=5, generate_plots=True, figure_dpi=300, heatmap_annotation_size=8)
Bases: object
Configurable parameters for Stage E phenotyping analysis.
- Parameters:
min_syllable_proportion (float, optional) – Minimum fraction of all syllables a type must represent to be counted toward repertoire size. Default is 0.02.
repeat_significance_threshold (float, optional) – Minimum fraction of a syllable’s instances that must participate in a repeat for that repeat to be considered significant. Default is 0.4.
repeat_candidate_range (range or None, optional) – Repeat lengths (number of consecutive syllables) to search. Defaults to range(2, 20) when None.
dyad_threshold (float, optional) – Fraction of length-2 repeats above which a syllable is classified as a dyad. Default is 0.95.
adaptive_repeat_factor (float, optional) – Scales the per-syllable minimum prevalence used to filter spurious repeats. Default is 0.25.
use_top_n_clusterings (int, optional) – Number of top-ranked clustering results (from master_summary.csv) to process. Default is 5.
generate_plots (bool, optional) – Whether to save scatter plots and heatmaps. Default is True.
figure_dpi (int, optional) – Resolution of saved figures in dots per inch. Default is 300.
intro_note_position_threshold (float, optional) – Minimum fraction of songs that must open with a given syllable type for it to be classified as an introductory note. For example, a value of 0.5 means the label must be the first syllable in at least half of all songs. This is independent of how many times the intro note repeats per song. Default is 0.5.
heatmap_annotation_size (int, optional) – Font size for transition-matrix heatmap annotations. Default is 8.
Examples
>>> cfg = PhenotypingConfig(min_syllable_proportion=0.05, generate_plots=False)
>>> cfg.repeat_candidate_range
range(2, 20)
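As a rough illustration of how min_syllable_proportion shapes repertoire size, the sketch below counts only syllable types whose share of all syllables reaches the threshold. The helper repertoire_size and the toy sequence are hypothetical and not part of song_phenotyping; the actual Stage E code may treat start/stop tokens and boundary cases differently.

```python
from collections import Counter

def repertoire_size(labels, min_proportion=0.02):
    """Count syllable types whose share of all syllables is >= min_proportion.

    Hypothetical helper for illustration; not the package's implementation.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(1 for n in counts.values() if n / total >= min_proportion)

# Toy sequence: 'x' occurs once in 100 syllables (1%), below the 2% default.
toy = ["a"] * 50 + ["b"] * 49 + ["x"]
print(repertoire_size(toy))        # rare type 'x' is excluded
print(repertoire_size(toy, 0.01))  # lowering the threshold admits 'x'
```

Raising the threshold makes repertoire size less sensitive to small spurious clusters, at the cost of ignoring genuinely rare syllable types.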
- song_phenotyping.phenotyping.analyze_repeats(syllables, handler, config)
Analyze syllable repeat patterns and statistics.
- Args:
syllables: Complete syllable sequence including start/stop tokens
handler: LabelHandler for the specific label type
config: Configuration object with repeat analysis parameters
- Returns:
Dictionary with repeat analysis results
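One plausible way to measure repeats, shown here only as a sketch, is to collect runs of consecutive identical labels and then ask what fraction of a label's instances fall inside such runs (compare repeat_significance_threshold) and what fraction of its runs have length 2 (compare dyad_threshold). The function names below are hypothetical, and the real analyze_repeats() may define these quantities differently.

```python
from collections import Counter, defaultdict
from itertools import groupby

def repeat_run_lengths(labels):
    """Map each label to a Counter {run_length: count} for runs of length >= 2."""
    runs = defaultdict(Counter)
    for label, group in groupby(labels):
        length = sum(1 for _ in group)
        if length >= 2:
            runs[label][length] += 1
    return dict(runs)

def fraction_in_repeats(labels, label):
    """Fraction of a label's instances that sit inside a run of length >= 2."""
    total = labels.count(label)
    if total == 0:
        return 0.0
    run_counts = repeat_run_lengths(labels).get(label, {})
    return sum(length * n for length, n in run_counts.items()) / total

def dyad_fraction(run_counts):
    """Fraction of a label's repeat runs that have length exactly 2."""
    n_runs = sum(run_counts.values())
    return run_counts.get(2, 0) / n_runs if n_runs else 0.0

toy = list("aabaaabccb")  # runs: aa, b, aaa, b, cc, b
runs = repeat_run_lengths(toy)
print(runs)
print(fraction_in_repeats(toy, "a"), fraction_in_repeats(toy, "b"))
print(dyad_fraction(runs["a"]), dyad_fraction(runs["c"]))
```

Because start/stop tokens separate songs in the input sequence, a run can never span a song boundary in this scheme.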
- Parameters:
handler (LabelHandler)
config (PhenotypingConfig)
- Return type:
- song_phenotyping.phenotyping.analyze_transitions(syllables, handler, label_position_order=None)
Analyze syllable transition patterns and matrices.
- Args:
syllables: Complete syllable sequence including start/stop tokens
handler: LabelHandler for the specific label type
label_position_order: Optional {label: mean_position_in_song} mapping used to sort syllable types by their typical song position rather than alphabetically.
- Returns:
Dictionary with transition analysis results
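To make the position ordering concrete, here is a minimal sketch of a row-normalised first-order transition matrix whose rows and columns follow a {label: mean_position_in_song} mapping. The helper transition_matrix and the example mapping are hypothetical; the package's own matrix construction, including how it treats start/stop tokens, may differ.

```python
def transition_matrix(labels, order):
    """Row-normalised first-order transition probabilities.

    Rows and columns follow `order`. Pairs involving labels outside `order`
    (for example start/stop tokens) are skipped, which is an assumption of
    this sketch.
    """
    index = {lab: i for i, lab in enumerate(order)}
    n = len(order)
    counts = [[0] * n for _ in range(n)]
    for prev, nxt in zip(labels, labels[1:]):
        if prev in index and nxt in index:
            counts[index[prev]][index[nxt]] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # Rows with no outgoing transitions stay all-zero.
        probs.append([c / total if total else 0.0 for c in row])
    return probs

# Hypothetical mean positions; sorting by value gives typical song order.
label_position_order = {"c": 2.1, "a": 0.3, "b": 1.2}
order = sorted(label_position_order, key=label_position_order.get)
probs = transition_matrix(list("abcabcabb"), order)
for lab, row in zip(order, probs):
    print(lab, [round(p, 2) for p in row])
```

With this ordering, a stereotyped song concentrates probability mass just above the diagonal, which is easier to read in a heatmap than an alphabetical layout.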
- song_phenotyping.phenotyping.analyze_vocabulary_and_entropy(syllables, handler, config, label_position_order=None)
Analyze vocabulary size, syllable proportions, and sequence entropy.
- Args:
syllables: Complete syllable sequence including start/stop tokens
handler: LabelHandler for the specific label type
config: Configuration object
label_position_order: Optional {label: mean_position_in_song} mapping for position-ordered transition matrices.
- Returns:
Dictionary with vocabulary and entropy metrics
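For orientation, the sketch below computes Shannon entropy over syllable proportions, a version scaled by the maximum possible entropy for the observed number of types, and the same quantity after dropping intro-note labels. These are common textbook definitions; whether Stage E's entropy, entropy_scaled and *_excl_intro metrics use exactly these formulas is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (bits) of the label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def scaled_entropy(labels):
    """Entropy divided by log2(number of types); 0.0 when fewer than two types."""
    k = len(set(labels))
    return shannon_entropy(labels) / math.log2(k) if k > 1 else 0.0

def drop_labels(labels, excluded):
    """Remove every occurrence of the excluded labels (e.g. intro notes)."""
    return [lab for lab in labels if lab not in excluded]

toy = list("iiabcabc")  # 'i' plays the role of an intro note
no_intro = drop_labels(toy, {"i"})
print(shannon_entropy(toy))       # four equally frequent types
print(shannon_entropy(no_intro))  # three equally frequent types
print(scaled_entropy(no_intro))
```

Excluding intro notes matters because a long run of identical opening notes can dominate the label distribution and mask variation in the rest of the song.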
- Parameters:
handler (LabelHandler)
config (PhenotypingConfig)
label_position_order (Dict | None)
- Return type:
- song_phenotyping.phenotyping.calculate_phenotypes_for_label_type(syllables, label_type, bird_name, config, clustering_metadata=None, label_position_order=None)
Calculate all phenotype metrics for one label type.
- Args:
syllables: List of syllable labels including start/end tokens
label_type: 'manual' or 'hdbscan'
bird_name: Bird identifier
config: Configuration object
clustering_metadata: Optional dict of clustering metadata to merge into results
label_position_order: Optional {label: mean_position_in_song} mapping. When provided, syllable types in transition matrices are ordered by their typical position in the song rather than alphabetically.
- Returns:
Dictionary of phenotype metrics
- song_phenotyping.phenotyping.create_unified_phenotype_row(bird_name, manual_results, auto_results, clustering_results, config, tempo_stats=None)
Create unified phenotype DataFrame with manual labels as first row (if available), followed by ranked automated results.
- Args:
bird_name: Bird identifier
manual_results: Dictionary with manual phenotype results
auto_results: List of dictionaries with automated phenotype results
clustering_results: List of dictionaries with clustering metadata
config: Configuration object
- Returns:
DataFrame with phenotype data (manual first, then ranked automated)
- song_phenotyping.phenotyping.detect_intro_notes(syllables, handler, config)
Identify introductory note syllable types from a syllable sequence.
A syllable type is classified as an introductory note when it is the first syllable in at least a fraction config.intro_note_position_threshold of all songs. This is robust to intro notes that repeat (e.g. 10 consecutive intro notes at song onset): only the song-level question matters, namely “did this song open with this label?”, not “what fraction of this label’s instances were first?”.
For each identified intro note type the function computes:
Whether the intro note also recurs at non-first positions within songs (intro_recurs_in_song).
Mean and std of intro note counts per song.
The full repeat-length distribution of consecutive intro notes that open each song.
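The song-level rule described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the token strings "<s>" and "</s>" are placeholders (in the real code the LabelHandler supplies the start/end tokens), and the helper names are hypothetical.

```python
from collections import Counter

def split_songs(syllables, start="<s>", end="</s>"):
    """Split a flat token sequence into per-song label lists (tokens are placeholders)."""
    songs, current = [], None
    for tok in syllables:
        if tok == start:
            current = []
        elif tok == end:
            if current:
                songs.append(current)
            current = None
        elif current is not None:
            current.append(tok)
    return songs

def intro_labels(songs, threshold=0.5):
    """Labels that open at least `threshold` of all songs."""
    openers = Counter(song[0] for song in songs)
    return [lab for lab, n in openers.items() if n / len(songs) >= threshold]

def opening_run_length(song, intro):
    """Number of consecutive intro-note labels at the start of a song."""
    n = 0
    for tok in song:
        if tok not in intro:
            break
        n += 1
    return n

seq = ["<s>", "i", "i", "a", "b", "</s>",
       "<s>", "i", "a", "b", "i", "</s>",
       "<s>", "a", "b", "</s>"]
songs = split_songs(seq)
intro = intro_labels(songs)  # 'i' opens 2 of 3 songs
runs = [opening_run_length(s, set(intro)) for s in songs]
# True when an intro label reappears after the opening run in any song.
recurs = any(tok in intro for s, r in zip(songs, runs) for tok in s[r:])
print(intro, runs, dict(Counter(runs)), recurs)
```

Counting openers per song rather than per syllable instance is what keeps the rule stable when a bird sings many intro notes in a row: a song with ten opening notes still contributes one opener.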
- Parameters:
syllables (list) – Full syllable sequence including start/end tokens.
handler (LabelHandler) – Provides start/end token values.
config (PhenotypingConfig) – intro_note_position_threshold controls detection sensitivity. Interpreted as the minimum fraction of songs that must open with a given label for it to be classified as an intro note.
- Returns:
Keys:
has_intro_notes (bool): Whether any intro note types were detected.
intro_note_labels (list): Syllable type labels classified as intro notes.
intro_recurs_in_song (bool): True if any intro note type appears at non-opening positions in some songs (i.e. after a non-intro syllable has been sung).
mean_intro_count_per_song (float): Mean number of intro note syllables per song (across all songs).
std_intro_count_per_song (float): Standard deviation of intro note count per song.
intro_repeat_distribution (dict): Mapping {run_length: count} of consecutive opening intro notes.
intro_note_labels_set (set): Fast-lookup set of intro label values (for filtering).
- Return type:
- song_phenotyping.phenotyping.generate_manual_umap_plot(bird_path, syllable_data, manual_results, clustering_result, config, run_name='default')
Generate UMAP plot colored by manual labels to match automated clustering plots.
- song_phenotyping.phenotyping.load_bird_syllable_data(bird_path, run_name='default')
Load manual syllable labels from a bird’s Stage A spec files.
- Parameters:
- Returns:
Keys:
'manual_syllables' (list), 'song_paths' (list of filenames), 'bird_name' (str), 'syllables_dir' (str). 'manual_syllables' is empty when no manual labels are found.
- Return type:
- song_phenotyping.phenotyping.load_clustering_labels_for_syllables(clustering_result, syllable_data)
Load clustering labels and map them to syllable sequences. Works with spec files in syllable_data/specs/.
- song_phenotyping.phenotyping.load_clustering_results(bird_path, top_n=5, run_name='default')
Load the top top_n clustering results from master_summary.csv.
- Parameters:
- Returns:
One dict per row with keys:
'rank', 'label_path', 'composite_score', 'nmi', 'silhouette', 'dbi', 'n_clusters', 'clustering_method', 'n_neighbors', 'min_dist', 'metric', 'min_cluster_size', 'min_samples'. Returns an empty list if the CSV is absent or cannot be read.
- Return type:
- song_phenotyping.phenotyping.load_tempo_stats(syllables_dir)
Compute per-bird tempo summary statistics from Stage-A HDF5 tempo fields.
Iterates over every HDF5 file in syllables_dir and collects:
per-song syllable rate from inter-onset intervals
dominant high-band tempo peak (first non-NaN entry of tempo_high_band_freqs)
low-band tempo peak (tempo_low_band_freq)
Returns a flat dict of mean/std for each quantity (NaN when no data).
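The first of these quantities, per-song syllable rate, can be illustrated without any HDF5 access. The sketch below assumes onset times in seconds and takes the rate as the reciprocal of the mean inter-onset interval; the helper names, the units, and the use of the sample standard deviation are assumptions, since the source does not state how load_tempo_stats() computes them.

```python
import math
import statistics

def syllable_rate(onsets_s):
    """Syllables per second from onset times (assumed seconds); NaN if < 2 onsets."""
    if len(onsets_s) < 2:
        return math.nan
    intervals = [b - a for a, b in zip(onsets_s, onsets_s[1:])]
    return 1.0 / statistics.mean(intervals)

def mean_std(values):
    """Mean and sample std of the non-NaN values; (NaN, NaN) when there are none."""
    clean = [v for v in values if not math.isnan(v)]
    if not clean:
        return math.nan, math.nan
    std = statistics.stdev(clean) if len(clean) > 1 else 0.0
    return statistics.mean(clean), std

# Toy onset times for three songs; the last has too few onsets to yield a rate.
songs_onsets = [[0.0, 0.25, 0.5, 0.75], [0.0, 0.5, 1.0], [0.3]]
rates = [syllable_rate(o) for o in songs_onsets]
print(rates)
print(mean_std(rates))
```

Skipping NaN entries before summarising mirrors the documented behaviour of returning NaN only when no data are available at all.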
- song_phenotyping.phenotyping.main(save_path)
Main function to run the unified phenotyping pipeline.
- Args:
save_path: Root directory containing bird data
- Parameters:
save_path (str)
- Return type:
None
- song_phenotyping.phenotyping.phenotype_bird(bird_path, config=None, run_name='default')
Run the complete Stage E phenotyping pipeline for one bird.
Processes manual labels (when available) and the top automated clustering results from Stage D. Writes:
<bird_path>/phenotype_results.csv: one row per clustering rank.
<bird_path>/syllable_data/phenotype_detailed/automated_phenotype_data_rank*.pkl: detailed data structures for PDF generation.
- Parameters:
bird_path (str) – Absolute path to the bird’s root directory.
config (PhenotypingConfig or None, optional) – Analysis parameters. Uses default PhenotypingConfig when None.
run_name (str)
- Returns:
True on success, False if a fatal error occurred.
- Return type:
Examples
>>> from song_phenotyping.phenotyping import phenotype_bird, PhenotypingConfig
>>> cfg = PhenotypingConfig(generate_plots=False)
>>> phenotype_bird("/Volumes/Extreme SSD/pipeline_runs/or18or24", cfg)
True
See also
song_phenotyping.labelling.label_bird: Stage D (produces input).
- song_phenotyping.phenotyping.plot_repeat_patterns(plots_dir, repeat_counts, rank_str, bird_name, config)
Generate and save repeat pattern heatmap.
- Args:
plots_dir: Directory to save plot
repeat_counts: DataFrame with repeat counts by syllable and repeat length
rank_str: Rank identifier (e.g., "rank0")
bird_name: Bird identifier
config: Configuration object
- Returns:
Path to generated image file
- Parameters:
plots_dir (str)
repeat_counts (DataFrame)
rank_str (str)
bird_name (str)
config (PhenotypingConfig)
- Return type:
- song_phenotyping.phenotyping.plot_transition_matrices(plots_dir, auto_result, rank_str, bird_name, config)
Generate and save transition matrix heatmaps.
- Args:
plots_dir: Directory to save plots
auto_result: Automated phenotype results containing transition matrices
rank_str: Rank identifier (e.g., "rank0")
bird_name: Bird identifier
config: Configuration object
- Returns:
List of paths to generated image files
- song_phenotyping.phenotyping.plot_vocabulary_comparison(plots_dir, manual_results, auto_results, bird_name, config)
Generate and save vocabulary size comparison plot between manual and automated labels.
- Args:
plots_dir: Directory to save plot
manual_results: Manual phenotype results
auto_results: List of automated phenotype results
bird_name: Bird identifier
config: Configuration object
- Returns:
Path to generated image file
- song_phenotyping.phenotyping.save_detailed_phenotype_data(bird_path, manual_results, auto_results, clustering_results, run_name='default')
Pickle detailed phenotype data structures for downstream PDF generation.
Writes one .pkl file per clustering rank to <bird_path>/syllable_data/phenotype_detailed/.
- Parameters:
bird_path (str) – Bird root directory.
manual_results (dict) – Phenotype results computed from manual labels.
auto_results (list of dict) – Phenotype results for each automated clustering rank.
clustering_results (list of dict) – Clustering metadata dicts (as returned by load_clustering_results()).
run_name (str)
- Returns:
Success status.
- Return type:
bool