Installation

Requirements

Python 3.9 or later
A working conda environment (recommended)

Install from source

Clone the repository and install in editable mode:

git clone https://github.com/annie-taylor/song_phenotyping.git
cd song_phenotyping
conda env create -f environment.yml
conda activate song_phenotyping
pip install -e .

This installs the song_phenotyping package and all runtime dependencies (NumPy, SciPy, PyTables, umap-learn, HDBSCAN, pandas, scikit-learn, PyYAML, tqdm, matplotlib).
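As a quick sanity check after installation, the following sketch reports which of the dependencies listed above cannot be located in the active environment. The import names (tables for PyTables, umap for umap-learn, sklearn for scikit-learn, yaml for PyYAML) are the conventional ones for these packages; missing_modules is a hypothetical helper and not part of song_phenotyping.

```python
import importlib.util

# Conventional import names for the runtime dependencies listed above.
RUNTIME_MODULES = [
    "numpy", "scipy", "tables", "umap", "hdbscan",
    "pandas", "sklearn", "yaml", "tqdm", "matplotlib",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(RUNTIME_MODULES + ["song_phenotyping"])
    print("All dependencies found." if not missing else f"Missing: {missing}")
```

An empty result only shows that the modules can be found; it does not check versions.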

Optional extras

To build the documentation locally:

pip install -e ".[docs]"
cd docs && make html

To install development tools (pytest):

pip install -e ".[dev]"

Configuration

The pipeline reads all settings from config.yaml in the project root. This file is gitignored — each machine keeps its own copy. A fully annotated template is provided at config.yaml.example.

Step 1 — create your local config

cp config.yaml.example config.yaml

Then open config.yaml and edit the values for your machine. You only need to set the two required path fields; everything else has a sensible default.

Step 2 — set required paths

paths:
  local_cache: /Volumes/Extreme SSD   # macOS example
  # local_cache: E:/                  # Windows example

  run_registry: db.sqlite3            # SQLite run log; relative = project root

pipeline:
  save_path: /Volumes/Extreme SSD/pipeline_runs
  evsong_source: /Volumes/Extreme SSD/birds/evsong

local_cache

Root of your local data cache (an external SSD or a fast local directory). Bird data and pipeline outputs that need to stay close to the machine should live under this directory.

run_registry

Path to the SQLite database that records every pipeline run. A relative path is resolved from the project root. The default (db.sqlite3) places it in the project root — fine for most setups.
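The relative-versus-absolute rule can be illustrated with a small sketch. resolve_registry is a hypothetical helper, not the package's own function, and PurePosixPath is used only so the example behaves identically on every platform; real code would more likely use pathlib.Path.

```python
from pathlib import PurePosixPath


def resolve_registry(run_registry, project_root):
    """Resolve run_registry against project_root unless it is already absolute."""
    path = PurePosixPath(run_registry)
    return path if path.is_absolute() else PurePosixPath(project_root) / path


# A relative value lands in the project root; an absolute value is kept as-is.
print(resolve_registry("db.sqlite3", "/home/me/song_phenotyping"))
print(resolve_registry("/data/runs.sqlite3", "/home/me/song_phenotyping"))
```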

save_path

Root output directory. One sub-directory per bird is created here:

<save_path>/
└── or18or24/
    ├── stages/       ← per-stage HDF5 artefacts
    └── results/      ← human-facing CSVs, plots, HTML catalogs
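If you want to locate these directories from your own scripts, a minimal sketch of the layout above might look like this. bird_output_dirs is a hypothetical helper; it only builds paths from the tree shown and does not create anything on disk.

```python
from pathlib import Path


def bird_output_dirs(save_path, bird_id):
    """Return the per-bird stage and result directories under save_path."""
    bird_root = Path(save_path) / bird_id
    return {"stages": bird_root / "stages", "results": bird_root / "results"}


dirs = bird_output_dirs("/Volumes/Extreme SSD/pipeline_runs", "or18or24")
for name, path in dirs.items():
    print(f"{name}: {path}")
```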

evsong_source

Parent directory that contains evsonganaly bird folders. Each bird is expected at <evsong_source>/<bird_id>/source/. Set to null to skip evsonganaly processing.

Step 3 — optional pipeline settings

All keys below have defaults; omit any key to accept the default.

pipeline:
  wseg_metadata: null      # WhisperSeg metadata dir; null = skip wseg
  birds: null              # null = all; or [or18or24, bu78bu77]
  songs_per_bird: 30       # null = all songs; small number for quick tests
  songs_seed: 42           # null = random; int = reproducible subset
  run_name: null           # null = auto-hash; str = fixed name (e.g. "baseline")
  generate_catalog: true   # write HTML catalogs after Stage E

wseg_metadata

Directory containing WhisperSeg-format metadata. Set only if you have wseg annotations; null skips that ingestion path entirely.

birds

null auto-discovers every bird found under evsong_source and wseg_metadata. To process a subset, provide an explicit list:

birds: [or18or24, bu78bu77]
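To preview which birds auto-discovery is likely to pick up from evsong_source, the sketch below lists every folder that contains a source/ sub-directory, following the <evsong_source>/<bird_id>/source/ convention. discover_birds is a hypothetical helper written for illustration; the pipeline's own discovery logic may apply further checks, and the wseg_metadata layout is not covered here.

```python
import tempfile
from pathlib import Path


def discover_birds(evsong_source):
    """List bird IDs that have a <bird_id>/source/ folder under evsong_source."""
    root = Path(evsong_source)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "source").is_dir()
    )


# Demonstrate on a throwaway directory tree instead of real data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "or18or24" / "source").mkdir(parents=True)
    (Path(tmp) / "bu78bu77" / "source").mkdir(parents=True)
    (Path(tmp) / "notes").mkdir()  # no source/ folder, so it is ignored
    found = discover_birds(tmp)

print(found)
```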

songs_per_bird

Limits the number of songs loaded per bird in Stage A. Useful for quick smoke tests (set to 5–10) without touching the rest of the config. null processes all available songs.

songs_seed

When songs_per_bird is less than the total number of available songs, Stage A selects a random subset. Setting a seed makes the selection reproducible across machines and re-runs. null gives a fresh random draw each time.
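The interaction between songs_per_bird and songs_seed can be sketched with the standard library. This is an illustration of seeded subsampling in general, not Stage A's actual code: select_songs is a hypothetical helper, and sorting the input first is an assumption made here so that the same seed yields the same subset regardless of file-listing order.

```python
import random


def select_songs(song_files, songs_per_bird=None, songs_seed=None):
    """Pick a subset of songs; songs_per_bird=None keeps everything."""
    ordered = sorted(song_files)  # stable starting order on every machine
    if songs_per_bird is None or songs_per_bird >= len(ordered):
        return ordered
    rng = random.Random(songs_seed)  # songs_seed=None gives a fresh draw
    return sorted(rng.sample(ordered, songs_per_bird))


songs = [f"song_{i:03d}.wav" for i in range(100)]
first = select_songs(songs, songs_per_bird=5, songs_seed=42)
second = select_songs(list(reversed(songs)), songs_per_bird=5, songs_seed=42)
print(first == second)
```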

run_name

By default (null) the pipeline computes a short hash from all computational parameters and uses it as the run directory name. Changing any parameter automatically produces a new directory, so old results are never silently overwritten. Set a fixed string (e.g. "baseline") if you want to pin a name across runs regardless of parameters.
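The idea behind the auto-generated name can be sketched as follows. The hash function, the digest length, and the helper names (strip_flags, auto_run_name) are assumptions for illustration; only the behaviour comes from this page: computational parameters change the name, while the overwrite_existing and overwrite flags do not.

```python
import hashlib
import json

# Flags this page describes as excluded from the run-name hash.
NON_COMPUTATIONAL_KEYS = {"overwrite_existing", "overwrite"}


def strip_flags(node):
    """Recursively drop non-computational keys from a nested config dict."""
    if isinstance(node, dict):
        return {
            key: strip_flags(value)
            for key, value in node.items()
            if key not in NON_COMPUTATIONAL_KEYS
        }
    return node


def auto_run_name(params, length=8):
    """Hash the remaining parameters into a short, stable hex string."""
    canonical = json.dumps(strip_flags(params), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


base = {"spectrograms": {"nfft": 512, "hop": 1}, "embedding": {"overwrite": False}}
changed = {"spectrograms": {"nfft": 1024, "hop": 1}, "embedding": {"overwrite": False}}
toggled = {"spectrograms": {"hop": 1, "nfft": 512}, "embedding": {"overwrite": True}}

print(auto_run_name(base) == auto_run_name(toggled))  # key order and flags ignored
print(auto_run_name(base) == auto_run_name(changed))  # nfft differs
```

Serialising with sort_keys=True keeps the name independent of the order in which keys appear in config.yaml.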

generate_catalog

Whether to build HTML syllable catalogs after Stage E completes. The catalogs are written to <save_path>/<bird>/results/catalog/ and can be opened in any browser without a server.

Spectrogram parameters (Stage A)

pipeline:
  spectrograms:
    nfft: 512
    hop: 1
    target_shape: [513, 300]   # [freq_bins, time_bins]
    min_freq: 400.0
    max_freq: 10000.0
    max_dur: 0.080             # seconds
    fs: 32000.0
    save_inst_freq: false
    save_group_delay: false
    duration_feature_weight: 0.0
    use_warping: false

nfft: FFT window size in samples. Larger values give finer frequency resolution at the cost of time resolution.
target_shape: [freq_bins, time_bins] — the fixed size each syllable spectrogram is resampled to. All downstream stages expect this shape; changing it invalidates any existing Stage A outputs.
min_freq / max_freq: Frequency band (Hz) retained in the spectrogram. Set to match the typical range of the species you are analysing.
max_dur: Syllables longer than this duration (seconds) are truncated or split.
save_inst_freq / save_group_delay: Append an instantaneous-frequency or group-delay channel alongside the magnitude spectrogram. Disabled by default; may improve clustering for some species.
duration_feature_weight: 0.0 (default) disables duration as a feature. Setting it to ~1.0 appends a synthetic “duration” time-bin column, giving UMAP a hint about syllable length.
overwrite_existing (Stage A): false (default) skips any bird/song whose spectrogram HDF5 file is already present on disk, making re-runs fast. Set to true to force recomputation — useful when the source audio has changed or you have altered spectrogram parameters that affect the output (e.g. nfft, target_shape).

Note

Changing computational parameters (nfft, target_shape, etc.) automatically produces a new run directory via the config hash, so you rarely need to set overwrite_existing: true unless you want to rebuild within the same run.
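To get a feel for the default spectrogram parameters, the arithmetic below applies the standard discrete Fourier transform relations (bin spacing = fs / nfft) to the values shown above. It says nothing about how Stage A resamples to target_shape, which this page does not describe.

```python
import math

fs = 32000.0                          # sampling rate (Hz)
nfft = 512                            # FFT window size (samples)
min_freq, max_freq = 400.0, 10000.0   # retained band (Hz)
max_dur = 0.080                       # longest syllable (s)

bin_width_hz = fs / nfft                        # spacing between FFT bins
window_ms = 1000.0 * nfft / fs                  # duration of one FFT window
first_bin = math.ceil(min_freq / bin_width_hz)  # lowest bin inside the band
last_bin = math.floor(max_freq / bin_width_hz)  # highest bin inside the band
bins_in_band = last_bin - first_bin + 1
max_samples = round(max_dur * fs)               # samples in the longest syllable

print(f"bin width: {bin_width_hz} Hz, window: {window_ms} ms")
print(f"FFT bins between {min_freq} and {max_freq} Hz: {bins_in_band}")
print(f"samples in a {max_dur} s syllable: {max_samples}")
```

Doubling nfft halves the bin width and doubles the window duration, which is the frequency-versus-time trade-off mentioned for nfft.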

Embedding parameters (Stage C)

pipeline:
  embedding:
    n_neighbors: 20
    min_dist: 0.5
    metric: euclidean
    n_components: 2
    subsample_seed: 42
    n_neighbors_grid: [5, 10, 20, 50, 100]
    min_dist_grid: [0.01, 0.05, 0.1, 0.3, 0.5]
    max_workers: null
    overwrite: false

n_neighbors / min_dist / metric

Default UMAP hyperparameters, used as starting values for the grid search and as the single-run default.

n_neighbors_grid / min_dist_grid

All combinations are tried during the grid search. The best is chosen by the Stage D clustering metrics. Narrowing these lists speeds up Stage C significantly.
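The cost of the grid search grows with the product of the two list lengths. A short sketch with itertools.product shows how many UMAP fits the default grids imply and how much a narrowed grid saves; the narrowed values are arbitrary examples.

```python
from itertools import product

n_neighbors_grid = [5, 10, 20, 50, 100]
min_dist_grid = [0.01, 0.05, 0.1, 0.3, 0.5]

# Every (n_neighbors, min_dist) pair is one embedding to compute.
combos = list(product(n_neighbors_grid, min_dist_grid))
print(f"default grid: {len(combos)} embeddings")

# A narrowed grid for faster iteration.
narrow = list(product([10, 20], [0.1, 0.5]))
print(f"narrowed grid: {len(narrow)} embeddings")
```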

subsample_seed

When the dataset is too large for UMAP, a random subset of syllables is drawn. This seed makes that draw reproducible.

max_workers

null lets the pipeline decide how many parallel workers to use based on available memory. Set an integer to cap usage on shared machines.

overwrite (Stage C)

false (default) skips any UMAP parameter combination whose .h5 embedding file already exists on disk and is compatible with the current data (verified by sample count and hash overlap). This makes iterative runs much faster — only new grid combinations are computed.

Set to true to force recomputation of all embeddings. This is needed when the flattened features have changed (e.g. after re-running Stage A or B with different parameters within the same run directory).

Note

Like overwrite_existing, this flag is excluded from the run-name hash — toggling it does not create a new run directory.
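The cache check described for overwrite (matching sample count plus hash overlap) could look roughly like the sketch below. Everything here is an assumption made for illustration: the function name, the idea of one hash per syllable, and the min_overlap threshold are hypothetical, and the pipeline's real criteria may differ.

```python
def embedding_is_compatible(stored_hashes, current_hashes, min_overlap=1.0):
    """Return True if a cached embedding appears to match the current data."""
    if len(stored_hashes) != len(current_hashes):
        return False  # sample count differs, so the cache is stale
    if not current_hashes:
        return True   # nothing to compare
    current = set(current_hashes)
    overlap = len(set(stored_hashes) & current) / len(current)
    return overlap >= min_overlap


print(embedding_is_compatible(["h1", "h2"], ["h2", "h1"]))  # same samples
print(embedding_is_compatible(["h1", "h3"], ["h1", "h2"]))  # one sample changed
```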

Clustering parameters (Stage D)

pipeline:
  labelling:
    metrics: [silhouette, dbi]
    # metric_weights:
    #   silhouette: 2.0
    #   dbi: 1.0
    replace_labels: false
    hdbscan_grid: null
    generate_cluster_pdf: false
    max_workers: null

metrics

Clustering quality metrics used to rank HDBSCAN results. Supported values: silhouette, dbi (Davies-Bouldin index), ch (Calinski-Harabasz), dunn, nmi. ch is excluded by default because it tends to favour large clusters.

metric_weights

Optional per-metric weights for the composite ranking score. Comment out to use equal weights.
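One plausible way to combine weighted metrics into a single ranking score is sketched below. The pipeline's actual normalisation and combination rule are not described on this page, so treat composite_scores as a hypothetical illustration: it min-max normalises each metric across the candidate clusterings, flips dbi because a lower Davies-Bouldin index is better, and takes a weighted average. Only silhouette and dbi are covered.

```python
# True = higher is better, False = lower is better.
HIGHER_IS_BETTER = {"silhouette": True, "dbi": False}


def composite_scores(candidates, weights=None):
    """Score each candidate clustering in [0, 1]; higher means better."""
    weights = weights or {}
    metrics = list(HIGHER_IS_BETTER)
    total = sum(weights.get(m, 1.0) for m in metrics)  # missing weight = 1.0
    scores = {name: 0.0 for name in candidates}
    for m in metrics:
        values = [c[m] for c in candidates.values()]
        lo, hi = min(values), max(values)
        for name, c in candidates.items():
            if hi == lo:
                norm = 1.0  # all candidates tie on this metric
            else:
                norm = (c[m] - lo) / (hi - lo)
                if not HIGHER_IS_BETTER[m]:
                    norm = 1.0 - norm
            scores[name] += weights.get(m, 1.0) * norm / total
    return scores


candidates = {
    "a": {"silhouette": 0.60, "dbi": 0.80},
    "b": {"silhouette": 0.45, "dbi": 0.50},
    "c": {"silhouette": 0.30, "dbi": 1.20},
}
equal = composite_scores(candidates)
dbi_heavy = composite_scores(candidates, {"silhouette": 1.0, "dbi": 3.0})
print(max(equal, key=equal.get), max(dbi_heavy, key=dbi_heavy.get))
```

The example shows why weights matter: the same three clusterings can rank differently once one metric dominates.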

replace_labels (Stage D)

false (default) skips re-clustering if label files already exist for a parameter combination. Set to true to delete existing label files and force a full re-run — equivalent to the overwrite flags above but scoped to Stage D.

hdbscan_grid

null uses the built-in default grid (min_cluster_size ∈ [5, 20, 60], min_samples ∈ [5, 15]). Override with a custom grid:

hdbscan_grid:
  min_cluster_size: [10, 30]
  min_samples: [5, 10]

Phenotyping parameters (Stage E)

pipeline:
  phenotyping:
    min_syllable_proportion: 0.02
    repeat_significance_threshold: 0.4
    dyad_threshold: 0.95
    use_top_n_clusterings: 3
    generate_plots: true

min_syllable_proportion: Syllable types that appear in fewer than this fraction of songs are excluded from phenotype calculations.
repeat_significance_threshold / dyad_threshold: Thresholds governing repeat and dyad detection in the transition matrix analysis.
use_top_n_clusterings: How many top-ranked clusterings (from Stage D) are averaged to produce the final phenotype.
generate_plots: Write phenotype summary figures to <save_path>/<bird>/results/plots/.
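The min_syllable_proportion filter can be illustrated with a small sketch. It follows the description above literally (the fraction of songs in which a syllable type appears); rare_syllable_types is a hypothetical helper and the exact counting rule used by Stage E is an assumption.

```python
def rare_syllable_types(songs, min_syllable_proportion=0.02):
    """Return syllable types present in fewer than the given fraction of songs."""
    n_songs = len(songs)
    if n_songs == 0:
        return []
    song_counts = {}
    for labels in songs:
        for label in set(labels):  # count each type once per song
            song_counts[label] = song_counts.get(label, 0) + 1
    return sorted(
        label
        for label, count in song_counts.items()
        if count / n_songs < min_syllable_proportion
    )


# 100 songs: "a" and "b" everywhere, "y" in three songs, "z" in one song.
songs = [["a", "b"] for _ in range(100)]
for i in range(3):
    songs[i].append("y")
songs[0].append("z")

excluded = rare_syllable_types(songs)
print(excluded)
```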

Macaw server configuration

The pipeline can read source audio directly from the Macaw network server. By default it attempts to auto-detect the mount point based on your operating system:

macOS: /Volumes/users/
Linux: /run/user/1005/gvfs/smb-share:server=macaw.local,share=users/
Windows: Z:\

These defaults are lab-specific and may not match your setup — in particular the Linux path encodes a user ID (1005) that differs between accounts, and the Windows drive letter varies by machine.
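To see which path would apply on your own account, the sketch below builds a candidate mount point per operating system. It is not the package's detection code: default_macaw_root is a hypothetical helper, and unlike the lab default it substitutes the current user's ID into the Linux path instead of 1005.

```python
import os
import platform


def default_macaw_root(system=None, uid=None):
    """Guess the Macaw mount point for an OS, or None if the OS is unknown."""
    system = system or platform.system()
    if system == "Darwin":
        return "/Volumes/users/"
    if system == "Linux":
        # os.getuid() exists only on Unix; pass uid explicitly elsewhere.
        uid = os.getuid() if uid is None else uid
        return f"/run/user/{uid}/gvfs/smb-share:server=macaw.local,share=users/"
    if system == "Windows":
        return "Z:\\"  # the drive letter varies by machine
    return None


candidate = default_macaw_root()
mounted = candidate is not None and os.path.isdir(candidate)
print(f"candidate: {candidate!r}, mounted: {mounted}")
```

If the printed candidate exists on your machine, it is a reasonable value to copy into paths.macaw_root.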

To set your Macaw mount point explicitly, add it to config.yaml:

paths:
  macaw_root: /Volumes/users        # macOS (standard mount)
  # macaw_root: /run/user/1234/gvfs/smb-share:server=macaw.local,share=users/
  # macaw_root: Z:/                 # Windows

If Macaw is not used in your workflow, set macaw_root to null (or omit it) and the pipeline will skip server-path resolution:

paths:
  macaw_root: null

You can verify the detected root at runtime:

from song_phenotyping.tools.project_config import ProjectConfig
cfg = ProjectConfig.load()
print(cfg.macaw_root)   # None if not mounted or not configured