Installation
Requirements
Python 3.9 or later
A working conda environment (recommended)
Install from source
Clone the repository and install in editable mode:
git clone https://github.com/annie-taylor/song_phenotyping.git
cd song_phenotyping
conda env create -f environment.yml
conda activate song_phenotyping
pip install -e .
This installs the song_phenotyping package and all runtime dependencies
(NumPy, SciPy, PyTables, UMAP-learn, HDBSCAN, pandas, scikit-learn, PyYAML,
tqdm, matplotlib).
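A quick way to confirm that the environment is complete is to check that each package can be located by the import system. The following is a minimal sketch, not part of the package; the import names (tables for PyTables, umap for UMAP-learn, sklearn for scikit-learn, yaml for PyYAML) are the usual ones for those distributions, and the helper names are hypothetical.

```python
import importlib.util

# Import names for the runtime dependencies listed above.
# Note that several differ from the distribution names used by pip/conda.
EXPECTED_MODULES = (
    "song_phenotyping", "numpy", "scipy", "tables", "umap",
    "hdbscan", "pandas", "sklearn", "yaml", "tqdm", "matplotlib",
)


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report(modules=EXPECTED_MODULES) -> dict:
    """Map each module name to whether it is importable in this environment."""
    return {name: is_installed(name) for name in modules}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name:18s} {'ok' if ok else 'MISSING'}")
```

Run it inside the activated conda environment; any line marked MISSING points to a dependency that did not install.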
Optional extras
To build the documentation locally:
pip install -e ".[docs]"
cd docs && make html
To install development tools (pytest):
pip install -e ".[dev]"
Configuration
The pipeline reads all settings from config.yaml in the project root.
This file is gitignored — each machine keeps its own copy. A fully
annotated template is provided at config.yaml.example.
Step 1 — create your local config
cp config.yaml.example config.yaml
Then open config.yaml and edit the values for your machine. You only
need to set the two required path fields; everything else has a sensible
default.
Step 2 — set required paths
paths:
  local_cache: /Volumes/Extreme SSD # macOS example
  # local_cache: E:/ # Windows example
  run_registry: db.sqlite3 # SQLite run log; relative = project root
pipeline:
  save_path: /Volumes/Extreme SSD/pipeline_runs
  evsong_source: /Volumes/Extreme SSD/birds/evsong
local_cache
    Root of your local data cache (an external SSD or a fast local directory). Bird data and pipeline outputs that need to be kept close to the machine should live under here.
run_registry
    Path to the SQLite database that records every pipeline run. A relative path is resolved from the project root. The default (db.sqlite3) places it in the project root, which is fine for most setups.
save_path
    Root output directory. One sub-directory per bird is created here:

    <save_path>/
    └── or18or24/
        ├── stages/     ← per-stage HDF5 artefacts
        └── results/    ← human-facing CSVs, plots, HTML catalogs
evsong_source
    Parent directory that contains evsonganaly bird folders. Each bird is expected at <evsong_source>/<bird_id>/source/. Set to null to skip evsonganaly processing.
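The path rules above can be sketched in a few lines of Python. This is an illustration of the documented behaviour only; the function names are hypothetical and the real resolution logic lives inside the package.

```python
from pathlib import Path
from typing import Optional


def resolve_run_registry(run_registry: str, project_root: Path) -> Path:
    """Relative registry paths are taken relative to the project root."""
    p = Path(run_registry)
    return p if p.is_absolute() else project_root / p


def bird_dirs(save_path: str, bird_id: str) -> dict:
    """Per-bird output layout: <save_path>/<bird_id>/{stages,results}."""
    root = Path(save_path) / bird_id
    return {"stages": root / "stages", "results": root / "results"}


def evsong_bird_source(evsong_source: Optional[str], bird_id: str) -> Optional[Path]:
    """Expected evsonganaly folder for a bird, or None when the source is null."""
    if evsong_source is None:
        return None
    return Path(evsong_source) / bird_id / "source"
```

For example, bird_dirs("/Volumes/Extreme SSD/pipeline_runs", "or18or24")["stages"] points at the directory that holds the per-stage HDF5 artefacts for that bird.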
Step 3 — optional pipeline settings
All keys below have defaults; omit any key to accept the default.
pipeline:
  wseg_metadata: null # WhisperSeg metadata dir; null = skip wseg
  birds: null # null = all; or [or18or24, bu78bu77]
  songs_per_bird: 30 # null = all songs; small number for quick tests
  songs_seed: 42 # null = random; int = reproducible subset
  run_name: null # null = auto-hash; str = fixed name (e.g. "baseline")
  generate_catalog: true # write HTML catalogs after Stage E
wseg_metadata
    Directory containing WhisperSeg-format metadata. Set only if you have wseg annotations; null skips that ingestion path entirely.
birds
    null auto-discovers every bird found under evsong_source / wseg_metadata. To process a subset, provide an explicit list:

    birds: [or18or24, bu78bu77]
songs_per_bird
    Limits the number of songs loaded per bird in Stage A. Useful for quick smoke tests (set to 5–10) without touching the rest of the config. null processes all available songs.
songs_seed
    When songs_per_bird is less than the total available songs, Stage A selects a random subset. Setting a seed makes the selection reproducible across machines and re-runs. null gives a fresh random draw each time.
run_name
    By default (null) the pipeline computes a short hash from all computational parameters and uses it as the run directory name. Changing any parameter automatically produces a new directory, so old results are never silently overwritten. Set a fixed string (e.g. "baseline") if you want to pin a name across runs regardless of parameters.
generate_catalog
    Whether to build HTML syllable catalogs after Stage E completes. The catalogs are written to <save_path>/<bird>/results/catalog/ and can be opened in any browser without a server.
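To see how songs_per_bird and songs_seed interact, here is a minimal sketch of seeded subset selection using only the standard library. It is not the Stage A implementation: the function name is hypothetical, and sorting the inputs first is an assumption made here so that the same seed yields the same subset regardless of directory listing order.

```python
import random
from typing import List, Optional, Sequence


def select_songs(
    songs: Sequence[str],
    songs_per_bird: Optional[int] = None,
    songs_seed: Optional[int] = None,
) -> List[str]:
    """Pick at most `songs_per_bird` songs; None keeps every song."""
    ordered = sorted(songs)  # stable order so a seed is reproducible across machines
    if songs_per_bird is None or songs_per_bird >= len(ordered):
        return ordered
    # random.Random(None) seeds from the OS, giving a fresh draw each run.
    rng = random.Random(songs_seed)
    return sorted(rng.sample(ordered, songs_per_bird))
```

Calling select_songs(files, 30, 42) twice returns the same 30 files, while select_songs(files, 30, None) may differ between calls.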
Spectrogram parameters (Stage A)
pipeline:
  spectrograms:
    nfft: 512
    hop: 1
    target_shape: [513, 300] # [freq_bins, time_bins]
    min_freq: 400.0
    max_freq: 10000.0
    max_dur: 0.080 # seconds
    fs: 32000.0
    save_inst_freq: false
    save_group_delay: false
    duration_feature_weight: 0.0
    use_warping: false
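As a rough guide to what the defaults above imply, the sketch below works out the frequency spacing of a one-sided FFT and which bins fall inside the retained band. It is plain arithmetic under the assumption of a standard real-input FFT of length nfft; it does not reproduce how the pipeline computes or resamples spectrograms, and the function names are hypothetical.

```python
from typing import List, Tuple


def stft_band(nfft: int, fs: float, min_freq: float, max_freq: float) -> Tuple[float, List[int]]:
    """Return (bin width in Hz, indices of FFT bins inside [min_freq, max_freq])."""
    bin_width = fs / nfft          # spacing between adjacent frequency bins
    n_bins = nfft // 2 + 1         # one-sided spectrum of a real signal
    kept = [k for k in range(n_bins) if min_freq <= k * bin_width <= max_freq]
    return bin_width, kept


def max_dur_samples(max_dur: float, fs: float) -> int:
    """Number of audio samples covered by max_dur seconds."""
    return round(max_dur * fs)


if __name__ == "__main__":
    width, kept = stft_band(nfft=512, fs=32000.0, min_freq=400.0, max_freq=10000.0)
    print(width, len(kept), kept[0], kept[-1])
    print(max_dur_samples(0.080, 32000.0))
```

With the default values, the bin width is 62.5 Hz, 154 raw FFT bins (indices 7 to 160) fall inside 400–10000 Hz, and max_dur corresponds to 2560 samples. Doubling nfft halves the bin width, which is the frequency/time trade-off described for nfft.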
nfft
    FFT window size in samples. Larger values give finer frequency resolution at the cost of time resolution.
target_shape
    [freq_bins, time_bins]: the fixed size each syllable spectrogram is resampled to. All downstream stages expect this shape; changing it invalidates any existing Stage A outputs.
min_freq / max_freq
    Frequency band (Hz) retained in the spectrogram. Set to match the typical range of the species you are analysing.
max_dur
    Syllables longer than this duration (seconds) are truncated or split.
save_inst_freq / save_group_delay
    Append an instantaneous-frequency or group-delay channel alongside the magnitude spectrogram. Disabled by default; may improve clustering for some species.
duration_feature_weight
    0.0 (default) disables duration as a feature. Setting to ~1.0 appends a synthetic “duration” time-bin column, giving UMAP a hint about syllable length.
overwrite_existing (Stage A)
    false (default) skips any bird/song whose spectrogram HDF5 file is already present on disk, making re-runs fast. Set to true to force recomputation, which is useful when the source audio has changed or you have altered spectrogram parameters that affect the output (e.g. nfft, target_shape).

Note
    Changing computational parameters (nfft, target_shape, etc.) automatically produces a new run directory via the config hash, so you rarely need to set overwrite_existing: true unless you want to rebuild within the same run.
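The idea of a parameter-derived run name can be illustrated with a short sketch. The hashing scheme shown here (SHA-256 over canonical JSON, truncated to 8 characters) is an assumption chosen for illustration, not the package's actual algorithm; the only behaviour taken from this page is that computational parameters feed the hash while the overwrite flags do not.

```python
import hashlib
import json
from typing import Any, Optional

# Flags that control re-use of files rather than the computation itself.
NON_COMPUTATIONAL_KEYS = frozenset({"overwrite_existing", "overwrite"})


def strip_keys(obj: Any, excluded=NON_COMPUTATIONAL_KEYS) -> Any:
    """Recursively drop excluded keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {k: strip_keys(v, excluded) for k, v in obj.items() if k not in excluded}
    if isinstance(obj, list):
        return [strip_keys(v, excluded) for v in obj]
    return obj


def run_name(params: dict, fixed_name: Optional[str] = None, length: int = 8) -> str:
    """Return the fixed name if given, else a short hash of the parameters."""
    if fixed_name is not None:
        return fixed_name
    # sort_keys makes the serialisation independent of key order in the YAML.
    canonical = json.dumps(strip_keys(params), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]
```

Under this scheme, changing nfft yields a different name and therefore a new run directory, while toggling overwrite_existing leaves the name unchanged.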
Embedding parameters (Stage C)
pipeline:
  embedding:
    n_neighbors: 20
    min_dist: 0.5
    metric: euclidean
    n_components: 2
    subsample_seed: 42
    n_neighbors_grid: [5, 10, 20, 50, 100]
    min_dist_grid: [0.01, 0.05, 0.1, 0.3, 0.5]
    max_workers: null
    overwrite: false
n_neighbors / min_dist / metric
    Default UMAP hyperparameters, used as starting values for the grid search and as the single-run default.
n_neighbors_grid / min_dist_grid
    All combinations are tried during the grid search. The best is chosen by the Stage D clustering metrics. Narrowing these lists speeds up Stage C significantly.
subsample_seed
    When the dataset is too large for UMAP, a random subset of syllables is drawn. This seed makes that draw reproducible.
max_workers
    null lets the pipeline decide how many parallel workers to use based on available memory. Set an integer to cap usage on shared machines.
overwrite (Stage C)
    false (default) skips any UMAP parameter combination whose .h5 embedding file already exists on disk and is compatible with the current data (verified by sample count and hash overlap). This makes iterative runs much faster, since only new grid combinations are computed.

    Set to true to force recomputation of all embeddings. This is needed when the flattened features have changed (e.g. after re-running Stage A or B with different parameters within the same run directory).

Note
    Like overwrite_existing, this flag is excluded from the run-name hash, so toggling it does not create a new run directory.
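The grid search and the overwrite flag can be pictured with the sketch below, which enumerates every (n_neighbors, min_dist) pair and keeps only those that still need computing. The file-naming pattern and function names are hypothetical, and the check is simplified to file existence; the pipeline additionally verifies compatibility by sample count and hash overlap.

```python
from itertools import product
from pathlib import Path
from typing import List, Sequence, Tuple


def embedding_path(out_dir: str, n_neighbors: int, min_dist: float) -> Path:
    """Hypothetical file name for one UMAP parameter combination."""
    return Path(out_dir) / f"umap_nn{n_neighbors}_md{min_dist}.h5"


def pending_combinations(
    n_neighbors_grid: Sequence[int],
    min_dist_grid: Sequence[float],
    out_dir: str,
    overwrite: bool = False,
) -> List[Tuple[int, float]]:
    """Return the grid combinations that must be (re)computed."""
    pending = []
    for n_neighbors, min_dist in product(n_neighbors_grid, min_dist_grid):
        target = embedding_path(out_dir, n_neighbors, min_dist)
        # With overwrite=False, an existing file means the combination is skipped.
        if overwrite or not target.exists():
            pending.append((n_neighbors, min_dist))
    return pending
```

The default grids give 5 × 5 = 25 combinations, so trimming either list reduces Stage C work proportionally.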
Clustering parameters (Stage D)
pipeline:
  labelling:
    metrics: [silhouette, dbi]
    # metric_weights:
    #   silhouette: 2.0
    #   dbi: 1.0
    replace_labels: false
    hdbscan_grid: null
    generate_cluster_pdf: false
    max_workers: null
metrics
    Clustering quality metrics used to rank HDBSCAN results. Supported values: silhouette, dbi, ch, dunn, nmi. ch (Calinski-Harabasz) is excluded by default as it tends to favour large clusters.
metric_weights
    Optional per-metric weights for the composite ranking score. Comment out to use equal weights.
replace_labels (Stage D)
    false (default) skips re-clustering if label files already exist for a parameter combination. Set to true to delete existing label files and force a full re-run; this is equivalent to the overwrite flags above but scoped to Stage D.
hdbscan_grid
    null uses the built-in default grid (min_cluster_size ∈ [5, 20, 60], min_samples ∈ [5, 15]). Override with a custom grid:

    hdbscan_grid:
      min_cluster_size: [10, 30]
      min_samples: [5, 10]
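One way to picture how metrics and metric_weights could combine into a single ranking score is shown below. The formula is an assumption made for illustration (min-max normalisation per metric, then a weighted mean); the pipeline's actual composite score may differ. The only fixed fact used is that a lower Davies-Bouldin index (dbi) indicates better clustering, so it is inverted before weighting.

```python
from typing import Dict, Optional, Sequence

LOWER_IS_BETTER = {"dbi"}  # Davies-Bouldin: smaller values mean tighter clusters


def composite_scores(
    candidates: Dict[str, Dict[str, float]],
    metrics: Sequence[str],
    metric_weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Score each candidate clustering in [0, 1]; higher is better."""
    weights = {m: 1.0 for m in metrics}      # equal weights by default
    weights.update(metric_weights or {})
    total = sum(weights[m] for m in metrics)
    scores = {name: 0.0 for name in candidates}
    for m in metrics:
        values = [c[m] for c in candidates.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for name, c in candidates.items():
            # Min-max normalise; all-equal metrics contribute a neutral 0.5.
            x = 0.5 if span == 0 else (c[m] - lo) / span
            if m in LOWER_IS_BETTER:
                x = 1.0 - x
            scores[name] += weights[m] * x / total
    return scores
```

With this scheme, setting silhouette: 2.0 and dbi: 1.0 makes a candidate that wins on silhouette outrank one that wins only on dbi.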
Phenotyping parameters (Stage E)
pipeline:
  phenotyping:
    min_syllable_proportion: 0.02
    repeat_significance_threshold: 0.4
    dyad_threshold: 0.95
    use_top_n_clusterings: 3
    generate_plots: true
min_syllable_proportion
    Syllable types that appear in fewer than this fraction of songs are excluded from phenotype calculations.
repeat_significance_threshold / dyad_threshold
    Thresholds governing repeat and dyad detection in the transition matrix analysis.
use_top_n_clusterings
    How many top-ranked clusterings (from Stage D) are averaged to produce the final phenotype.
generate_plots
    Write phenotype summary figures to <save_path>/<bird>/results/plots/.
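The effect of min_syllable_proportion can be sketched as a simple filter over per-song label sequences. This follows the wording above (fraction of songs in which a type appears); the function name and the input representation, a list of label lists with one per song, are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Set


def frequent_syllable_types(
    songs: Sequence[Sequence[Hashable]],
    min_syllable_proportion: float = 0.02,
) -> Set[Hashable]:
    """Return syllable types present in at least the given fraction of songs."""
    n_songs = len(songs)
    if n_songs == 0:
        return set()
    counts: Counter = Counter()
    for labels in songs:
        counts.update(set(labels))  # count each type at most once per song
    return {
        label for label, n in counts.items()
        if n / n_songs >= min_syllable_proportion
    }
```

For instance, with four songs and a threshold of 0.5, a type that occurs in only one song (0.25 of songs) is dropped.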
Macaw server configuration
The pipeline can read source audio directly from the Macaw network server. By default it attempts to auto-detect the mount point based on your operating system:
macOS: /Volumes/users/
Linux: /run/user/1005/gvfs/smb-share:server=macaw.local,share=users/
Windows: Z:\\
These defaults are lab-specific and may not match your setup — in
particular the Linux path encodes a user ID (1005) that differs between
accounts, and the Windows drive letter varies by machine.
To set your Macaw mount point explicitly, add it to config.yaml:
paths:
  macaw_root: /Volumes/users # macOS (standard mount)
  # macaw_root: /run/user/1234/gvfs/smb-share:server=macaw.local,share=users/
  # macaw_root: Z:/ # Windows
If Macaw is not used in your workflow, set it to null (or leave it
absent) and the pipeline will simply skip any server-path resolution:
paths:
  macaw_root: null
You can verify the detected root at runtime:
from song_phenotyping.tools.project_config import ProjectConfig
cfg = ProjectConfig.load()
print(cfg.macaw_root) # None if not mounted or not configured
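If you want to work out a suitable macaw_root value for your own machine, the per-OS defaults described above can be approximated as follows. This is a standalone sketch, not the package's detection code: the function names are hypothetical, and it substitutes the current user's ID on Linux rather than the lab-specific 1005.

```python
import os
import platform
from typing import Optional


def default_macaw_root(system: Optional[str] = None, uid: Optional[int] = None) -> Optional[str]:
    """Guess the Macaw mount point from the operating system name."""
    system = system or platform.system()
    if system == "Darwin":  # macOS
        return "/Volumes/users/"
    if system == "Linux":
        if uid is None:
            uid = os.getuid()  # the GVFS mount path embeds the user's numeric ID
        return f"/run/user/{uid}/gvfs/smb-share:server=macaw.local,share=users/"
    if system == "Windows":
        return "Z:\\"  # drive letter varies by machine
    return None


def mounted_or_none(path: Optional[str]) -> Optional[str]:
    """Return the path only if it exists as a directory, else None."""
    if path is None or not os.path.isdir(path):
        return None
    return path


if __name__ == "__main__":
    print(mounted_or_none(default_macaw_root()))
```

If the printed value is None, the share is not mounted at the guessed location, and setting macaw_root explicitly in config.yaml is the more reliable option.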