# BirdNET Audio Prediction Script

This script loads a WAV file and uses the BirdNET ONNX model to predict bird species from audio recordings. It supports single-window analysis (the first 3 seconds only) and moving-window analysis (the entire file), and maps predictions to species names.

## Features

- **Species Name Mapping**: Uses `BirdNET_GLOBAL_6K_V2.4_Labels.txt` to display bird species names instead of class indices
- **Moving Window Analysis**: Analyzes entire audio files using overlapping 3-second windows
- **Single Window Mode**: Quick analysis of just the first 3 seconds
- **Configurable Parameters**: Adjustable confidence threshold, window overlap ratio, result count, and batch size
- **Detection Summary**: Per-species overview of detections with timestamps and confidence scores

## Requirements

- Python 3.7+
- The model expects audio input of exactly 3 seconds at a 48 kHz sample rate (144,000 samples)
- BirdNET labels file: `BirdNET_GLOBAL_6K_V2.4_Labels.txt`

## Installation

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Required packages:

- `numpy>=1.21.0`
- `librosa>=0.9.0`
- `onnxruntime>=1.12.0`

## Usage

### Moving Window Analysis (Full File)

Analyze the entire audio file with overlapping windows:

```bash
python predict_audio.py audio.wav
```

### Single Window Analysis (First 3 Seconds Only)

Quick analysis of just the beginning:

```bash
python predict_audio.py audio.wav --single-window
```

### Advanced Usage Examples

```bash
# High-sensitivity analysis (threshold below the 0.1 default) with more results
python predict_audio.py audio.wav --confidence 0.05 --top-k 15

# Fine-grained analysis with 75% window overlap
python predict_audio.py audio.wav --overlap 0.75 --confidence 0.3

# Custom model and labels files
python predict_audio.py audio.wav --model custom_model.onnx --labels custom_labels.txt
```

### Command Line Arguments

- `audio_file`: Path to the WAV audio file (required)
- `--model`: Path to the ONNX model file (default: `model.onnx`)
- `--labels`: Path to the species labels file (default: `BirdNET_GLOBAL_6K_V2.4_Labels.txt`)
- `--top-k`: Number of top predictions to show (default: 5)
- `--overlap`: Window overlap ratio, 0.0-1.0 (default: 0.5, i.e. 50% overlap)
- `--confidence`: Minimum confidence threshold for detections (default: 0.1)
- `--batch-size`: Number of windows per inference batch (default: 128)
- `--single-window`: Analyze only the first 3 seconds instead of the full file
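
The flags and defaults listed above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual source: the function name `build_parser` and the help strings are assumptions, and only the flag names and default values come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Predict bird species from audio with BirdNET"
    )
    parser.add_argument("audio_file", help="Path to the WAV audio file")
    parser.add_argument("--model", default="model.onnx",
                        help="Path to the ONNX model file")
    parser.add_argument("--labels", default="BirdNET_GLOBAL_6K_V2.4_Labels.txt",
                        help="Path to the species labels file")
    parser.add_argument("--top-k", type=int, default=5,
                        help="Number of top predictions to show")
    parser.add_argument("--overlap", type=float, default=0.5,
                        help="Window overlap ratio")
    parser.add_argument("--confidence", type=float, default=0.1,
                        help="Minimum confidence threshold")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="Windows per inference batch")
    parser.add_argument("--single-window", action="store_true",
                        help="Analyze only the first 3 seconds")
    return parser


# argparse turns "--top-k" into the attribute name "top_k", and so on.
args = build_parser().parse_args(["audio.wav", "--overlap", "0.75"])
print(args.audio_file, args.overlap, args.top_k, args.single_window)
```

Flags that are omitted on the command line fall back to the defaults, so the call above still reports `top_k` as 5.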

## Output Examples

### Single Window Output

```
Loading labels from: BirdNET_GLOBAL_6K_V2.4_Labels.txt
Loaded 6522 species labels
Loading ONNX model: model.onnx
Loading first 3 seconds of audio file: bird_recording.wav
Audio loaded successfully. Shape: (144000,)
Running inference on single window...

Top 5 predictions for first 3 seconds:
1. American Robin: 0.892456
2. Song Sparrow: 0.234567
3. House Finch: 0.123789
4. Northern Cardinal: 0.089234
5. Blue Jay: 0.056789
```

### Moving Window Output

```
Loading labels from: BirdNET_GLOBAL_6K_V2.4_Labels.txt
Loaded 6522 species labels
Loading ONNX model: model.onnx
Loading full audio file: long_recording.wav
Audio loaded successfully. Duration: 45.32 seconds
Creating windows with 50% overlap...
Created 28 windows of 3 seconds each
Running inference on all windows...
Processing window 1/28 (t=0.0s)
Processing window 11/28 (t=15.0s)
Processing window 21/28 (t=30.0s)
Completed inference on 28 windows
Analyzing detections with confidence threshold 0.1...

=== DETECTION SUMMARY ===
Audio duration: 45.32 seconds
Windows analyzed: 28
Species detected (>0.10 confidence): 4

Top detections:

American Robin
Max confidence: 0.892456
Detections: 12
Time range: 0.0s - 18.0s
1.5s: 0.892456
3.0s: 0.845231
4.5s: 0.723456

Song Sparrow
Max confidence: 0.567890
Detections: 6
Time range: 22.5s - 36.0s
24.0s: 0.567890
25.5s: 0.445678
27.0s: 0.334567

House Finch
Max confidence: 0.345678
Detections: 3
Time range: 39.0s - 42.0s
39.0s: 0.345678
```
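
A per-species summary like the one shown above can be built by grouping per-window detections by species. The sketch below is illustrative only: the helper name `summarize`, the `(start_time_s, species, confidence)` tuple layout, and the strict `>` comparison against the threshold are assumptions, not details taken from the script.

```python
from collections import defaultdict


def summarize(detections, threshold=0.1):
    """Group (start_time_s, species, confidence) tuples by species.

    Hypothetical helper: returns one dict per species with the maximum
    confidence, the number of detections, and the first/last window start.
    """
    by_species = defaultdict(list)
    for start_s, species, confidence in detections:
        if confidence > threshold:  # assumed: strictly above the threshold
            by_species[species].append((start_s, confidence))

    summary = []
    for species, hits in by_species.items():
        times = [t for t, _ in hits]
        summary.append({
            "species": species,
            "max_confidence": max(c for _, c in hits),
            "detections": len(hits),
            "time_range": (min(times), max(times)),
        })
    # Most confident species first.
    summary.sort(key=lambda row: row["max_confidence"], reverse=True)
    return summary


example = [
    (0.0, "American Robin", 0.80),
    (1.5, "American Robin", 0.89),
    (3.0, "Song Sparrow", 0.05),   # below threshold, ignored
    (24.0, "Song Sparrow", 0.57),
]
for row in summarize(example):
    print(row)
```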

## Technical Details

### Model Input/Output

- **Input**: Audio array of shape `[batch_size, 144000]` (3 seconds at 48 kHz)
- **Output**: One classification score per species label (6522 scores per window)
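
To make the input and output shapes concrete, here is a hedged sketch of one inference call and a top-k lookup. The helper names `run_batch` and `top_k` are hypothetical. The sketch assumes `session` is an `onnxruntime.InferenceSession`, that the model's first output holds the scores, and that score index `i` corresponds to line `i` of the labels file.

```python
import heapq


def run_batch(session, batch):
    """Run one inference call and return the score matrix.

    `session` is expected to be an onnxruntime.InferenceSession, created for
    example with onnxruntime.InferenceSession("model.onnx"). `batch` is a
    float32 array of shape [batch_size, 144000].
    """
    input_name = session.get_inputs()[0].name
    # Passing None as the first argument requests all model outputs.
    outputs = session.run(None, {input_name: batch})
    return outputs[0]  # assumed: the first output holds the class scores


def top_k(scores, labels, k=5):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return heapq.nlargest(k, zip(labels, scores), key=lambda pair: pair[1])


print(top_k([0.1, 0.9, 0.3], ["Blue Jay", "American Robin", "House Finch"], k=2))
```

`top_k` works on a single row of the score matrix, so a batch of windows would call it once per row.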

### Audio Preprocessing

The script automatically handles:

- Loading audio files with librosa (supports WAV, MP3, FLAC, etc.)
- Resampling to 48 kHz if necessary
- Padding with zeros or truncating to exactly 3 seconds (144,000 samples)
- Converting to float32 format
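
The padding, truncation, and float32 steps can be sketched with NumPy alone. The helper name `fit_to_window` is an assumption for illustration. The loading and resampling step is shown only as a comment, since `librosa.load` with `sr=48000` resamples on load, but how the script actually calls it is not documented here.

```python
import numpy as np

SAMPLE_RATE = 48_000
WINDOW_SAMPLES = 3 * SAMPLE_RATE  # 144,000 samples


def fit_to_window(samples, target_len=WINDOW_SAMPLES):
    """Zero-pad or truncate a mono signal to exactly target_len float32 samples."""
    samples = np.asarray(samples, dtype=np.float32)[:target_len]
    padding = target_len - samples.shape[0]
    if padding > 0:
        # np.pad defaults to constant mode, appending zeros at the end here.
        samples = np.pad(samples, (0, padding))
    return samples


# Loading might look like this (assumed usage, not taken from the script):
#   samples, _ = librosa.load("audio.wav", sr=SAMPLE_RATE, mono=True, duration=3.0)
short_clip = fit_to_window(np.ones(1_000))
long_clip = fit_to_window(np.ones(200_000))
print(short_clip.shape, short_clip.dtype, long_clip.shape)
```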

### Moving Window Analysis

- Creates overlapping 3-second windows from the full audio
- The default 50% overlap gives a 1.5-second step, so windows start at 0 s, 1.5 s, 3 s, 4.5 s, and so on
- Higher overlap (e.g., 75%, a 0.75-second step) gives finer time resolution but produces more windows and takes longer
- Each window is analyzed independently, then results are aggregated
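
The relationship between overlap and window start times can be sketched as follows. The function name `window_starts` is hypothetical, and two behaviors are assumptions rather than documented facts: trailing audio that does not fill a whole window is dropped, and an overlap of exactly 1.0 is rejected because it would give a step of zero.

```python
SAMPLE_RATE = 48_000
WINDOW_SAMPLES = 3 * SAMPLE_RATE  # 144,000 samples


def window_starts(num_samples, overlap=0.5):
    """Return start offsets (in samples) of overlapping 3-second windows."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0.0, 1.0)")
    step = max(1, round(WINDOW_SAMPLES * (1.0 - overlap)))
    # A clip shorter than one window still yields a single window at offset 0.
    last_start = max(num_samples - WINDOW_SAMPLES, 0)
    return list(range(0, last_start + 1, step))


six_seconds = 6 * SAMPLE_RATE
print([s / SAMPLE_RATE for s in window_starts(six_seconds, overlap=0.5)])
print([s / SAMPLE_RATE for s in window_starts(six_seconds, overlap=0.75)])
```

For a 6-second clip, 50% overlap gives three windows and 75% overlap gives five, which shows how higher overlap increases the amount of inference work.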

### Batch Processing

- Windows are processed in configurable batches (default: 128 windows per batch)
- Passing many windows to the model in a single inference call is typically much faster than running them one at a time
- Batching also limits how many windows are passed to the model at once, and progress is reported during inference
- The best batch size depends on available system memory and model complexity
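
Splitting the windows into batches can be as simple as slicing in fixed-size steps. This is a minimal sketch with a hypothetical helper name, `iter_batches`. It works on any sliceable sequence, including a NumPy array of shape `[num_windows, 144000]`.

```python
def iter_batches(windows, batch_size=128):
    """Yield consecutive slices of at most batch_size windows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(windows), batch_size):
        yield windows[start:start + batch_size]


# 300 windows with the default batch size: two full batches and a remainder.
sizes = [len(batch) for batch in iter_batches(list(range(300)))]
print(sizes)  # [128, 128, 44]
```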

### Species Labels

- Uses the official BirdNET labels file with 6522 species
- Format: `Scientific name_Common Name`, one label per line
- The script extracts and displays the common name (the part after the underscore)
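
Extracting the common name amounts to splitting each line at the first underscore. The helper names below are hypothetical, and the sketch assumes label lines such as `Turdus migratorius_American Robin`, with the scientific name before the underscore and the common name after it.

```python
def common_name(label_line):
    """Return the common name from a 'Scientific name_Common Name' line."""
    scientific, separator, common = label_line.strip().partition("_")
    # Fall back to the whole line if it contains no underscore.
    return common if separator else scientific


def load_labels(path):
    """Read a labels file and return the common names in file order."""
    with open(path, encoding="utf-8") as handle:
        return [common_name(line) for line in handle if line.strip()]


print(common_name("Turdus migratorius_American Robin"))  # American Robin
```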

## Performance Tips

- Use `--single-window` for quick identification of prominent species
- Increase `--overlap` (0.75-0.9) for detailed analysis of complex recordings
- Lower `--confidence` below the 0.1 default (e.g., 0.05) to catch weaker signals
- Raise `--confidence` (0.3-0.5) to keep only high-confidence detections
- Use `--top-k 1` to see only the most confident detection per analysis
- **Batch Processing**: The default `--batch-size 128` is a reasonable starting point
  - Increase the batch size (256, 512) if you have more RAM or GPU memory available
  - Decrease the batch size (32, 64) if you encounter memory issues
  - Batching matters most on longer audio files, which produce many windows