June 8, 2026
What Whisper Hears When It Listens to Dolphins
Why we did this
The files
| Species | Type | Source | Size |
|---|---|---|---|
| Bottlenose dolphin | Social vocalisations underwater | SoundBible | 126 KB |
| Bottlenose dolphin | Click burst | SoundBible | 69 KB |
| Allied male bottlenose | Burst-pulse coordination call | Zenodo 4943486 | 965 KB |
| Allied male bottlenose | Burst-pulse coordination call | Zenodo 4943486 | 1.4 MB |
| Bottlenose dolphin | Echolocation clicks >48kHz | Zenodo 5138356 | 419 KB |
| Bottlenose dolphin | Broadband clicks >24kHz | Zenodo 5138356 | 1.1 MB |
| Heaviside's dolphin | Relaxed acoustic behaviour | Zenodo 4967891 | 5.9 MB |
What Whisper said
Model: openai/whisper-large-v3 · AI output — not a translation
What it means
"you" — 3 files
Dolphin whistles and sustained vocalisations carry strong harmonic energy around 1kHz — the same frequency range as the long vowel /uː/ in human speech. Whisper is a human speech model. It hears tone, finds the closest phoneme in its vocabulary, and outputs the vowel that fits. This is correct behaviour from a model doing its job. It just wasn't trained for this job.
"Mwah!" — clicks
Echolocation clicks are percussive: brief, broadband energy bursts with no tonal structure. In human phoneme space, this maps cleanly to labial consonants — the sounds made when lips come together (/m/, /w/, /b/). The model is actually responding to real acoustic structure here. Clicks sound like consonants because both are brief transient events. Wrong species, right instinct.
"Thank you." / "All right." — allied male pops
This is the most interesting result. The burst-pulse "pop" calls used by allied male bottlenose dolphins during cooperative behaviour have a two-beat rhythmic structure — a short initial pop followed by a longer burst. Whisper hears a two-syllable prosodic pattern and outputs brief social acknowledgements. That's not random. Both dolphin burst-pulse pops and human phrases like "thank you" serve real-time social coordination functions. The temporal similarity likely reflects shared pragmatic constraints on alliance maintenance signals across species. Whisper stumbled onto something real, by accident.
The takeaway
Whisper is useless as a dolphin communication tool. That was expected — it was trained on human speech and has never encountered a dolphin in its training data. But the outputs are not random noise. Each one is a coherent response to real acoustic features in the recordings: frequency profiles, temporal patterns, energy distributions. This is precisely the problem DolphinGemma is designed to solve. Instead of forcing dolphin vocalisations through human phoneme detectors, it learns dolphin-specific acoustic tokens from 40 years of annotated field recordings. The gap between what Whisper hears and what DolphinGemma eventually identifies will be the clearest possible demonstration of why purpose-built models matter. We'll run this exact same experiment again when DolphinGemma becomes publicly available. That comparison will be the centrepiece of this research log.
Next step
Repeat with DolphinGemma when Google releases public weights. Compare outputs side by side. Write it up here.