Research Log

The raw work.

What we tried, what we found, what it means. Written for people following this project from the beginning — not the polished version, the real one.

001

June 8, 2026

What Whisper Hears When It Listens to Dolphins

Audio Analysis · Whisper · First Experiment

Why we did this

We have a demo page that lets visitors upload dolphin audio and get AI analysis. Before we can claim it works, we need to know what it actually does with real dolphin recordings. So we pulled seven files from public sources (five from Zenodo, a scientific repository, and two from SoundBible) and fed them through our Whisper backend one by one. This is the first time we have run any dolphin audio through our own infrastructure. Entry 001.
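
For anyone who wants to reproduce the one-by-one run, the loop can be sketched roughly as follows. This is a minimal sketch rather than the backend itself: it assumes the Hugging Face transformers speech-recognition pipeline, and the names transcribe_files and make_whisper_transcriber, like the "recordings" folder in the usage comment, are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def transcribe_files(paths: Iterable[Path], transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Run each file through `transcribe`, one at a time, and collect the text."""
    results = {}
    for path in paths:
        results[path.name] = transcribe(str(path)).strip()
    return results


def make_whisper_transcriber(model_id: str = "openai/whisper-large-v3") -> Callable[[str], str]:
    """Build a transcriber on top of the transformers ASR pipeline (assumed setup).

    The import is lazy so the loop above can be exercised without the model.
    """
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return lambda audio_path: asr(audio_path)["text"]


# Hypothetical usage (downloads the model; decoding files by path needs ffmpeg):
# transcribe = make_whisper_transcriber()
# outputs = transcribe_files(sorted(Path("recordings").glob("*.wav")), transcribe)
```

Passing the transcriber in as a plain function keeps the loop independent of the model, so the same harness could be pointed at a different model later.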

The files

Species                | Type                            | Source         | Size
Bottlenose dolphin     | Social vocalisations underwater | SoundBible     | 126 KB
Bottlenose dolphin     | Click burst                     | SoundBible     | 69 KB
Allied male bottlenose | Burst-pulse coordination call   | Zenodo 4943486 | 965 KB
Allied male bottlenose | Burst-pulse coordination call   | Zenodo 4943486 | 1.4 MB
Bottlenose dolphin     | Echolocation clicks >48 kHz     | Zenodo 5138356 | 419 KB
Bottlenose dolphin     | Broadband clicks >24 kHz        | Zenodo 5138356 | 1.1 MB
Heaviside's dolphin    | Relaxed acoustic behaviour      | Zenodo 4967891 | 5.9 MB

What Whisper said

Dolphins underwater | "you"
Dolphin clicks      | "Mwah!"
Allied male pops A  | "Thank you."
Allied male pops B  | "All right."
Clicks >48 kHz      | "you"
Broadband clicks    | "Thank you."
Heaviside's dolphin | "you"

Model: openai/whisper-large-v3 · AI output — not a translation
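
To see which recordings share an output, the results can be grouped by transcript. The sketch below only re-types the labels and outputs from the results table; the function name group_by_output is made up for illustration.

```python
from collections import defaultdict

# Labels and outputs taken from the results table.
RESULTS = [
    ("Dolphins underwater", "you"),
    ("Dolphin clicks", "Mwah!"),
    ("Allied male pops A", "Thank you."),
    ("Allied male pops B", "All right."),
    ("Clicks >48 kHz", "you"),
    ("Broadband clicks", "Thank you."),
    ("Heaviside's dolphin", "you"),
]


def group_by_output(results):
    """Map each distinct transcript to the recordings that produced it."""
    groups = defaultdict(list)
    for label, text in results:
        groups[text].append(label)
    return dict(groups)


groups = group_by_output(RESULTS)
# Print the most common outputs first.
for text, labels in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(f"{text!r}: {len(labels)} file(s): {', '.join(labels)}")
```

Grouped this way, the seven files collapse into four distinct outputs, and "Thank you." turns up for two different call types.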

What it means

"you" — 3 files

Dolphin whistles and sustained vocalisations carry strong harmonic energy around 1 kHz, which overlaps the frequency range of the long vowel /uː/ in human speech. Whisper is a human speech model. It hears a tone, finds the closest-sounding match in its text vocabulary, and outputs the word that fits. This is expected behaviour from a model doing its job. It just wasn't trained for this job.
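
A claim about where the energy sits can be checked by finding the strongest frequency in a clip. The sketch below is a minimal illustration on a synthetic 1 kHz tone, not on the recordings themselves. It uses a naive DFT from the standard library to stay dependency-free; for real files an FFT library would be the practical choice. The function name dominant_frequency is hypothetical.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest DFT bin, ignoring DC."""
    n = len(samples)
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2):
        # Naive DFT for bin k: fine for a short clip, slow for long ones.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        power = abs(acc) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n


SAMPLE_RATE = 16_000  # Hz, chosen for the synthetic example
N = 512               # bin width = 16000 / 512 = 31.25 Hz, so 1 kHz lands on bin 32
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(N)]
peak = dominant_frequency(tone, SAMPLE_RATE)
print(f"Strongest component: {peak:.1f} Hz")
```

Running the same measurement on short windows of each recording would show whether the files that came back as "you" really do peak near 1 kHz.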

"Mwah!" — clicks

Echolocation clicks are percussive: brief, broadband energy bursts with no tonal structure. In human phoneme space, this maps cleanly to labial consonants — the sounds made when lips come together (/m/, /w/, /b/). The model is actually responding to real acoustic structure here. Clicks sound like consonants because both are brief transient events. Wrong species, right instinct.
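
How much of a click reaches the model depends on the band a recording covers. Whisper works on audio resampled to 16 kHz, so only content below the 8 kHz Nyquist limit can reach it. The sketch below works out which part of a band survives, assuming the backend resamples with standard anti-alias filtering. The function name surviving_band and the upper band edges in the examples are hypothetical; the 48 kHz and 24 kHz lower edges come from the file table.

```python
WHISPER_SAMPLE_RATE_HZ = 16_000  # Whisper's input audio is resampled to 16 kHz


def surviving_band(low_hz, high_hz, sample_rate_hz=WHISPER_SAMPLE_RATE_HZ):
    """Return the part of [low_hz, high_hz] below the Nyquist limit, or None."""
    nyquist = sample_rate_hz / 2
    if low_hz >= nyquist:
        return None  # the whole band sits above what the model can see
    return (low_hz, min(high_hz, nyquist))


# Upper edges below are illustrative assumptions, not measured values.
examples = {
    "Echolocation clicks >48 kHz": (48_000, 150_000),
    "Broadband clicks >24 kHz": (24_000, 96_000),
    "Hypothetical wideband click": (500, 20_000),
}
for label, (low, high) in examples.items():
    print(label, "->", surviving_band(low, high))
```

If a file's energy lies entirely above 8 kHz, whatever Whisper transcribes comes from residual low-frequency content or noise rather than from the clicks themselves, which is worth keeping in mind when reading the outputs for the two high-frequency files.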

"Thank you." / "All right." — allied male pops

This is the most interesting result. The burst-pulse "pop" calls used by allied male bottlenose dolphins during cooperative behaviour have a two-beat rhythmic structure: a short initial pop followed by a longer burst. Whisper hears a two-syllable prosodic pattern and outputs brief social acknowledgements. That may not be random. Both dolphin burst-pulse pops and human phrases like "thank you" serve real-time social coordination functions, and the temporal similarity may reflect shared pragmatic constraints on alliance maintenance signals across species. One caution: the broadband click file also came back as "Thank you.", so rhythm alone does not account for that output. Whisper may have stumbled onto something real, by accident.
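
The two-beat description (a short pop, then a longer burst) is something a script can measure. The sketch below segments a signal into bursts by frame energy and reports their durations. It runs on a synthetic stand-in, not on the Zenodo files, and the 5 ms frame size, the 0.1 threshold and the helper names are illustrative assumptions.

```python
import math


def burst_durations(samples, sample_rate, frame_ms=5.0, threshold=0.1):
    """Return durations (s) of contiguous runs of frames whose RMS exceeds threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    durations, run = [], 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms > threshold:
            run += 1  # still inside a burst
        elif run:
            durations.append(run * frame_len / sample_rate)
            run = 0
    if run:  # signal ended mid-burst
        durations.append(run * frame_len / sample_rate)
    return durations


def tone(duration_s, sample_rate, freq_hz=1000.0):
    """Synthetic burst: a plain sine wave."""
    n = int(round(duration_s * sample_rate))
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def silence(duration_s, sample_rate):
    return [0.0] * int(round(duration_s * sample_rate))


SAMPLE_RATE = 8_000
# Stand-in for a pop call: 20 ms burst, 40 ms gap, 60 ms burst, 40 ms gap.
signal = (tone(0.02, SAMPLE_RATE) + silence(0.04, SAMPLE_RATE)
          + tone(0.06, SAMPLE_RATE) + silence(0.04, SAMPLE_RATE))
durations = burst_durations(signal, SAMPLE_RATE)
two_beat = len(durations) == 2 and durations[0] < durations[1]
print(durations, two_beat)
```

Real recordings would need a threshold tuned to their noise floor, but the same measurement would show whether pops A and B share the short-then-long pattern.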

The takeaway

Whisper is useless as a dolphin communication tool. That was expected: it was trained on human speech, not dolphin vocalisations. But the outputs are not random noise. Each one is a coherent response to real acoustic features in the recordings: frequency profiles, temporal patterns, energy distributions.

This is the problem DolphinGemma is designed to solve. Instead of forcing dolphin vocalisations through a human speech model, it learns dolphin-specific acoustic tokens from 40 years of annotated field recordings. The gap between what Whisper hears and what DolphinGemma eventually identifies should be a clear demonstration of why purpose-built models matter. We'll run this exact same experiment again when DolphinGemma becomes publicly available. That comparison will be the centrepiece of this research log.

Next step

Repeat with DolphinGemma when Google releases public weights. Compare outputs side by side. Write it up here.

More entries as the work progresses.

Following along? Get in touch.