Energy Alone Is Not Enough — The Signal Processing Behind a Voice Activity Detector
Ninety percent accuracy sounds good until you see what the failing 10% does to your downstream pipeline.
The first version of the voice activity detector used a single metric: energy. Compute the root mean square of each audio chunk, convert to decibels, compare against a threshold. If the energy exceeds the threshold, the chunk contains speech. If it does not, it is silence.
This approach got me to about 90% accuracy. The remaining 10% nearly derailed the project.
The 90% trap
The energy detector worked well in controlled conditions. In a quiet room, with a single speaker talking directly into a phone, the energy difference between speech and silence was large and consistent. Threshold tuning was simple.
The failures appeared in environments with intermittent noise. Keyboard clicking — the percussive, sharp-edged kind — registered above the energy threshold. An elevator opening on the caller's floor produced a brief energy spike. A car passing outside, audible through a window, generated a sustained hum that the detector interpreted as speech.
Each false positive had the same downstream consequence: the detector captured a segment of noise, packaged it as a speech segment, and sent it to the transcription API. The transcription service either returned nothing — wasting the API call — or hallucinated a word or two from the noise. Those phantom transcriptions then reached the LLM, which tried to respond to them. The result was the assistant saying things like "I did not quite catch that, could you repeat?" when no one had spoken, or worse, responding to a word that was never said.
I spent more time on this problem than on any other part of the VAD. The initial approach — requiring a minimum duration before accepting a segment as speech — helped with very short transients but did not solve the fundamental issue. Extended keyboard typing lasted long enough to pass any reasonable minimum duration. So did traffic noise.
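The gate itself was trivial. A sketch of the idea, with hypothetical names, that counts consecutive over-threshold chunks before accepting a segment:

// Hypothetical sketch of the minimum-duration gate. With 20ms chunks,
// minChunks = 5 means 100ms of sustained energy before a segment opens.
// Extended typing or traffic easily sustains energy that long.
int consecutiveLoudChunks = 0;

boolean passesDurationGate(double energy, double energyThreshold, int minChunks) {
    if (energy > energyThreshold) {
        consecutiveLoudChunks++;
    } else {
        consecutiveLoudChunks = 0;
    }
    return consecutiveLoudChunks >= minChunks;
}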
The problem was not the threshold. The problem was that energy measures loudness, and loudness is not what distinguishes speech from other sounds.
What makes speech different
Speech has specific frequency characteristics that most environmental noise does not share. Human voice concentrates its energy in the 300 to 3000 Hz range, with a center of mass that sits relatively low in the spectrum. The signal is quasi-periodic — voiced sounds like vowels have a repeating pattern that gives them pitch.
Keyboard clicks, by contrast, are broadband impulses. Their energy is spread across a wide frequency range, with a high center of mass. They cross zero very frequently — the waveform oscillates rapidly because the spectral energy is scattered, not concentrated.
Traffic noise is broadband but less impulsive. An elevator door has a brief, wide-spectrum transient followed by low-frequency rumble.
None of these share the frequency profile of speech. The question was how to measure that difference cheaply enough to run on every 20ms audio chunk in real time.
Three metrics from one pass
I settled on three measurements, all computable from the raw time-domain samples without an FFT. The decision to avoid an FFT was deliberate — full spectral decomposition on 160-sample chunks at telephony sample rates is tractable, but the approximations I found were fast enough and directionally correct for a binary speech/not-speech decision.
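For concreteness, the chunk arithmetic: at the 8 kHz narrowband telephony rate (an assumption, though it follows from 160 samples per 20ms chunk), the sizing works out like this:

// Chunk sizing: 8000 samples/sec * 0.020 sec = 160 samples per chunk
static final int SAMPLE_RATE = 8000; // Hz, narrowband telephony (assumed)
static final int CHUNK_MILLIS = 20;
static final int CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MILLIS / 1000; // = 160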
Energy remains the first gate. It is computed as the RMS of the samples, converted to decibels relative to full scale:
double sumSquares = 0;
for (float sample : audio) {
    sumSquares += sample * sample;
}
double rms = Math.sqrt(sumSquares / audio.length);
double energy = 20 * Math.log10(rms + 1e-10); // dBFS
// 1e-10 prevents log(0) when the chunk is true silence
// Typical range: -50 dBFS (silence) to -10 dBFS (loud speech)
Energy values typically range from around -50 dBFS for silence to -10 dBFS for loud speech. The threshold is set during calibration — more on that in the next post.
Zero-crossing rate counts how many times the signal changes sign, divided by the number of samples:
int crossings = 0;
for (int i = 1; i < audio.length; i++) {
    // A sign change between consecutive samples counts as one crossing
    if ((audio[i] >= 0 && audio[i - 1] < 0)
            || (audio[i] < 0 && audio[i - 1] >= 0)) {
        crossings++;
    }
}
double zcr = (double) crossings / audio.length;
Speech, particularly voiced speech, has a low zero-crossing rate because the waveform follows the pitch period — relatively smooth oscillations at the fundamental frequency. Impulsive noise like keyboard clicks has a high zero-crossing rate because the energy is scattered across many frequencies, producing rapid oscillations. A typical voiced speech segment has a ZCR around 0.05 to 0.15. Keyboard clicking often exceeds 0.3.
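Those numbers are easy to sanity-check: a pure tone crosses zero twice per period, so its ZCR is roughly 2 * f / sampleRate. This standalone snippet (not part of the detector) generates a 400 Hz tone at 8 kHz and prints a ZCR near 0.1, squarely in the voiced-speech range:

// Standalone sanity check, not detector code.
// A tone at f Hz crosses zero twice per period: ZCR ~= 2 * f / sampleRate.
int sampleRate = 8000;
double toneHz = 400; // comparable to a speech fundamental
float[] audio = new float[160]; // one 20ms chunk
for (int i = 0; i < audio.length; i++) {
    audio[i] = (float) Math.sin(2 * Math.PI * toneHz * i / sampleRate);
}
int crossings = 0;
for (int i = 1; i < audio.length; i++) {
    if ((audio[i] >= 0 && audio[i - 1] < 0)
            || (audio[i] < 0 && audio[i - 1] >= 0)) {
        crossings++;
    }
}
System.out.println((double) crossings / audio.length); // ~0.1 = 2 * 400 / 8000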
Spectral centroid measures the center of mass of the frequency spectrum. For an FFT, this is the weighted average of the frequency bins. I computed an approximation directly from the time-domain signal using the ratio of the sum of absolute differences to the sum of absolute values:
double numerator = 0, denominator = 0;
for (int i = 1; i < audio.length; i++) {
    numerator += Math.abs(audio[i] - audio[i - 1]);
    denominator += Math.abs(audio[i]);
}
double spectralCentroid = (denominator > 0)
        ? (sampleRate / (2 * Math.PI)) * (numerator / denominator)
        : 0;
// High-frequency content produces larger sample-to-sample differences
// relative to signal amplitude — which is what makes this approximation work
As the comment notes, the approximation works because high-frequency content produces larger sample-to-sample differences relative to the signal amplitude. For speech, the centroid typically lands around 1000–1500 Hz; keyboard clicks push it above 2500 Hz. The result is less precise than an FFT-derived centroid, but it is fast and directionally correct, which is all a binary decision on 20ms chunks requires.
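All three measurements visit the same samples, which is what makes the single pass in this section's title possible. A sketch of how the loops fold together, assuming an AudioMetrics record whose accessors match the decision logic in the next section:

// Sketch: all three metrics from one pass over a chunk.
// AudioMetrics is assumed to be a simple record; the accessor names
// line up with the decision logic shown below.
record AudioMetrics(double energy, double zcr, double spectralCentroid) {}

AudioMetrics computeMetrics(float[] audio, int sampleRate) {
    double sumSquares = audio[0] * audio[0]; // energy uses every sample
    double sumAbsDiff = 0, sumAbs = 0;       // centroid sums start at i = 1
    int crossings = 0;
    for (int i = 1; i < audio.length; i++) {
        sumSquares += audio[i] * audio[i];
        sumAbsDiff += Math.abs(audio[i] - audio[i - 1]);
        sumAbs += Math.abs(audio[i]);
        if ((audio[i] >= 0 && audio[i - 1] < 0)
                || (audio[i] < 0 && audio[i - 1] >= 0)) {
            crossings++;
        }
    }
    double rms = Math.sqrt(sumSquares / audio.length);
    double energy = 20 * Math.log10(rms + 1e-10); // dBFS
    double zcr = (double) crossings / audio.length;
    double centroid = (sumAbs > 0)
            ? (sampleRate / (2 * Math.PI)) * (sumAbsDiff / sumAbs)
            : 0;
    return new AudioMetrics(energy, zcr, centroid);
}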
The decision logic
The three metrics combine in a specific way. Energy is a necessary condition — if the chunk is quiet, it is not speech regardless of its frequency characteristics. But energy alone is not sufficient. The chunk must also have frequency characteristics consistent with speech, which means either a low zero-crossing rate or a low spectral centroid:
boolean isVoice(AudioCalibration cal, AudioMetrics m) {
    return m.energy() > cal.energyThreshold()
            && (m.zcr() < cal.zcrThreshold()
                || m.spectralCentroid() < cal.spectralCentroidThreshold());
}
The OR between ZCR and spectral centroid is deliberate. Some speech segments — particularly unvoiced consonants like "s" or "sh" — have a high zero-crossing rate but still a relatively low spectral centroid. Requiring both conditions to be met would reject these segments. Requiring either one preserves sensitivity to the full range of speech sounds while still rejecting broadband noise.
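Wired together, the per-chunk flow is short. A hypothetical driver loop, where readChunk and the segment handling are stand-ins rather than the real pipeline code:

// Hypothetical driver loop tying the pieces together.
// readChunk() is a stand-in for the audio source.
void run(AudioCalibration cal) {
    float[] chunk;
    while ((chunk = readChunk()) != null) { // one 160-sample, 20ms chunk
        AudioMetrics m = computeMetrics(chunk, SAMPLE_RATE);
        if (isVoice(cal, m)) {
            // append the chunk to the open speech segment
        } else {
            // close any open segment and hand it to transcription
        }
    }
}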
[Figure: comparison of audio metrics for speech versus common noise sources, showing why energy alone cannot distinguish them. Four panels: (1) energy in dBFS, where keyboard clicks and door slams both exceed the calibrated threshold, so energy alone yields false positives; (2) zero-crossing rate, where keyboard clicks cross zero rapidly and speech does not; (3) spectral centroid, where speech energy concentrates in lower frequencies while impulsive noise spreads wider; (4) the combined rule, energy > threshold AND (zcr < zcrThreshold OR spectralCentroid < scThreshold), which produces no false positives.]

What this solved and what it did not
The three-metric approach eliminated the keyboard clicking problem entirely. Door slams, which are brief broadband transients with a high ZCR, were consistently rejected. Traffic noise — sustained but spectrally broad — was caught by the spectral centroid threshold.
There was a category of noise that still got through: sounds with both high energy and frequency characteristics overlapping with speech. The most persistent example was music playing in the background. Music has pitch, it has concentrated spectral energy, it has moderate zero-crossing rates. This was uncommon enough in my telephony use case that I chose not to address it at this stage, but it is a known limitation.
The other remaining issue was threshold sensitivity. The thresholds for all three metrics depended on the acoustic environment. A threshold that worked in one room failed in another. Hardcoding values meant the detector either missed quiet speech in noisy environments or triggered on ambient sound in quiet ones.
That problem — making the thresholds adapt to the environment — is what auto-calibration solves. The next post covers how running averages during an initial silence window provide the baseline, and how the margins above that baseline are tuned to balance sensitivity against false positive rate.