
Fine-Tuning Whisper Large v3 for Serbian Speech Workflows

A practical summary of adapting Whisper Large v3 for Serbian audio pipelines, focusing on dataset preparation, evaluation, and production tradeoffs.

Published 2024-04-01 · 9 min read · RSA Team

Why Fine-Tune for Serbian?

Whisper Large v3 performs remarkably well on major European languages, but Serbian presents specific challenges: morphological complexity, a relatively limited presence in the training corpus, and regional pronunciation variation across Ekavian and Ijekavian dialects. Off-the-shelf transcription accuracy was acceptable for general content but insufficient for domain-specific audio — legal, medical, and government recordings where word-level precision matters.

Fine-tuning gave us a way to close that gap without building from scratch.

Dataset Preparation Is the Real Work

The most critical insight from this project: model configuration matters far less than data quality. We spent roughly 70% of total project time on dataset preparation:

  • Sourcing diverse audio: broadcast media, interview recordings, read speech corpora
  • Manual transcription validation for ambiguous segments
  • Normalization pipeline: punctuation handling, number formatting, dialect normalization decisions
  • Deduplication and train/eval split strategy

We found that a smaller, carefully curated dataset consistently outperformed a larger dataset with noisy labels. 200 hours of clean, diverse audio beat 500 hours of mixed-quality recordings in every evaluation run.
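The normalization and deduplication steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the `normalize_transcript` and `dedupe` helpers are hypothetical, and real Serbian normalization (number spelling, dialect mapping, punctuation policy) involves many project-specific decisions that are omitted here.

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Toy transcript normalization: Unicode NFC, lowercase,
    strip punctuation, collapse whitespace. Dialect and number
    normalization are deliberately left out of this sketch."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation, keep letters/digits
    return re.sub(r"\s+", " ", text).strip()

def dedupe(pairs):
    """Drop pairs whose transcripts are duplicates after normalization,
    keeping the first occurrence of each."""
    seen, out = set(), []
    for audio_id, transcript in pairs:
        key = normalize_transcript(transcript)
        if key not in seen:
            seen.add(key)
            out.append((audio_id, transcript))
    return out
```

Near-duplicate detection (same sentence read by different speakers, or the same broadcast segment reused) needs fuzzier matching than exact keys, but even this exact-match pass catches a surprising amount of corpus reuse.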

Training Strategy: Conservative and Iterative

We kept the training setup deliberate:

  • Frozen encoder layers for the first training phase (only fine-tuning the decoder)
  • Small learning rate with warmup: ramping from 1e-5 to a peak of 5e-5
  • Gradient accumulation to handle memory constraints on available hardware
  • Early stopping based on validation WER (Word Error Rate)

The key decision was not to over-train. Models trained beyond the point of validation WER improvement showed worse real-world performance — they memorized the training distribution rather than generalizing to live audio conditions.
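Two of the pieces above, the warmup schedule and WER-based early stopping, are simple enough to sketch directly. The functions below are illustrative stand-ins, not the training code used in the project; the 1e-5 and 5e-5 values come from the post, while the warmup-step count and patience are arbitrary placeholders.

```python
def lr_at_step(step: int, warmup_steps: int = 500,
               base_lr: float = 1e-5, peak_lr: float = 5e-5) -> float:
    """Linear warmup from base_lr to peak_lr; constant afterwards
    (any decay phase is omitted from this sketch)."""
    if step < warmup_steps:
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    return peak_lr

class EarlyStopper:
    """Stop training once validation WER fails to improve
    for `patience` consecutive evaluations."""
    def __init__(self, patience: int = 3):
        self.best = float("inf")
        self.bad = 0
        self.patience = patience

    def update(self, val_wer: float) -> bool:
        """Record one validation WER; return True when training should stop."""
        if val_wer < self.best:
            self.best, self.bad = val_wer, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```

Freezing the encoder for the first phase is a one-liner in most frameworks (setting `requires_grad = False` on the encoder's parameters); the scheduling and stopping logic above is where the "conservative" part of the strategy actually lives.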

Evaluation in Real Context

Benchmark WER on a held-out test set is necessary but not sufficient. We ran parallel evaluations on:

  • Clean studio recordings
  • Telephone-quality audio (8kHz, compressed)
  • Noisy environments (background noise, multiple speakers)
  • Domain-specific vocabulary density tests

The gap between benchmark WER and real-world performance was significant in the telephone and noisy categories. This informed preprocessing choices: we added a normalization pass on input audio before inference to reduce that gap in production.
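Per-condition evaluation like the matrix above can be implemented by tagging each test sample with its condition and aggregating WER per tag. The sketch below computes word-level WER from scratch via edit distance; in practice a library such as `jiwer` would normally be used, and the condition labels here are just the categories from this post.

```python
from collections import defaultdict

def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over whitespace tokens,
    divided by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / max(len(r), 1)

def wer_by_condition(samples):
    """samples: iterable of (condition, reference, hypothesis).
    Returns {condition: corpus-level WER}, weighting by reference length."""
    totals = defaultdict(lambda: [0.0, 0])  # condition -> [errors, ref words]
    for cond, ref, hyp in samples:
        n = max(len(ref.split()), 1)
        totals[cond][0] += wer(ref, hyp) * n
        totals[cond][1] += n
    return {c: e / max(w, 1) for c, (e, w) in totals.items()}
```

Reporting one number per condition, rather than a single pooled WER, is what surfaced the telephone and noisy-audio gap in the first place.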

Production Considerations

Deploying a fine-tuned Whisper model raises several operational questions:

  • Preprocessing: Apply gain normalization and silence trimming before inference. Inconsistent input levels are the single biggest source of degraded accuracy in production.
  • Post-processing: A domain-specific vocabulary correction layer (simple dictionary lookups for known proper nouns and technical terms) adds meaningful accuracy at near-zero cost.
  • Monitoring: Track per-segment confidence scores and flag low-confidence outputs for human review. The model is right to be uncertain when it says it is.
  • Latency: Large v3 is not fast. For real-time applications, consider a distilled or quantized variant. For async workflows, Large v3 accuracy at acceptable throughput is achievable with batching.
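The vocabulary correction layer is genuinely as simple as the description suggests. Below is a minimal sketch; the `CORRECTIONS` entries are made-up examples, not the project's actual lexicon, and a production version would handle casing and inflected forms more carefully.

```python
import re

# Hypothetical lexicon: known mis-transcriptions -> canonical spellings.
CORRECTIONS = {
    "vajber": "Viber",
    "vots ap": "WhatsApp",
}

def correct_vocab(transcript: str, lexicon=None) -> str:
    """Replace known mis-transcriptions of proper nouns and technical
    terms via whole-word dictionary lookup. Longer keys are applied
    first so multi-word entries win over their substrings."""
    lexicon = CORRECTIONS if lexicon is None else lexicon
    out = transcript
    for wrong in sorted(lexicon, key=len, reverse=True):
        out = re.sub(rf"\b{re.escape(wrong)}\b", lexicon[wrong],
                     out, flags=re.IGNORECASE)
    return out
```

Because it runs entirely on the decoded text, this layer can be updated (new clients, new product names) without touching the model, which is most of its value in practice.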

Conclusion

The best results came from iterative experiments with realistic audio samples. No single training decision made the difference — it was the combination of clean data, conservative training, domain-aware evaluation, and production-ready preprocessing that produced a system worth deploying.

Fine-tuning Whisper for a lower-resource language is tractable. The investment is primarily in data curation, not in model complexity.