Fine-Tuning Whisper Large v3 for Serbian Speech Workflows
A practical summary of adapting Whisper Large v3 for Serbian audio pipelines, focusing on dataset preparation, evaluation, and production tradeoffs.
Why fine-tune for Serbian
Out-of-the-box speech models can perform well for general transcription, but production quality usually depends on the domain and the language variety in your data. For Serbian, the biggest gains often come from improving consistency around terminology, accents, and noisy real-world recordings.

Data preparation matters more than model tweaks
Before training, we treated data preparation as the main quality lever:
- normalized punctuation and casing rules
- aligned audio segments to reduce long-context drift
- filtered low-confidence or mismatched transcript pairs
- kept a held-out validation set from realistic recordings
This step improved training stability and made evaluation results easier to trust.
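The normalization and filtering steps above can be sketched as a simple heuristic pass. The thresholds and the characters-per-second heuristic below are illustrative assumptions, not values from our pipeline:

```python
import re
import unicodedata

def normalize_transcript(text):
    # Unicode-normalize, lowercase, and collapse punctuation/whitespace
    # so reference text is consistent across annotators.
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[„“”\"'’‘]", "", text)   # strip quote variants
    text = re.sub(r"[^\w\s]", " ", text)     # drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()

def keep_pair(duration_s, transcript, min_cps=2.0, max_cps=30.0):
    # Reject audio/transcript pairs whose characters-per-second rate is
    # implausible -- a cheap proxy for mismatched or truncated transcripts.
    text = normalize_transcript(transcript)
    if not text or duration_s <= 0:
        return False
    cps = len(text) / duration_s
    return min_cps <= cps <= max_cps
```

Because `\w` matches Unicode letters in Python 3, the same normalization covers both Cyrillic and Latin Serbian text.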
Training strategy
We used a conservative fine-tuning setup rather than pushing for maximum adaptation in one run. Shorter experiments helped us compare changes quickly and spot regressions early. The focus was not just lowering loss, but improving word error rate on domain-relevant recordings.
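The metric we optimized for, word error rate, is word-level edit distance divided by reference length. A minimal self-contained sketch (the `evaluate` library's `wer` metric computes the same quantity):

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words (substitutions, insertions,
    # deletions), divided by the reference word count.
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```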
Transcribed training setup (from the original code screenshot)
```python
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
import evaluate

# The processor bundles the feature extractor (log-mel) and the tokenizer.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="Serbian", task="transcribe"
)
wer_metric = evaluate.load("wer")

# Common Voice 13, Serbian; resample audio to Whisper's expected 16 kHz.
dataset = load_dataset("mozilla-foundation/common_voice_13_0", "sr", split="train+validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_batch(batch):
    # Turn raw audio into log-mel input features and text into label ids.
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare_batch, remove_columns=dataset.column_names)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# Let generation_config drive language/task instead of forced decoder ids.
model.config.forced_decoder_ids = None
model.generation_config.language = "serbian"
model.generation_config.task = "transcribe"

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-sr-large-v3",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size of 16
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    logging_steps=50,
    predict_with_generate=True,      # decode during eval so WER can be computed
    generation_max_length=225,
    fp16=True,
    report_to=["tensorboard"],
)
```
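The transcription cuts off before the data collator and the `Seq2SeqTrainer` call. In the standard Whisper fine-tuning recipe, input features and labels are padded separately, with label padding set to -100 so the cross-entropy loss ignores those positions; the input side is a fixed-size log-mel tensor, so only the labels need ragged padding. A minimal sketch of that label-side step (names here are illustrative):

```python
def pad_label_batch(label_seqs, ignore_index=-100):
    # Pad variable-length token-id lists to a rectangle, using the index
    # that Seq2SeqTrainer's loss skips (-100) for the padded positions.
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]
```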
Evaluation in context
A useful lesson was that aggregate metrics can hide operational problems. We reviewed samples from different categories, including speaker variability, noisy environments, and domain-specific vocabulary. This revealed where the model still needed help, especially on uncommon names and abbreviations.
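One way to surface the problems hidden by an aggregate score is to pool errors per category rather than overall. A sketch, assuming each utterance has already been scored into (category, edit errors, reference word count) triples:

```python
from collections import defaultdict

def wer_by_category(scored_utterances):
    # Pool word-level errors within each category, then divide by that
    # category's total reference words -- a per-slice WER.
    totals = defaultdict(lambda: [0, 0])
    for category, errors, n_ref_words in scored_utterances:
        totals[category][0] += errors
        totals[category][1] += n_ref_words
    return {cat: errs / words for cat, (errs, words) in totals.items()}
```

Pooling errors before dividing (rather than averaging per-utterance WERs) keeps short utterances from dominating a category's score.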
Training curve and checkpoints

| Training Loss | Epoch | Step | Validation Loss | WER (%) |
|---|---|---|---|---|
| 0.0498 | 4.81 | 1000 | 0.2004 | 20.1799 |
| 0.0042 | 9.62 | 2000 | 0.3225 | 18.2395 |
| 0.0001 | 14.42 | 3000 | 0.3799 | 17.2694 |
| 0.0001 | 19.23 | 4000 | 0.3961 | 17.2694 |
Production considerations
Fine-tuning only solves part of the problem. The rest comes from the pipeline around the model:
- preprocessing for audio quality and segmentation
- post-processing for formatting and terminology normalization
- monitoring to catch drift as new audio sources appear
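The terminology-normalization step can be as simple as a replacement table applied to the raw hypothesis, matching longest keys first so multi-word terms win. The mappings below are illustrative placeholders, not our production dictionary:

```python
import re

def normalize_terms(text, term_map):
    # Replace known mis-transcriptions with canonical domain spellings.
    # Keys are sorted longest-first so multi-word terms beat substrings.
    keys = sorted(term_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: term_map[m.group(0).lower()], text)

# Illustrative mapping: lowercase observed form -> canonical spelling.
TERMS = {"vi sper": "Whisper", "beograd": "Beograd"}
```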
Takeaway
For Serbian transcription workflows, Whisper Large v3 fine-tuning can deliver meaningful improvements when paired with disciplined data preparation and realistic evaluation. The best results came from iterative experiments tied to production-style samples, not benchmark-only optimization.