Apr 1, 2024 · 9 min read


Fine-Tuning Whisper Large v3 for Serbian Speech Workflows

A practical summary of adapting Whisper Large v3 for Serbian audio pipelines, focusing on dataset preparation, evaluation, and production tradeoffs.

AI
Speech
ML Ops
Notes & takeaways

Why fine-tune for Serbian

Out-of-the-box speech models can perform well for general transcription, but production quality usually depends on the domain and the language variety in your data. For Serbian, the biggest gains often come from improving consistency around terminology, accents, and noisy real-world recordings.


Data preparation matters more than model tweaks

Before training, we treated data preparation as the main quality lever:

  • normalized punctuation and casing rules
  • aligned audio segments to reduce long-context drift
  • filtered low-confidence or mismatched transcript pairs
  • kept a held-out validation set from realistic recordings

This step improved training stability and made evaluation results easier to trust.

Training strategy

We used a conservative fine-tuning setup rather than pushing for maximum adaptation in one run. Shorter experiments helped us compare changes quickly and spot regressions early. The focus was not just lowering loss, but improving word error rate on domain-relevant recordings.

Training setup (transcribed from the original code screenshot):

from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
import evaluate

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3", language="Serbian", task="transcribe")
wer_metric = evaluate.load("wer")

dataset = load_dataset("mozilla-foundation/common_voice_13_0", "sr", split="train+validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_batch(batch):
    audio = batch["audio"]
    # Log-mel input features from the raw waveform
    batch["input_features"] = processor.feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
    ).input_features[0]
    # Target token ids for the decoder
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare_batch, remove_columns=dataset.column_names)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# Clear legacy forced decoder ids; language/task are set via generation_config instead
model.config.forced_decoder_ids = None
model.generation_config.language = "serbian"
model.generation_config.task = "transcribe"

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-sr-large-v3",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    logging_steps=50,
    predict_with_generate=True,
    generation_max_length=225,
    fp16=True,
    report_to=["tensorboard"],
)
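The snippet above stops at the training arguments. To launch a run, the standard Hugging Face Whisper fine-tuning recipe adds a padding data collator and a WER metric callback; the sketch below follows that recipe (the class and function names are from the common recipe, not from the original screenshot, and `compute_metrics` assumes the `processor` and `wer_metric` objects created in the setup above):

```python
import torch
from dataclasses import dataclass
from typing import Any

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pads audio features and token labels separately; padding positions in
    the labels are set to -100 so the loss ignores them."""
    processor: Any  # expects .feature_extractor and .tokenizer (a WhisperProcessor)

    def __call__(self, features):
        # Pad the log-mel input features to a uniform batch
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label token ids, then mask padding with -100
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        batch["labels"] = labels
        return batch

def compute_metrics(pred):
    # `processor` and `wer_metric` come from the setup snippet above
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}
```

These two pieces, together with the model, the training arguments, the prepared dataset, and a held-out eval split (the arguments request evaluation every 1000 steps, so an eval dataset is required), are passed to `Seq2SeqTrainer`, and `trainer.train()` starts the run.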

Evaluation in context

A useful lesson was that aggregate metrics can hide operational problems. We reviewed samples from different categories, including speaker variability, noisy environments, and domain-specific vocabulary. This revealed where the model still needed help, especially on uncommon names and abbreviations.
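One way to make that per-category review quantitative is to score each bucket separately instead of reporting a single aggregate WER. A minimal sketch with a plain word-level Levenshtein implementation (the sample sentences and category names are illustrative only):

```python
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)

# Slice results by category instead of one aggregate number
samples = [
    ("noisy", "dobar dan svima", "dobar dan svima"),
    ("noisy", "hvala lepo", "fala lepo"),
    ("names", "gospodin Petrović", "gospodin Petrovic"),
]
by_category = defaultdict(list)
for category, ref, hyp in samples:
    by_category[category].append(wer(ref, hyp))
per_category_wer = {c: sum(v) / len(v) for c, v in by_category.items()}
```

A breakdown like this is what surfaces the uncommon-names and abbreviation weaknesses that an aggregate score averages away.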

Training curve and checkpoints

Training loss curve from the Whisper fine-tuning run

Training Loss   Epoch   Step   Validation Loss   WER
0.0498          4.81    1000   0.2004            20.1799
0.0042          9.62    2000   0.3225            18.2395
0.0001          14.42   3000   0.3799            17.2694
0.0001          19.23   4000   0.3961            17.2694

Production considerations

Fine-tuning only solves part of the problem. The rest comes from the pipeline around the model:

  • preprocessing for audio quality and segmentation
  • post-processing for formatting and terminology normalization
  • monitoring to catch drift as new audio sources appear
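For the terminology-normalization step, a glossary-based post-processor is often enough. A minimal sketch, with a hypothetical glossary (the entries below are examples, not the project's actual term list):

```python
import re

# Hypothetical glossary: canonical spellings for terms the model
# tends to transcribe inconsistently
GLOSSARY = {
    "petrovic": "Petrović",
    "rsa tim": "RSA Team",
}

def normalize_terminology(text: str) -> str:
    """Replace known term variants with their canonical form
    (case-insensitive, whole-word matches only)."""
    for variant, canonical in GLOSSARY.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text, flags=re.IGNORECASE)
    return text
```

Keeping the glossary in data rather than code also means new terminology can be rolled out without retraining or redeploying the model.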

Takeaway

For Serbian transcription workflows, Whisper Large v3 fine-tuning can deliver meaningful improvements when paired with disciplined data preparation and realistic evaluation. The best results came from iterative experiments tied to production-style samples, not benchmark-only optimization.

Need help with a similar problem?

Talk to RSA Team about applied AI and delivery work.