Framing the Problem
The project requirement was straightforward on the surface: extract specific data fields from scanned document images. In practice, this meant building a two-stage pipeline — first, locate regions of interest (ROIs) in the document image using an object detector, then pass those regions to an OCR engine for text extraction.
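At a high level, the two stages can be sketched as follows. This is a minimal illustration, not the production code: `detect` and `ocr` are hypothetical stand-ins for the detector and OCR engine, and the image is assumed to be row-indexable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class Extraction:
    field_class: str
    box: Box
    text: str

def run_pipeline(image, detect: Callable, ocr: Callable) -> List[Extraction]:
    """Stage 1: locate ROIs; stage 2: OCR each cropped region."""
    results = []
    for field_class, (x1, y1, x2, y2) in detect(image):
        # Crop the ROI out of the page image (row-major slicing).
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append(Extraction(field_class, (x1, y1, x2, y2), ocr(crop)))
    return results
```

Everything that follows in this post is about making each of these two calls, and the handoff between them, behave well in production.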
The temptation in this type of project is to treat model selection as the primary decision. It isn't.
The Detector as Part of a Larger System
YOLOv3 was chosen not because it's the most accurate detector available, but because its speed/accuracy tradeoff fit the operational constraints: document volume, available hardware, and latency requirements. A slower, more accurate detector would have been worse in practice because it would have created a bottleneck that changed how the system was used.
The key insight: the detector's job is to give the OCR engine a clean, well-cropped input. The detector doesn't need perfect recall across all possible document layouts — it needs high precision on the specific classes that matter, with clean bounding boxes that don't clip text.
Labeling Quality and Class Strategy
We labeled training data for five document classes: header blocks, date fields, reference number fields, signature blocks, and table rows. The labeling strategy mattered more than the raw number of labeled examples.
Lessons from the labeling process:
- Tight bounding boxes win. Loose boxes that include surrounding whitespace degrade OCR accuracy by introducing irrelevant visual context.
- Class consolidation helped. We started with twelve fine-grained classes and merged seven of them after finding that the distinctions weren't meaningful for downstream extraction.
- Negative examples matter. Including images with no relevant fields reduced the false-positive rate significantly.
We used 1,400 labeled documents for training, 200 for validation. Augmentation (rotation within ±5°, brightness variation, slight blur) improved real-world performance noticeably — scanned documents vary more than you expect.
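A sketch of the augmentation scheme, assuming Pillow as the image library (the specific brightness and blur ranges below are illustrative, not the exact values we used; only the ±5° rotation range comes from the text above):

```python
import random
from dataclasses import dataclass

@dataclass
class AugParams:
    angle_deg: float    # rotation within the +/-5 degree range
    brightness: float   # multiplicative brightness factor
    blur_radius: float  # slight Gaussian blur

def sample_aug(rng: random.Random) -> AugParams:
    """Draw one random augmentation per training image."""
    return AugParams(
        angle_deg=rng.uniform(-5.0, 5.0),
        brightness=rng.uniform(0.8, 1.2),   # assumed range; tune per corpus
        blur_radius=rng.uniform(0.0, 1.0),  # assumed range; keep it slight
    )

def apply_aug(img, p: AugParams):
    """Apply sampled parameters with Pillow (assumed dependency)."""
    from PIL import ImageEnhance, ImageFilter
    img = img.rotate(p.angle_deg, expand=True, fillcolor="white")
    img = ImageEnhance.Brightness(img).enhance(p.brightness)
    return img.filter(ImageFilter.GaussianBlur(p.blur_radius))
```

Sampling fresh parameters per epoch rather than precomputing augmented copies keeps the effective dataset size larger for the same disk footprint.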
OCR Handoff and Post-Processing
Detection outputs a bounding box. OCR inputs a cropped image. The gap between these two steps is where most production issues originate.
Padding the crop: We added 4–6px of padding around each detected bounding box before passing it to OCR. This prevents text clipping at the edges, which causes character-level errors that are hard to catch downstream.

Deskewing: Even slight rotation in the cropped image (1–2°) degrades OCR accuracy on printed text. We added a fast deskew step using a Hough transform before each OCR call.

Engine choice and tuning: Tesseract with --psm 7 (single text line) worked better than --psm 6 (uniform block) for our field-level crops. The OCR engine tuning was worth more time than we expected.
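The padding and deskew steps can be sketched like this. `pad_box` is plain Python; `deskew_angle` assumes OpenCV and NumPy are available, and the Canny/Hough thresholds are illustrative values, not our tuned ones:

```python
def pad_box(box, pad, img_w, img_h):
    """Expand a detection box by `pad` pixels per side, clamped to image bounds.
    A 4-6 px margin keeps OCR from clipping edge characters."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))

def deskew_angle(gray):
    """Estimate skew (degrees) of a cropped field via the Hough transform.
    Sketch only; assumes text lines dominate the crop."""
    import cv2
    import numpy as np
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180.0, threshold=50)
    if lines is None:
        return 0.0
    # theta is measured from the vertical axis; near-horizontal text lines
    # have theta close to 90 degrees, so the skew is theta - 90.
    angles = [np.degrees(theta) - 90.0 for _, theta in lines[:, 0]]
    near = [a for a in angles if abs(a) < 10.0]  # keep near-horizontal lines
    return float(np.median(near)) if near else 0.0
```

The padded, deskewed crop then goes to Tesseract with the field-level configuration, e.g. `pytesseract.image_to_string(crop, config="--psm 7")`.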
End-to-End Quality Measurement
We made an early decision to measure extraction quality end-to-end, not just detector mAP.
The reason: a detector with 92% mAP but poor bounding box tightness can produce worse final extraction accuracy than a detector with 87% mAP and cleaner boxes. Detection metrics don't capture everything that matters for downstream quality.
We built a simple evaluation harness: ground-truth extracted values (manually verified) compared against pipeline outputs for 500 documents. This gave us a real-world extraction accuracy number that drove all subsequent improvement work.
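The core of such a harness is small. A minimal sketch, assuming each document's ground truth and pipeline output are dicts mapping field name to extracted value (the normalization rule here is an illustrative choice, not necessarily the one we shipped):

```python
from typing import Dict, List

def extraction_accuracy(truth: List[Dict[str, str]],
                        predicted: List[Dict[str, str]]) -> float:
    """Field-level exact-match accuracy across a document set."""
    correct = total = 0
    for gt, pred in zip(truth, predicted):
        for field, value in gt.items():
            total += 1
            # Normalize whitespace/case so formatting noise isn't counted
            # as an extraction error.
            if pred.get(field, "").strip().lower() == value.strip().lower():
                correct += 1
    return correct / total if total else 0.0
```

Because this scores the pipeline's final output, it automatically penalizes every failure mode — missed detections, loose boxes, OCR misreads — in proportion to their real downstream cost.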
ROI Quality Is the Leverage Point
When we looked at extraction errors, the distribution was:
- 55% caused by ROI quality issues (bad crop, clipped text, skew)
- 25% caused by OCR engine errors on ambiguous characters
- 15% caused by missing detections
- 5% caused by post-processing logic errors
This breakdown shifted our priorities entirely. Improving ROI quality — tighter boxes, better padding, consistent deskewing — produced the largest accuracy gains with the least model retraining.
Conclusion
Building a detection-plus-OCR pipeline that actually works in production is an integration problem, not just a model problem. The detector, the preprocessing, the crop strategy, the OCR configuration, and the post-processing validation all contribute to final accuracy.
ROI quality and end-to-end validation often matter more than the choice of detector architecture. Measure what the system actually produces, not just what the model technically achieves.
