Framing the Problem
The project requirement was straightforward on the surface: extract specific data fields from scanned document images. In practice, this meant building a two-stage pipeline — first, locate regions of interest (ROIs) in the document image using an object detector, then pass those regions to an OCR engine for text extraction.
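At a high level, the two stages can be sketched as follows. This is a minimal illustration, not the production code: `detect` and `ocr` are hypothetical stand-ins for the detector and OCR engine, and the image is assumed to be row-indexable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class Extraction:
    field_class: str
    box: Box
    text: str

def run_pipeline(image, detect: Callable, ocr: Callable) -> List[Extraction]:
    """Stage 1: locate ROIs; stage 2: OCR each cropped region."""
    results = []
    for field_class, (x1, y1, x2, y2) in detect(image):
        # Crop the ROI out of the page image (row-major slicing).
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append(Extraction(field_class, (x1, y1, x2, y2), ocr(crop)))
    return results
```

Everything that follows in this post is about making each of these two calls, and the handoff between them, behave well in production.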
The temptation in this type of project is to treat model selection as the primary decision. It isn't.
The Detector as Part of a Larger System
YOLOv3 was chosen not because it's the most accurate detector available, but because its speed/accuracy tradeoff fit the operational constraints: document volume, available hardware, and latency requirements. A slower, more accurate detector would have been worse in practice because it would have created a bottleneck that changed how the system was used.
The key insight: the detector's job is to give the OCR engine a clean, well-cropped input. The detector doesn't need perfect recall across all possible document layouts — it needs high precision on the specific classes that matter, with clean bounding boxes that don't clip text.
Labeling Quality and Class Strategy
We labeled training data for five document classes: header blocks, date fields, reference number fields, signature blocks, and table rows. The labeling strategy mattered more than the raw number of labeled examples.
Lessons from the labeling process:
- Tight bounding boxes win. Loose boxes that include surrounding whitespace degrade OCR accuracy by introducing irrelevant visual context.
- Class consolidation helped. We started with twelve fine-grained classes and merged seven of them after finding that the distinctions weren't meaningful for downstream extraction.
- Negative examples matter. Including images with no relevant fields reduced the false-positive rate significantly.
We used 1,400 labeled documents for training, 200 for validation. Augmentation (rotation within ±5°, brightness variation, slight blur) improved real-world performance noticeably — scanned documents vary more than you expect.
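A sketch of the augmentation scheme, assuming Pillow as the image library (the specific brightness and blur ranges below are illustrative, not the exact values we used; only the ±5° rotation range comes from the text above):

```python
import random
from dataclasses import dataclass

@dataclass
class AugParams:
    angle_deg: float    # rotation within the +/-5 degree range
    brightness: float   # multiplicative brightness factor
    blur_radius: float  # slight Gaussian blur

def sample_aug(rng: random.Random) -> AugParams:
    """Draw one random augmentation per training image."""
    return AugParams(
        angle_deg=rng.uniform(-5.0, 5.0),
        brightness=rng.uniform(0.8, 1.2),   # assumed range; tune per corpus
        blur_radius=rng.uniform(0.0, 1.0),  # assumed range; keep it slight
    )

def apply_aug(img, p: AugParams):
    """Apply sampled parameters with Pillow (assumed dependency)."""
    from PIL import ImageEnhance, ImageFilter
    img = img.rotate(p.angle_deg, expand=True, fillcolor="white")
    img = ImageEnhance.Brightness(img).enhance(p.brightness)
    return img.filter(ImageFilter.GaussianBlur(p.blur_radius))
```

Sampling fresh parameters per epoch rather than precomputing augmented copies keeps the effective dataset size larger for the same disk footprint.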
OCR Handoff and Post-Processing
Detection outputs a bounding box. OCR inputs a cropped image. The gap between these two steps is where most production issues originate.
Padding the crop: We added 4–6px of padding around each detected bounding box before passing it to OCR. This prevents text clipping at the edges, which causes character-level errors that are hard to catch downstream.

Deskewing: Even slight rotation in the cropped image (1–2°) degrades OCR accuracy on printed text. We added a fast deskew step using a Hough transform before each OCR call.

Engine choice and tuning: Tesseract with --psm 7 (single text line) worked better than --psm 6 (uniform block) for our field-level crops. The OCR engine tuning was worth more time than we expected.
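The padding and deskew steps can be sketched like this. `pad_box` is plain Python; `deskew_angle` assumes OpenCV and NumPy are available, and the Canny/Hough thresholds are illustrative values, not our tuned ones:

```python
def pad_box(box, pad, img_w, img_h):
    """Expand a detection box by `pad` pixels per side, clamped to image bounds.
    A 4-6 px margin keeps OCR from clipping edge characters."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - pad), max(0, y1 - pad),
            min(img_w, x2 + pad), min(img_h, y2 + pad))

def deskew_angle(gray):
    """Estimate skew (degrees) of a cropped field via the Hough transform.
    Sketch only; assumes text lines dominate the crop."""
    import cv2
    import numpy as np
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180.0, threshold=50)
    if lines is None:
        return 0.0
    # theta is measured from the vertical axis; near-horizontal text lines
    # have theta close to 90 degrees, so the skew is theta - 90.
    angles = [np.degrees(theta) - 90.0 for _, theta in lines[:, 0]]
    near = [a for a in angles if abs(a) < 10.0]  # keep near-horizontal lines
    return float(np.median(near)) if near else 0.0
```

The padded, deskewed crop then goes to Tesseract with the field-level configuration, e.g. `pytesseract.image_to_string(crop, config="--psm 7")`.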
End-to-End Quality Measurement
We made an early decision to measure extraction quality end-to-end, not just detector mAP.
The reason: a detector with 92% mAP but poor bounding box tightness can produce worse final extraction accuracy than a detector with 87% mAP and cleaner boxes. Detection metrics don't capture everything that matters for downstream quality.
We built a simple evaluation harness: ground-truth extracted values (manually verified) compared against pipeline outputs for 500 documents. This gave us a real-world extraction accuracy number that drove all subsequent improvement work.
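The core of such a harness is small. A minimal sketch, assuming each document's ground truth and pipeline output are dicts mapping field name to extracted value (the normalization rule here is an illustrative choice, not necessarily the one we shipped):

```python
from typing import Dict, List

def extraction_accuracy(truth: List[Dict[str, str]],
                        predicted: List[Dict[str, str]]) -> float:
    """Field-level exact-match accuracy across a document set."""
    correct = total = 0
    for gt, pred in zip(truth, predicted):
        for field, value in gt.items():
            total += 1
            # Normalize whitespace/case so formatting noise isn't counted
            # as an extraction error.
            if pred.get(field, "").strip().lower() == value.strip().lower():
                correct += 1
    return correct / total if total else 0.0
```

Because this scores the pipeline's final output, it automatically penalizes every failure mode — missed detections, loose boxes, OCR misreads — in proportion to their real downstream cost.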
ROI Quality Is the Leverage Point
When we looked at extraction errors, the distribution was:
- 55% caused by ROI quality issues (bad crop, clipped text, skew)
- 25% caused by OCR engine errors on ambiguous characters
- 15% caused by missing detections
- 5% caused by post-processing logic errors
This breakdown shifted our priorities entirely. Improving ROI quality — tighter boxes, better padding, consistent deskewing — produced the largest accuracy gains with the least model retraining.
Conclusion
Building a detection-plus-OCR pipeline that actually works in production is an integration problem, not just a model problem. The detector, the preprocessing, the crop strategy, the OCR configuration, and the post-processing validation all contribute to final accuracy.
ROI quality and end-to-end validation often matter more than the choice of detector architecture. Measure what the system actually produces, not just what the model technically achieves.
