Distributed Urdu OCR

Restore and digitize degraded Urdu records at national scale.

A deep-learning pipeline combines SMP U-Net restoration with Conv-Transformer OCR, then executes inference over distributed Hadoop workers to process large document batches in parallel.

Automated restoration and digitization of historical and official Urdu documents

Total records

111,143

Final CER

8.14%

Final WER

18.33%

Restoration quality

PSNR 34.52 dB / SSIM 0.9932

Section 2

Data + Preprocessing

The 111,143 records combine printed Nastaleeq and handwritten Urdu text. Each input is standardized to a 1 × 128 × 2048 grayscale tensor (channels × height × width) before restoration and OCR.

Dataset composition

MMU-OCR-21 (Printed Nastaleeq): 100,541
NUST-UHWR (Handwritten Urdu): 10,602

Train / Val / Test split

Train: 77,800 (70%)
Validation: 16,671 (15%)
Test: 16,672 (15%)
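
The split sizes follow from the dataset totals. The sketch below reproduces the arithmetic, assuming the train and validation fractions are truncated to whole records and the remainder goes to the test set. That rounding rule and the name `split_counts` are illustrative assumptions, not taken from the project.

```python
# Illustrative only: reproduces the reported 70/15/15 split sizes.
DATASETS = {"MMU-OCR-21": 100_541, "NUST-UHWR": 10_602}


def split_counts(total, train_frac=0.70, val_frac=0.15):
    """Truncate train/val to whole records; the test set takes the remainder (assumed rule)."""
    n_train = int(total * train_frac)
    n_val = int(total * val_frac)
    return n_train, n_val, total - n_train - n_val


total = sum(DATASETS.values())
train, val, test = split_counts(total)
print(total, train, val, test)
```

Giving the remainder to the test set is one way to make the three counts sum exactly to the total.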

Standardization pipeline

  1. Grayscale conversion
  2. Resize to fixed height: 128
  3. Aspect-ratio-preserving right padding to width: 2048
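
The three steps above can be sketched with NumPy as follows. This is a minimal illustration, not the project's code: the function name `standardize`, the nearest-neighbour resize, the white padding value, and the clamp for lines wider than 2048 pixels are all assumptions.

```python
import numpy as np

TARGET_H, TARGET_W = 128, 2048


def standardize(image: np.ndarray) -> np.ndarray:
    """Return a float32 array of shape (1, 128, 2048) with values in [0, 1]."""
    img = image.astype(np.float32)
    # 1. Grayscale conversion (BT.601 luma weights, assumed) for RGB input.
    if img.ndim == 3:
        img = img[..., :3] @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    h, w = img.shape
    # 2. Resize to the fixed height, scaling the width by the same factor.
    #    Lines that would exceed TARGET_W are clamped (assumed behaviour).
    new_w = max(1, min(TARGET_W, round(w * TARGET_H / h)))
    rows = np.minimum(np.arange(TARGET_H) * h // TARGET_H, h - 1)
    cols = np.minimum(np.arange(new_w) * w // new_w, w - 1)
    resized = img[rows][:, cols]  # nearest-neighbour sampling
    # 3. Right-pad with white (assumed background) up to the fixed width.
    canvas = np.full((TARGET_H, TARGET_W), 255.0, dtype=np.float32)
    canvas[:, :new_w] = resized
    return (canvas / 255.0)[None, ...]


tensor = standardize(np.zeros((64, 512, 3), dtype=np.uint8))
print(tensor.shape)
```

A production pipeline would more likely use a smoother interpolation (bilinear or area) from an imaging library. Nearest-neighbour is used here only to keep the sketch dependency-free beyond NumPy.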

Synthetic degradations used

Transformations applied to stress-test restoration robustness

Gaussian Blur, Gaussian Noise, Salt & Pepper, Low Contrast, Affine Skew, Padding Crop
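
Three of these degradations are simple enough to sketch in a few lines of NumPy. The function names and default strengths below are hypothetical, since the page does not list the parameters actually used. Blur, affine skew, and padding crop are omitted for brevity.

```python
import numpy as np


def gaussian_noise(img, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to an image in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def salt_and_pepper(img, amount=0.02, rng=None):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    mask = rng.random(img.shape)
    out[mask < amount / 2] = 0.0      # pepper
    out[mask > 1 - amount / 2] = 1.0  # salt
    return out


def low_contrast(img, factor=0.5):
    """Compress intensities toward mid-gray; factor=1.0 leaves the image unchanged."""
    return 0.5 + factor * (img - 0.5)


rng = np.random.default_rng(0)
clean = np.full((128, 2048), 0.5)
noisy = salt_and_pepper(gaussian_noise(low_contrast(clean), rng=rng), rng=rng)
```

Passing an explicit `rng` keeps the synthetic degradations reproducible across runs.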

Section 3

Deep Learning Pipeline

Noisy Input

Degraded scanned page

SMP U-Net Restoration

ResNet34 encoder

Restored Image

PSNR 34.52 dB / SSIM 0.9932

Conv-Transformer OCR

Urdu sequence decoding

Model block

Restoration block

SMP U-Net + ResNet34 encoder

Params 24,430,097 | MSE 0.000574 | PSNR 34.52 dB | SSIM 0.9932
  • ImageNet-pretrained encoder for stable low-level feature extraction.
  • Restores contrast and edge fidelity before OCR decoding.
  • Optimized for degraded archive and government document scans.
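
For readers relating the MSE and PSNR figures: PSNR is defined as 10·log10(MAX² / MSE). The sketch below evaluates that relation, assuming pixel values scaled to [0, 1]. The page does not state the pixel range or how the metrics were averaged. If PSNR was averaged per image, the reported value does not have to equal the PSNR computed from the mean MSE, and the second half of the example illustrates why with made-up numbers.

```python
import math


def psnr_from_mse(mse: float, max_value: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak pixel value."""
    return 10.0 * math.log10(max_value ** 2 / mse)


# PSNR implied by the reported mean MSE, assuming pixel values in [0, 1].
print(round(psnr_from_mse(0.000574), 2))

# Averaging per-image PSNR gives a value at least as high as the PSNR of the mean MSE.
per_image_mse = [0.0001, 0.0010]  # made-up values for illustration only
mean_of_psnr = sum(psnr_from_mse(m) for m in per_image_mse) / len(per_image_mse)
psnr_of_mean = psnr_from_mse(sum(per_image_mse) / len(per_image_mse))
print(mean_of_psnr >= psnr_of_mean)
```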
Model block

OCR block

Conv-Transformer sequence recognition

Params 7,337,197 | Vocab 173 | 7 CNN blocks | 3 Encoder / 3 Decoder | Cross-Entropy objective
  • CNN backbone encodes visual tokens from restored Urdu text lines.
  • d_model=256, nhead=8, feedforward=1024
  • Transformer decoder outputs character sequence without CTC.
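
Decoding without CTC means the decoder emits one character at a time, conditioned on the characters produced so far, until it predicts an end-of-sequence token. The sketch below shows a greedy version of that loop. The token ids `BOS` and `EOS`, the `max_len` limit, and the `next_token_scores` callable (a stand-in for the Transformer decoder) are hypothetical, and the page does not say whether greedy or beam search is used.

```python
from typing import Callable, List, Sequence

BOS, EOS = 1, 2  # hypothetical special-token ids


def greedy_decode(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    max_len: int = 256,
) -> List[int]:
    """Autoregressive greedy decoding: pick the best-scoring token at each step."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the BOS marker


def toy_scorer(prefix: Sequence[int]) -> List[float]:
    """Stand-in for the decoder: always continues the fixed sequence 5, 7, 9, EOS."""
    target = [5, 7, 9, EOS]
    scores = [0.0] * 10
    scores[target[len(prefix) - 1]] = 1.0
    return scores


decoded = greedy_decode(toy_scorer)
print(decoded)
```

In a real model, `next_token_scores` would run the decoder over the visual tokens and the current prefix and return logits over the vocabulary.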

Section 4

Results + Benchmarking

Final OCR test metrics

Loss

0.2075

CER

8.14%

WER

18.33%
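
CER and WER are conventionally computed as the Levenshtein edit distance between the reference and the prediction, divided by the reference length in characters or words. The sketch below shows that standard definition. The page does not show the project's own evaluation code, so details such as text normalization are not covered here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(1, len(ref_words))


print(cer("kitten", "sitting"), wer("a b c d", "a x c d"))
```

Because one wrong character makes the whole word wrong, WER is typically higher than CER on the same output, which matches the pattern in the metrics above.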

DL vs Tesseract benchmark

CER improvement: 58.29 pp | WER improvement: 82.83 pp
  • Tesseract: CER 62.04%, WER 93.15%
  • DL model: CER 3.75%, WER 10.32%
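
The percentage-point (pp) improvements are plain differences between the two systems' error rates. The short sketch below reproduces them and also derives the relative error reduction, which is an extra figure not reported on the page. The function name `improvement` is illustrative.

```python
def improvement(baseline: float, model: float):
    """Return (absolute gain in percentage points, relative error reduction in %)."""
    pp = round(baseline - model, 2)
    relative = round(100.0 * (baseline - model) / baseline, 2)
    return pp, relative


tesseract = {"CER": 62.04, "WER": 93.15}
dl_model = {"CER": 3.75, "WER": 10.32}

for metric in ("CER", "WER"):
    pp, rel = improvement(tesseract[metric], dl_model[metric])
    print(f"{metric}: {pp} pp absolute, {rel}% relative reduction")
```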

Error-rate comparison

Handwriting challenge note

The NUST-UHWR handwritten mini-sample remains harder, with CER 29.89% and WER 55.29%, which confirms handwriting as the most difficult regime at present.

Section 5

Distributed Inference Architecture

Execution flow

Uploaded batches enter HDFS, and MapReduce shards page groups across workers. Each worker runs the same preprocessing → restoration → OCR pipeline, and reducer stages merge the outputs into ordered Urdu text files.

Input batch → Mapper nodes → Parallel inference → Reducer → Output bundle
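
The flow above can be simulated in a single process to show the map and reduce roles. This is a conceptual sketch only. It does not use Hadoop APIs, and the names `shard`, `mapper`, `reducer`, and the `ocr` callable (a stand-in for preprocessing, restoration, and OCR) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Page = Tuple[str, int, bytes]  # (document id, page number, image bytes)


def shard(pages: List[Page], n_workers: int) -> List[List[Page]]:
    """Distribute pages round-robin across workers."""
    shards: List[List[Page]] = [[] for _ in range(n_workers)]
    for i, page in enumerate(pages):
        shards[i % n_workers].append(page)
    return shards


def mapper(pages: Iterable[Page], ocr: Callable[[bytes], str]) -> Iterator[Tuple[str, Tuple[int, str]]]:
    """Each worker runs the same model on its shard and emits keyed results."""
    for doc_id, page_no, image in pages:
        yield doc_id, (page_no, ocr(image))


def reducer(pairs: Iterable[Tuple[str, Tuple[int, str]]]) -> Dict[str, str]:
    """Group by document and join page texts in page order."""
    grouped = defaultdict(list)
    for doc_id, item in pairs:
        grouped[doc_id].append(item)
    return {doc: "\n".join(text for _, text in sorted(items)) for doc, items in grouped.items()}


pages = [("A", 2, b"second"), ("B", 1, b"only"), ("A", 1, b"first")]
fake_ocr = lambda image: image.decode().upper()  # placeholder for the real models
emitted = [pair for part in shard(pages, 2) for pair in mapper(part, fake_ocr)]
outputs = reducer(emitted)
print(outputs)
```

Keying mapper output by document id is what lets the reducer restore page order regardless of which worker processed each page.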

Tech stack

React/Next.js
FastAPI
PyTorch
Hadoop/HDFS/MapReduce
Docker
Runport

Graph composition from distributed architecture

Interactive demo simulator

Upload → Process → Review Urdu OCR Output

Input panel

Upload images

Line segmentation

Processing panel

Distributed scheduler (2 Hadoop data nodes)

Backend mode
Upload a zip to start the lifecycle: idle → uploading → processing → completed | failed.
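
The lifecycle reads naturally as a small state machine. The sketch below is one strict interpretation in which `failed` is reachable only from `processing`. A real backend might also allow failure during upload. The names `JobState`, `ALLOWED`, and `advance` are illustrative and not taken from the project.

```python
from enum import Enum


class JobState(Enum):
    IDLE = "idle"
    UPLOADING = "uploading"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions under a strict reading of the lifecycle (assumption).
ALLOWED = {
    JobState.IDLE: {JobState.UPLOADING},
    JobState.UPLOADING: {JobState.PROCESSING},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = JobState.IDLE
for nxt in (JobState.UPLOADING, JobState.PROCESSING, JobState.COMPLETED):
    state = advance(state, nxt)
print(state.value)
```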

Results panel

Output mapping and Urdu text viewer

Run a job to view 1:1 output mapping (A.pdf → A.txt), then inspect line-level OCR text.
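
The 1:1 mapping amounts to swapping each input's extension for `.txt`. The sketch below does this with `pathlib` and rejects batches where two inputs would collide on the same output name. The collision check and the function names are assumptions added for illustration.

```python
from pathlib import PurePath
from typing import Dict, Iterable


def output_name(input_name: str) -> str:
    """Map an input file name to its text output name, e.g. A.pdf -> A.txt."""
    return PurePath(input_name).with_suffix(".txt").name


def map_outputs(input_names: Iterable[str]) -> Dict[str, str]:
    """Build the input -> output mapping, refusing duplicate output names."""
    mapping: Dict[str, str] = {}
    seen = set()
    for name in input_names:
        out = output_name(name)
        if out in seen:
            raise ValueError(f"two inputs map to the same output: {out}")
        seen.add(out)
        mapping[name] = out
    return mapping


print(map_outputs(["A.pdf", "B.png"]))
```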