A deep-learning pipeline combines SMP U-Net restoration with Conv-Transformer OCR, then executes inference over distributed Hadoop workers to process large document batches in parallel.
Automated restoration and digitization of historical and official Urdu documents
Total records: 111,143
Final CER: 8.14%
Final WER: 18.33%
Restoration quality: PSNR 34.52 dB / SSIM 0.9932
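For readers unfamiliar with the restoration metric, PSNR is conventionally defined as 10 * log10(MAX^2 / MSE). The sketch below is a minimal NumPy version of that standard formula. It assumes pixel values scaled to [0, 1], which the page does not state, and the function name `psnr` is illustrative rather than taken from the project code. SSIM is not covered here.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (assumed 1.0 here).
    """
    reference = reference.astype(np.float64)
    restored = restored.astype(np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values mean the restored image is closer to the clean reference.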
Section 2
The dataset's 111,143 records combine printed Nastaleeq and handwritten Urdu samples. Inputs are standardized to 1 x 128 x 2048 grayscale tensors before restoration and OCR.
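One way the standardization step could look is sketched below, using NumPy only. It reads 1 x 128 x 2048 as channels x height x width, scales the image to a height of 128, and pads the width to 2048. The nearest-neighbour resampling, the white padding, the right-alignment (chosen because Urdu is written right to left), and the name `standardize` are all assumptions for illustration, not details confirmed by the page.

```python
import numpy as np

TARGET_H, TARGET_W = 128, 2048

def standardize(image: np.ndarray, pad_value: float = 1.0) -> np.ndarray:
    """Convert a 2-D uint8 grayscale image to a float32 array of shape (1, 128, 2048)."""
    h, w = image.shape
    scale = TARGET_H / h
    # Keep the aspect ratio, but never exceed the target width.
    new_w = min(TARGET_W, max(1, round(w * scale)))

    # Nearest-neighbour resampling via index lookup (assumed; any resizer would do).
    rows = np.minimum((np.arange(TARGET_H) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) * w / new_w).astype(int), w - 1)
    resized = image[rows][:, cols].astype(np.float32) / 255.0

    # White canvas, text right-aligned (assumption for right-to-left script).
    canvas = np.full((TARGET_H, TARGET_W), pad_value, dtype=np.float32)
    canvas[:, TARGET_W - new_w:] = resized
    return canvas[np.newaxis, ...]
```

A 64 x 512 scan, for example, is upscaled to 128 x 1024 and placed in the right half of the canvas.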
Train: 77,800 (70%)
Validation: 16,671 (15%)
Test: 16,672 (15%)
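The split counts can be reproduced with simple arithmetic. The sketch below assumes the train and validation sizes are rounded down and the test set takes the remainder. That rounding scheme is a guess, but it matches the reported 77,800 / 16,671 / 16,672.

```python
def split_sizes(total: int, train_frac: float = 0.70, val_frac: float = 0.15):
    """Return (train, validation, test) sizes for a dataset of `total` records."""
    train = int(total * train_frac)   # round down (assumed)
    val = int(total * val_frac)       # round down (assumed)
    test = total - train - val        # remainder goes to the test set
    return train, val, test

print(split_sizes(111_143))  # (77800, 16671, 16672)
```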
Applied transformations used to stress-test restoration robustness
Section 3
Noisy Input: degraded scanned page
SMP U-Net Restoration: ResNet34 encoder
Restored Image: PSNR 34.52 dB / SSIM 0.9932
Conv-Transformer OCR: Urdu sequence decoding
SMP U-Net + ResNet34 encoder
Conv-Transformer sequence recognition
Section 4
Loss: 0.2075
CER: 8.14%
WER: 18.33%
The UHWR handwritten mini-sample remains harder, with CER 29.89% and WER 55.29%, which confirms handwriting as the more difficult regime at present.
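CER and WER are conventionally computed as the Levenshtein edit distance between reference and prediction, divided by the reference length in characters or words. The sketch below follows that standard definition. The page does not say how the project tokenizes or normalizes Urdu text, so the whitespace word split and the function names here are assumptions.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word (whitespace split, assumed)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(1, len(ref_words))
```

Because a single wrong character makes the whole word wrong, WER is typically well above CER, which is consistent with the figures reported above.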
Section 5
Uploaded batches enter HDFS, MapReduce shards page groups across workers, each worker runs the same preprocessing → restoration → OCR block, and reducer stages merge outputs into ordered Urdu text files.
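The map and reduce flow described above can be illustrated with a single-process simulation. This is not Hadoop code: `run_block` is a hypothetical stand-in for the preprocessing, restoration, and OCR block, and the `(doc_id, page_no, page)` record layout is assumed for the example.

```python
from collections import defaultdict

def run_block(page: str) -> str:
    """Hypothetical placeholder for preprocessing -> restoration -> OCR."""
    return f"text:{page}"

def map_phase(shard):
    """Each worker emits (doc_id, (page_no, recognized_text)) for its pages."""
    for doc_id, page_no, page in shard:
        yield doc_id, (page_no, run_block(page))

def reduce_phase(values) -> str:
    """Merge one document's pages into text ordered by page number."""
    return "\n".join(text for _, text in sorted(values))

def run_job(pages, num_workers: int):
    # Shard the page list round-robin across workers (assumed strategy).
    shards = [pages[i::num_workers] for i in range(num_workers)]
    grouped = defaultdict(list)
    for shard in shards:                      # would run in parallel on a cluster
        for doc_id, value in map_phase(shard):
            grouped[doc_id].append(value)     # shuffle: group values by key
    return {doc_id: reduce_phase(vals) for doc_id, vals in grouped.items()}
```

Pages can reach the reducer in any order. Sorting by page number inside `reduce_phase` is what restores the ordered output.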
Graph composition from distributed architecture