How we built a custom vision LLM to improve document processing at Grab

6 min read | Saved February 14, 2026

vision-llm, document-processing, ocr, southeast-asia, machine-learning

Do you care about this?

Grab developed a specialized Vision LLM to improve document processing for eKYC in Southeast Asia. The project focused on raising OCR accuracy across the region's diverse languages and document formats, and ended with a lightweight model tailored to Grab's needs.

If you do, here's more

Grab developed a custom Vision LLM to improve document processing, particularly for eKYC tasks. The challenge stemmed from the variety of Southeast Asian languages and document formats, which traditional OCR systems struggled to handle. Proprietary LLMs were powerful, but they often could not interpret these languages accurately, and they were prone to errors and high latency. Open-source Vision LLMs showed promise but lacked the accuracy needed for production use. This led Grab to fine-tune existing models and, ultimately, to create a specialized Vision LLM for its own use case.

The base model was Qwen2-VL 2B, chosen for its efficient size and its better handling of Southeast Asian languages. Initial benchmarks nevertheless showed low accuracy, primarily because of limited language coverage. To address this, Grab generated a synthetic OCR dataset from a large online text corpus and used its internal platform, Documint, for auto-labeling and pre-processing. Documint significantly improved OCR performance by detecting document regions and correcting their orientation.
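To make the synthetic-data idea concrete, here is a minimal sketch of how text lines from a corpus could be turned into labeled OCR samples. It is not Grab's pipeline: `EXAMPLE_CORPUS`, `SyntheticSample`, `make_samples`, and the font-size and rotation ranges are all hypothetical. The sketch only builds the sample manifest (text plus rendering parameters); the actual rendering of each line into an image is left to an image library.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-in for a large text corpus, keyed by language code.
EXAMPLE_CORPUS = {
    "th": ["บัตรประจำตัวประชาชน"],
    "vi": ["Căn cước công dân"],
    "id": ["Kartu Tanda Penduduk"],
}


@dataclass
class SyntheticSample:
    text: str            # ground-truth label: the exact string to be rendered
    language: str        # language code the line was drawn from
    font_size: int       # assumed rendering parameter (points)
    rotation_deg: float  # assumed small skew, to mimic imperfect scans


def make_samples(corpus, n, seed=0):
    """Draw n text lines from the corpus and attach random rendering parameters.

    Because the text is chosen before rendering, every sample comes with a
    perfect label and needs no manual annotation.
    """
    rng = random.Random(seed)      # seeded so the manifest is reproducible
    languages = sorted(corpus)     # stable order regardless of dict ordering
    samples = []
    for _ in range(n):
        language = rng.choice(languages)
        samples.append(
            SyntheticSample(
                text=rng.choice(corpus[language]),
                language=language,
                font_size=rng.randint(14, 40),
                rotation_deg=rng.uniform(-5.0, 5.0),
            )
        )
    return samples
```

Sampling the language first, rather than sampling lines from a pooled corpus, is one simple way to keep under-represented scripts from being drowned out by larger ones.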

Grab's experimentation went through several phases. The team first tried Low-Rank Adaptation (LoRA) to fine-tune Qwen2-VL, which gave good results on Latin scripts but struggled with non-Latin scripts such as Thai. They then moved to full-parameter fine-tuning, which produced substantial accuracy gains on Thai and Vietnamese documents. Finally, to reduce resource usage further, Grab built a lightweight Vision LLM from scratch, with around 1 billion parameters. This new model combined the strengths of existing models and went through a four-stage training process, including projector alignment and vision tower enhancement, to ensure robust performance in document processing.
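For readers unfamiliar with LoRA, the sketch below shows the general technique on a single linear layer, in plain Python: the original weight matrix stays frozen and only a low-rank update `B @ A` is trained. The class and function names, the `alpha` scaling value, and the initialization constants are illustrative assumptions, not details from Grab's setup.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in a LoRA update: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


class LoRALinear:
    """A frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight        # frozen: never updated during fine-tuning
        self.scale = alpha / rank   # conventional LoRA scaling factor
        # A starts small and random, B starts at zero, so the update B @ A is
        # zero at first and the layer initially behaves like the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]
```

For a 1024 x 1024 layer, `lora_param_count(1024, 1024, 8)` gives 16,384 trainable parameters against 1,048,576 in the full matrix. That saving is why LoRA is cheap, and the restriction to a low-rank update is one plausible reason it can fall short of full-parameter fine-tuning when a model has to learn scripts it covers poorly.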
