
How Grab Built a Vision LLM to Scan Images

6 min read | Saved February 14, 2026

vision-llm 🤖 ocr 🤖 southeast-asian 🤖 document-processing 🤖 machine-learning 🤖

Do you care about this?

Grab built a specialized Vision LLM to extract information more accurately from user documents for eKYC verification. After running into the limits of traditional OCR systems, the team fine-tuned existing models and ultimately produced a model that can process Southeast Asian languages and diverse document formats. The article details their technical approach and training methods.

If you do, here's more

Grab relies on user-submitted documents for eKYC verification, and traditional Optical Character Recognition (OCR) systems handled them poorly. Existing models struggled with Southeast Asian languages and varied document formats, leading to errors and high latency. In response, Grab's engineering team decided to build a specialized Vision Large Language Model (LLM) tailored to their needs. They chose Qwen2-VL 2B as the base model for three reasons: its manageable size of 2 billion parameters, its effective support for Southeast Asian languages, and its ability to process images at their native resolution, which is vital for accurate OCR.

To train this model, Grab used two main strategies for dataset generation. The first was a synthetic OCR dataset: the team extracted text from Common Crawl and rendered it into images with various fonts and backgrounds. This produced an extensive and varied dataset covering multiple languages, including Bahasa Indonesia, Thai, and Vietnamese. The second strategy used Documint, an internal platform designed for auto-labeling and preprocessing real documents. It includes modules for document detection, orientation correction, OCR extraction, and Key Information Extraction (KIE), and it produced high-quality labeled datasets.
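The synthetic-data idea can be sketched in a few lines. The example below is only an illustration under stated assumptions, not Grab's pipeline: the font names, background names, font-size range, `RenderJob`, and `make_render_jobs` are all hypothetical, and the sample corpus stands in for text extracted from Common Crawl. It pairs each text line with randomly chosen rendering settings and stops short of drawing the image, which an imaging library would do from each job.

```python
import random
from dataclasses import dataclass

# Hypothetical font and background pools; the article does not list the real ones.
FONTS = ["NotoSans-Regular", "NotoSansThai-Regular", "NotoSerif-Regular"]
BACKGROUNDS = ["white", "paper_texture", "gradient_gray"]


@dataclass(frozen=True)
class RenderJob:
    text: str        # ground-truth label the model should read back
    language: str    # language code of the source line
    font: str
    background: str
    font_size: int


def make_render_jobs(corpus, per_line=2, seed=0):
    """Pair each non-empty text line with randomly chosen rendering settings."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    jobs = []
    for language, lines in corpus.items():
        for line in lines:
            text = line.strip()
            if not text:
                continue  # skip blank lines from the crawl
            for _ in range(per_line):
                jobs.append(
                    RenderJob(
                        text=text,
                        language=language,
                        font=rng.choice(FONTS),
                        background=rng.choice(BACKGROUNDS),
                        font_size=rng.randint(14, 40),
                    )
                )
    return jobs


# Stand-in corpus; real text would come from a crawl dump.
corpus = {
    "id": ["Selamat pagi", "Nomor induk kependudukan"],
    "th": ["สวัสดี"],
    "vi": ["Xin chào", "  "],
}
jobs = make_render_jobs(corpus)
print(len(jobs), "render jobs")
```

Because the label is the text that was rendered, every generated image comes with exact ground truth, which is what makes this kind of dataset cheap to scale across languages.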

The development process unfolded in three phases, beginning with Low-Rank Adaptation (LoRA) fine-tuning. LoRA freezes the base model's weights and trains only small low-rank adapter matrices, so it updates just a fraction of the model's parameters. This made the first phase efficient and reduced the need for extensive retraining. Grab's systematic approach to building and refining its Vision LLM shows how much work it takes to adapt AI technologies to regional conditions, particularly language diversity and document formatting.
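To make "only a fraction of the model's parameters" concrete, here is a minimal pure-Python sketch of the general LoRA formulation, where a frozen weight matrix W is augmented with a scaled low-rank product B·A. The layer size, rank, and alpha below are illustrative assumptions and are not taken from the article or from Qwen2-VL's actual configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Tiny example: B starts at zero, so the adapted layer initially matches the base layer.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen 2x3 weight
A = [[0.5, -0.5, 0.25]]                   # rank-1 factor, 1x3
B = [[0.0], [0.0]]                        # rank-1 factor, 2x1, zero-initialized
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x, alpha=16, rank=1)

# Parameter count for one hypothetical 1536x1536 projection at rank 8.
d_out, d_in, rank = 1536, 1536, 8
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full
print(f"full={full}, lora={lora}, fraction={fraction:.4f}")
```

For this assumed layer, the adapter trains 24,576 parameters against 2,359,296 in the full matrix, roughly 1%. The real ratio depends on which layers receive adapters and on the chosen rank.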
