2 min read | Saved February 14, 2026
Do you care about this?
The article discusses the launch of GLM-4.6V and GLM-4.5V, two advanced vision-language models. GLM-4.6V features a 128K context and supports multimodal inputs, while GLM-4.5V excels in visual reasoning across various benchmarks. Both models offer distinct capabilities for image and video analysis.
If you do, here's more
GLM-4.6V Series has launched, featuring two models: GLM-4.6V (106B) and GLM-4.6V-Flash (9B). The flagship model offers a 128K context window, while the lightweight Flash version targets local deployment and low-latency tasks. A notable advancement is native Function Calling, introduced for the first time in this vision model family. Pricing for the flagship model is $0.6 per million input tokens and $0.9 per million output tokens; the Flash version is free. Both models handle multimodal inputs, generate structured image-text content, and support a seamless workflow from visual perception to reasoning.
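As a rough sketch of how that perception-to-reasoning workflow with Function Calling could look in practice, the request below sends an image plus one tool definition to an OpenAI-compatible chat endpoint. The base URL, the model id "glm-4.6v", and the lookup_product tool are illustrative assumptions, not details from the article; check the official API docs for the real values.

```python
# Minimal sketch (not from the article): a multimodal request with one tool
# the model may call natively. Endpoint, model id, and tool are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                    # assumed: key from the provider's platform
    base_url="https://api.z.ai/api/paas/v4",   # assumed endpoint; see the API docs
)

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_product",              # hypothetical function for illustration
        "description": "Look up a product detected in the image by name.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",                          # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/shelf.jpg"}},
            {"type": "text", "text": "Identify the highlighted product and look it up."},
        ],
    }],
    tools=tools,
)

# The reply either answers directly or contains a tool call to execute.
print(response.choices[0].message)
```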
Earlier, GLM-4.5V made headlines as a leading open-source visual reasoning model, posting top results across 41 benchmarks. Built on the GLM-4.5-Air base, it uses a 106B-parameter MoE architecture to scale efficiently. The model handles a wide range of visual reasoning tasks, including image and video understanding, GUI agent tasks, and complex document analysis. Its capabilities include precise localization of visual elements and in-depth analysis of research reports, making it versatile across applications.
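For the open-weights route, a minimal local-inference sketch might look like the following. The repo id "zai-org/GLM-4.5V", the pipeline task name, and the hardware notes are assumptions made for illustration; the article only points to the Hugging Face collection, so confirm the actual identifiers there.

```python
# Minimal sketch of running open GLM-4.5V weights locally with Hugging Face
# transformers. Repo id and task name are assumptions; a 106B MoE model
# realistically needs multiple GPUs or aggressive offloading.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",          # assumed task name for vision-language chat
    model="zai-org/GLM-4.5V",      # assumed repo id; see the HF collection
    device_map="auto",
    torch_dtype="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/annual_report_page.png"},
        {"type": "text", "text": "Summarize the key figures on this report page."},
    ],
}]

result = pipe(text=messages, max_new_tokens=256)
print(result[0]["generated_text"])  # final turn holds the model's reply
```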
Resources for both models include links to Hugging Face collections, GitHub repositories, and API documentation, facilitating user access and integration. Interested users can try GLM-4.6V at chat.z.ai, with further exploration available through the provided tech blog and API guides.