The GLM-4.6V series has launched, featuring two models: GLM-4.6V (106B) and GLM-4.6V-Flash (9B). The flagship model offers a 128K context window, while the lightweight version targets local and low-latency tasks. A notable advancement is native Function Calling within this vision model family, which makes the models easier to wire into agentic workflows. Pricing for the flagship model is $0.6 per million input tokens and $0.9 per million output tokens; the Flash version is free. Both models handle multimodal inputs, generate structured image-text content, and support a seamless workflow from visual perception to reasoning.
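To make the function-calling feature concrete, here is a minimal sketch of how a request pairing an image with a tool definition might be assembled, assuming the API follows the widely used OpenAI-compatible chat-completions shape. The model identifier, tool name, and schema below are illustrative assumptions, not confirmed details from the announcement.

```python
# Hypothetical sketch: building a function-calling request for a vision
# model, assuming an OpenAI-compatible chat-completions payload shape.
# "glm-4.6v" and "lookup_product" are assumed names for illustration.

def build_vision_tool_request(image_url: str, question: str) -> dict:
    """Assemble a request payload combining an image, a question,
    and a callable tool the model may choose to invoke."""
    return {
        "model": "glm-4.6v",  # assumed model identifier
        "messages": [
            {
                "role": "user",
                # Multimodal content: an image part plus a text part.
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "lookup_product",  # hypothetical tool
                    "description": "Look up a product spotted in the image.",
                    "parameters": {
                        "type": "object",
                        "properties": {"name": {"type": "string"}},
                        "required": ["name"],
                    },
                },
            }
        ],
    }

payload = build_vision_tool_request(
    "https://example.com/photo.jpg", "What product is shown here?"
)
```

If the model decides the tool is needed, the response would carry a `tool_calls` entry with JSON arguments for `lookup_product`, which the caller executes and feeds back in a follow-up message.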
Earlier, GLM-4.5V made headlines as a leading open-source visual reasoning model, posting top results across 41 benchmarks. Built on the GLM-4.5-Air base, it uses a 106B-parameter MoE architecture to scale efficiently. The model handles a range of visual reasoning tasks, including image and video understanding, GUI tasks, and complex document analysis. Its capabilities include precise localization of visual elements and thorough analysis of research reports, making it versatile across applications.
Resources for both models include Hugging Face collections, GitHub repositories, and API documentation, easing access and integration. Interested users can try GLM-4.6V at chat.z.ai, with further details in the accompanying tech blog and API guides.