Ollama has introduced a new engine for multimodal models that focuses on accuracy, model modularity, and memory management. It integrates vision and text models more tightly, improving local inference for tasks such as image recognition and reasoning. Planned work includes support for longer context sizes and other advanced capabilities.
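As a rough illustration of local image inference, the sketch below builds a request for Ollama's `/api/generate` REST endpoint, which accepts base64-encoded images in an `images` field. The URL assumes a default local install, and the model name and image path in the usage note are placeholders, not details from the announcement.

```python
import base64
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming generate payload carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_about_image(model: str, prompt: str, image_path: str) -> str:
    """Send the image and prompt to the local server and return the text reply."""
    with open(image_path, "rb") as f:
        payload = build_vision_request(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Supplying a request body makes urllib issue a POST.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With a server running and a vision-capable model already pulled, a call might look like `ask_about_image("<vision-model>", "What is in this picture?", "photo.jpg")`, where both the model name and the file are stand-ins.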
SGLang now supports Hugging Face transformers as a backend, combining SGLang's inference performance with the flexibility of the transformers library. The integration targets high-throughput, low-latency serving and lets SGLang run models it does not support natively, which simplifies deployment. Key features include automatic fallback to transformers when no native implementation is available and performance optimizations such as RadixAttention.
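The automatic fallback can be pictured as a simple dispatch: use a native implementation when one is registered for the model's architecture, otherwise take the generic transformers path. The sketch below is a conceptual toy, not SGLang's code or API. The registry, the function names, and the `impl` parameter are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry of architectures that have a hand-optimized native implementation.
NATIVE_IMPLS: Dict[str, Callable[[str], str]] = {
    "LlamaForCausalLM": lambda model_id: f"native:{model_id}",
}


def load_with_transformers(model_id: str) -> str:
    """Stand-in for the generic transformers-backed loading path."""
    return f"transformers:{model_id}"


def resolve_backend(architecture: str, model_id: str, impl: str = "auto") -> str:
    """Prefer a native implementation; otherwise fall back to transformers.

    Passing impl="transformers" (a hypothetical switch) forces the generic path.
    """
    if impl == "transformers":
        return load_with_transformers(model_id)
    native = NATIVE_IMPLS.get(architecture)
    if native is not None:
        return native(model_id)
    return load_with_transformers(model_id)
```

In this toy, a registered architecture resolves to the native path, an unregistered one silently takes the transformers path, and an explicit override skips the native path entirely.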