45 links tagged with image-generation
Links
Liquid is an innovative auto-regressive model that integrates visual comprehension and generation by tokenizing images into discrete codes and learning them alongside text tokens. This multimodal large language model operates within a shared feature space, allowing for seamless understanding and generation without relying on external visual embeddings. Liquid is available in multiple sizes and explores the scaling laws of multimodal models, revealing mutual benefits between understanding and generation tasks.
OpenAI is testing a watermark feature for images generated with GPT-4o in ChatGPT, prompted largely by the surge of users creating Studio Ghibli-style art. The watermark will be applied to images created by free users, while ChatGPT Plus subscribers will have the option to save images without it. An ImageGen API is also in development, which will let developers build their own applications.
ConceptAttention is an interpretability method designed for multi-modal diffusion transformers, specifically implemented for the Flux DiT architecture using PyTorch. The article provides installation instructions and a code example for generating images and concept attention heatmaps. It also references the associated research paper for further details.
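A minimal sketch of that usage, with class and argument names recalled from the project README; treat them as assumptions and verify against the repository:

```python
# Sketch based on the ConceptAttention README; names here are assumptions --
# check the repo for the exact API before running.
from concept_attention import ConceptAttentionFluxPipeline

pipeline = ConceptAttentionFluxPipeline(model_name="flux-schnell", device="cuda")

prompt = "A dog standing on grass under a tree"
concepts = ["dog", "grass", "tree", "sky"]

output = pipeline.generate_image(prompt=prompt, concepts=concepts)

output.image.save("generated.png")
# One saliency heatmap per queried concept.
for concept, heatmap in zip(concepts, output.concept_heatmaps):
    heatmap.save(f"{concept}_heatmap.png")
```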
A new library for ChatGPT-based image generation is now available, letting users create images from text prompts. The release broadens AI's reach in the creative field by making text-to-image generation easy to integrate into a variety of applications.
Qwen Chat provides a wide range of functionalities, including chatbot capabilities, image and video understanding, and image generation. It also supports document processing, web search integration, and tool utilization, making it a versatile solution for various tasks.
A novel image generation approach called Next Visual Granularity (NVG) is introduced, which decomposes images into structured sequences to progressively refine them from a global layout to fine details. The NVG framework allows for high-fidelity and diverse image generation by utilizing a hierarchical representation that guides the process based on input text and current canvas. Extensive training on the ImageNet dataset demonstrates NVG's superior performance compared to previous models, with clear scaling behavior and improved FID scores.
Imagen 4, Google's latest text-to-image model, is now available for paid preview in the Gemini API and for limited free testing in Google AI Studio. It includes two variants, Imagen 4 for general tasks and Imagen 4 Ultra for precision, both featuring improved text rendering and image generation quality. All generated images will include a non-visible digital watermark for trust and transparency.
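For reference, a minimal sketch of calling Imagen 4 through the google-genai Python SDK; the model id string is an assumption, so check the Gemini API model list for the current name:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

result = client.models.generate_images(
    model="imagen-4.0-generate-preview-06-06",  # assumed preview id
    prompt="A photorealistic red fox in a snowy birch forest",
)

# Each generated image carries raw bytes that can be written to disk.
with open("fox.png", "wb") as f:
    f.write(result.generated_images[0].image.image_bytes)
```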
Qwen-Image, a 20B MMDiT image foundation model, offers advanced capabilities in complex text rendering and image editing, outperforming existing models in various benchmarks. Its strengths include high-fidelity text generation in both English and Chinese, consistent image editing, and versatility in artistic styles, making it a powerful tool for content creators. The model aims to lower barriers in visual content creation and foster community engagement in generative AI development.
The article covers updates and features introduced at DevDay 2025, including the Sora 2 SDK, GPT-5 Pro, AgentKit, and new mini models for image generation and speech-to-speech. These releases aim to enhance user experiences and expand the functionality of applications powered by OpenAI technologies.
FLUX.1 Kontext [pro] is an advanced image generation and editing model that emphasizes prompt adherence. The article provides several examples of API usage for tasks such as image generation, chat completions, and audio processing using this model, although it is currently unsupported on Together AI.
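Since the model itself is flagged as unsupported there, here is only a hedged sketch of what a call through Together's images endpoint would look like, with the model id as a placeholder assumption:

```python
import base64
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.images.generate(
    model="black-forest-labs/FLUX.1-kontext-pro",  # placeholder; not yet served
    prompt="Change the car's color to deep blue, keep everything else intact",
    width=1024,
    height=768,
)

with open("edit.png", "wb") as f:
    f.write(base64.b64decode(response.data[0].b64_json))
```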
Google has launched the Gemini 2.5 Flash Image model, now available to developers and enterprises through the Gemini API, Google AI Studio, and Vertex AI. This production-ready tool offers advanced features for image generation and editing, supporting multiple aspect ratios and enabling real-time applications at competitive pricing. Developers are already incorporating it into various creative and educational workflows.
REPA-E introduces a family of end-to-end tuned Variational Autoencoders (VAEs) that significantly improve text-to-image (T2I) generation quality and training efficiency. The method enables effective joint training of VAEs and diffusion models, achieving state-of-the-art performance on ImageNet and enhancing latent space structure across various VAE architectures. Results show accelerated generation performance and better image quality, making E2E-VAEs superior replacements for traditional VAEs.
The article presents the Decoupled Diffusion Transformer (DDT) architecture, demonstrating improved performance with a larger encoder in a diffusion model framework. It achieves state-of-the-art FID scores on ImageNet benchmarks and allows for accelerated inference by reusing encoders across steps. The implementation provides detailed configurations for training and inference, along with online demos.
OmniCaptioner is a versatile visual captioning framework designed to generate detailed textual descriptions across various visual domains, including natural images, visual text, and structured visuals. It enhances visual reasoning with large language models (LLMs), improves image generation tasks, and allows for efficient supervised fine-tuning by converting pixel data into rich semantic representations. The framework aims to bridge the gap between visual and textual modalities through a unified multimodal pretraining approach.
OpenAI reported that ChatGPT users have generated over 700 million images within a week, highlighting the rapid growth and popularity of its image generation capabilities. The surge in usage reflects a significant increase in user engagement and interest in AI-generated content.
ByteDance has launched its AI image generation tool, Seedream 4.0, claiming it surpasses Google DeepMind's Nano Banana in key performance metrics like prompt adherence and aesthetics. While Seedream 4.0 combines the capabilities of its predecessors and offers faster image processing, it has yet to be evaluated by major benchmark firms. The tool is currently available to domestic users and corporate clients at competitive pricing.
OpenAI is expanding its image-generating feature, gpt-image-1, to other developers and applications, including Adobe's Firefly and tools like Figma and Wix. This follows a surge in usage where over 130 million users created 700 million images in just the first week. Additionally, Microsoft will integrate OpenAI's image generation into its Microsoft 365 Copilot app, enhancing competition with Google in the generative AI market.
NVIDIA has introduced a new AI blueprint that facilitates the integration between Blender and AI image generation tools, enhancing the workflow for 3D artists. This development aims to streamline the creative process, allowing users to leverage AI capabilities directly within their 3D modeling environment.
Llama 4 Scout is a state-of-the-art 109 billion parameter model designed for tasks such as multi-document analysis, codebase reasoning, and personalized tasks. While it currently lacks support on Together AI, the platform offers a variety of APIs for different functionalities including chat completions, image generation, audio transcription, and video creation. Users can register for an account to access the API and utilize free credits to start their projects.
Large diffusion models like Flux can generate impressive images but require substantial memory, making quantization an attractive option to reduce their size without significantly affecting output quality. The article discusses various quantization backends available in Hugging Face Diffusers, including bitsandbytes, torchao, and Quanto, and provides examples of how to implement these quantizations to optimize memory usage and performance in image generation tasks.
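As a concrete illustration, a minimal sketch of the bitsandbytes path on Flux (assumes diffusers, transformers, and bitsandbytes are installed; the model id and prompt are examples):

```python
# 4-bit NF4 quantization of the Flux transformer with the bitsandbytes backend.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize only the transformer, the largest component of the pipeline.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # further reduce peak VRAM

image = pipe("a tiny astronaut hatching from an egg on the moon",
             num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("flux_nf4.png")
```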
The article discusses the integration of ChatGPT with image generation capabilities, exploring how this combination can enhance user creativity and productivity. It highlights various applications and potential use cases, emphasizing the transformative impact of AI in visual content creation.
OpenAI has introduced new tools and features to its Responses API, enhancing capabilities for developers building agentic applications. Key updates include support for remote MCP servers, enhanced image generation, Code Interpreter integration, and improved reliability and privacy features for enterprises.
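A hedged sketch of invoking the built-in image generation tool through the Responses API, following OpenAI's announced tool types (field names may differ across SDK versions):

```python
import base64
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    input="Generate an image of a lighthouse at dawn in watercolor style",
    tools=[{"type": "image_generation"}],
)

# Image results arrive as base64-encoded tool-call outputs.
for item in response.output:
    if item.type == "image_generation_call":
        with open("lighthouse.png", "wb") as f:
            f.write(base64.b64decode(item.result))
```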
OpenAI has introduced the `gpt-image-1` model for image generation via its API, allowing developers to integrate high-quality image creation into their products. The model supports diverse styles and applications, with notable collaborations from companies like Adobe, Canva, and HubSpot to enhance creative and marketing processes.
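A minimal sketch with the OpenAI Python SDK; note that gpt-image-1 returns base64-encoded image data rather than URLs:

```python
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="A children's book illustration of a turtle reading a map",
    size="1024x1024",
)

with open("turtle.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```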
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
PixelFlow introduces a novel approach to image generation by operating directly in raw pixel space, eliminating the need for pre-trained Variational Autoencoders. This method enhances the image generation process with efficient cascade flow modeling, achieving a competitive FID score of 1.98 on the ImageNet benchmark while offering high-quality and semantically controlled image outputs. The work aims to inspire future developments in visual generation models.
Representation Autoencoders (RAEs) enhance diffusion transformers by leveraging pretrained encoders and lightweight decoders to achieve superior image generation results, outperforming traditional methods like SD-VAE. The study reveals that RAE's reconstruction quality is high, and for optimal performance, the model width must match or exceed the encoder's token dimension. Additionally, the proposed DiTDH model demonstrates significant efficiency and effectiveness, setting new state-of-the-art scores in image generation tasks.
HunyuanImage-3.0 has been released as an open-source image generation model, featuring a unified multimodal architecture that integrates text and image understanding. It is billed as the largest open-source image generation Mixture-of-Experts model at 80 billion parameters, and supports extensive customization through various checkpoints and performance optimizations.
Llama 4 Maverick is a state-of-the-art multilingual model designed for image and text understanding, creative writing, and enterprise applications. While it is not yet supported on Together AI, users can register for an account to access the platform's other APIs, including image generation, chat completions, and audio transcription, as well as endpoints for video generation and embeddings.
Google has launched Gemini 2.5 Flash Image, an advanced image generation and editing model that allows users to blend multiple images, maintain character consistency, and execute targeted transformations using natural language. The model is available through the Gemini API and Google AI Studio for developers, priced at $30 per million output tokens, and includes features for creating custom apps and educational tools. All generated images will carry an invisible digital watermark for identification as AI-generated content.
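A hedged sketch of an editing call with the google-genai SDK; the preview model id is an assumption that may have changed since launch:

```python
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
photo = Image.open("photo.png")  # the image to edit

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed id
    contents=[photo, "Make the sky look like golden hour, keep the subject unchanged"],
)

# Edited images come back as inline binary parts alongside any text.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("edited.png", "wb") as f:
            f.write(part.inline_data.data)
```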
PixelFlow introduces a novel family of image generation models that operate directly in pixel space, eliminating the need for pre-trained VAEs and allowing for end-to-end training. By utilizing efficient cascade flow modeling, it achieves impressive image quality with a low FID score of 1.98 on the ImageNet benchmark, showcasing its potential for both class-to-image and text-to-image tasks. The model aims to inspire future advancements in visual generation technologies.
OpenAI is experimenting with visible and invisible watermarks for images generated by its ChatGPT-4o model to enhance content traceability and compliance. The visible watermark, labeled “ImageGen,” is being tested for free-tier users while paid users will receive images without watermarks. This move aligns with broader industry efforts to improve attribution for AI-generated content.
AI image generation can be both rewarding and challenging, with varying results based on the prompts used. Experts share tips and techniques for leveraging different AI tools effectively, including using style codes, brainstorming image ideas, and refining prompts. The article highlights the importance of experimentation and creativity in producing quality images with AI.
A stealth AI model has outperformed well-known competitors like DALL-E and Midjourney on a popular benchmark, demonstrating its advanced capabilities in image generation. The creators of this model have successfully secured $30 million in funding to further develop their technology.
GigaTok is a novel method designed for scaling visual tokenizers to 3 billion parameters, addressing the reconstruction vs. generation dilemma through semantic regularization. It offers a comprehensive framework for training and evaluating tokenizers, alongside various model configurations and instructions for setup and usage. The project is a collaboration involving extensive research and experimentation, with resources available for further exploration.
A new model labeled "GPT-5 Mini Scout" briefly appeared in ChatGPT's model selector, sparking speculation about its connection to a new Company Knowledge feature for enterprise users. An update in OpenAI’s JavaScript library hinted at the model being named "GPT-5.1 Mini," suggesting significant advancements in image generation capabilities. The potential rollout for this model is anticipated in November, possibly in response to competitors like Google's Gemini 3.
ByteDance has introduced a new AI image model aimed at competing with Google DeepMind's Nano Banana, showcasing advancements in image generation technology. This development highlights the growing rivalry in the AI landscape, particularly among major tech companies.
Google Cloud has expanded Vertex AI with three new generative AI media models: Imagen 4 for high-quality image generation, Veo 3 for advanced video creation with audio, and Lyria 2 for music generation. These tools aim to enhance content creation efficiency and creativity across various industries, enabling users to produce stunning visual and audio assets more rapidly.
Generating detailed images with AI has become more accessible by connecting Claude to Hugging Face Spaces, enabling users to leverage advanced models like FLUX.1 Krea and Qwen-Image. These models enhance image realism and text quality, allowing for creative projects such as posters and marketing materials. Users can easily configure and switch between these models to achieve desired results.
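For readers who want to call the same Spaces directly rather than through Claude, a hedged sketch with gradio_client; the Space id and endpoint name are assumptions, so inspect the Space's API first:

```python
from gradio_client import Client

client = Client("black-forest-labs/FLUX.1-Krea-dev")  # assumed Space id
print(client.view_api())  # list the Space's real endpoints before calling

result = client.predict(
    "A retro travel poster for Mars with bold typography",
    api_name="/infer",  # assumed endpoint name; confirm via view_api()
)
print(result)  # typically a local file path to the generated image
```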
Midjourney has unveiled its latest AI image model, marking its first significant release in nearly a year. The new model focuses on enhanced image generation capabilities, providing users with improved tools for creative expression. This update reflects Midjourney's commitment to advancing AI technology in the visual arts.
HiDream-I1 is an open-source image generative foundation model boasting 17 billion parameters, delivering high-quality image generation in seconds. Its recent updates include the release of various models and integrations with popular platforms, enhancing its usability for developers and users alike. For full capabilities, users can explore additional resources and demos linked in the article.
VARGPT-v1.1 is a powerful multimodal model that enhances visual understanding and generation capabilities through iterative instruction tuning and reinforcement learning. It includes extensive code releases for training, inference, and evaluation, as well as a comprehensive structure for multimodal tasks such as image captioning and visual question answering. The model's checkpoints and datasets are available on Hugging Face, facilitating further research and application development.
UCGM is an official PyTorch implementation that provides a unified framework for training and sampling continuous generative models, such as diffusion and flow-matching models. It enables significant acceleration of sampling processes and efficient tuning of pre-trained models, achieving impressive FID scores across various datasets and resolutions. The framework supports diverse architectures and offers tools for both training and evaluating generative models.
Google has launched a preview of its Gemini 2.0 Flash image generation capabilities, enabling developers to integrate enhanced conversational image generation and editing with improved visual quality and reduced filter block rates. The Gemini API is available through Google AI Studio and Vertex AI, encouraging developers to explore its functionalities, including recontextualizing products in new environments.
xAI is set to enhance its Grok app with the introduction of a new character, Valentin, and a feature called Imagine that enables infinite image and video generation with sound. These updates aim to attract creative users, particularly women, by offering customizable experiences and a focus on user-generated content. The launch is anticipated to coincide with the release of GPT-5, positioning Grok as a competitive player in the generative AI landscape.
CogView4-6B is a text-to-image generation model that supports a range of resolutions and offers optimized memory usage through CPU offloading. The model has demonstrated impressive performance benchmarks compared to other models like DALL-E 3 and SDXL, achieving high scores across various evaluation metrics. Users can install the necessary libraries and use a provided code snippet to generate images based on detailed prompts.
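A minimal sketch following the model card's diffusers usage, including the CPU-offload and VAE memory optimizations the summary mentions:

```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B",
                                        torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade speed for lower VRAM
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = pipe(
    prompt="A vibrant cherry red sports car under spotlights in a showroom",
    guidance_scale=3.5,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]
image.save("cogview4.png")
```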