OpenAI’s new AI image generator is potent and bound to provoke
arstechnica.com
On Tuesday, OpenAI announced new multimodal image-generation capabilities that are directly integrated into its GPT-4o AI language model, making it the default image generator within the ChatGPT interface. The integration, called "4o Image Generation" (which we'll call "4o IG" for short), allows the model to follow prompts more accurately (with better text rendering than DALL-E 3) and respond to chat context for image modification instructions.
4o IG represents a shift to "native multimodal image generation," where the large language model processes and outputs image data directly as tokens. That's a big deal, because it means image tokens and text tokens share the same neural network. It leads to new flexibility in image creation and modification.
OpenAI baked in multimodal image generation capabilities when GPT-4o launched in May 2024 (when the "o" in GPT-4o was touted as standing for "omni" to highlight its ability to both understand and generate text, images, and audio), but the company has taken over 10 months to deliver the functionality to users, despite OpenAI president Greg Brockman teasing the feature on X last year.
OpenAI was likely goaded by last week's release of Google's multimodal LLM-based image generator, "Gemini 2.0 Flash (Image Generation) Experimental." The tech giants continue their AI arms race, with each attempting to one-up the other.
And perhaps we know why OpenAI waited: At a reasonable resolution and level of detail, the new 4o IG process is extremely slow, taking anywhere from 30 seconds to one minute (or longer) for each image.