Google just announced native image generation in Gemini 2.0 Flash and a new TTS model in Gemini 3.1 Flash. The image gen is available in AI Studio and the API right now. The TTS model has granular audio tags for controlling prosody and tone. Both announcements frame this as expanding what the model can do natively.
The pattern here is familiar. New modality gets added to a model. Demos look clean. Then it hits production, and the same problems that killed the last wave of multimodal experiments show up again.
The first problem is the same integration wall that breaks most AI projects. Native image gen means the model can output an image instead of text. That sounds simple until you try to plug it into an actual workflow. What happens when the generated image needs to go into a CMS that expects S3 URLs? Or a design review tool that only ingests PNGs with specific metadata? Or a notification system that was built assuming text-only outputs? You end up writing adapter layers to bridge the model’s output format to the system’s input format. That adapter code becomes the thing that breaks when Google changes the API response shape or adds a new image format.
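To make the adapter problem concrete, here is a minimal sketch of that kind of layer. Everything in it is an assumption for illustration: the `part` dict with `mime_type` and `data` keys is a made-up response shape, not Google's actual API, and `upload` stands in for whatever storage client the CMS sits behind. What it shows is an adapter that fails loudly when the shape or format changes instead of passing something odd downstream.

```python
import base64
from dataclasses import dataclass
from typing import Callable

# Hypothetical: formats the downstream CMS accepts, mapped to file extensions.
ACCEPTED_MIME_TYPES = {"image/png": "png"}


class AdapterError(Exception):
    """Raised when model output does not match what the adapter expects."""


@dataclass(frozen=True)
class CmsAsset:
    url: str
    mime_type: str
    size_bytes: int


def to_cms_asset(part: dict, upload: Callable[[bytes, str], str]) -> CmsAsset:
    """Turn one (hypothetical) image part into the record the CMS expects."""
    try:
        mime_type = part["mime_type"]
        encoded = part["data"]
    except (KeyError, TypeError) as exc:
        # The response shape changed: stop here rather than guess.
        raise AdapterError(f"unexpected response shape: {exc!r}") from exc

    if mime_type not in ACCEPTED_MIME_TYPES:
        raise AdapterError(f"unsupported image format: {mime_type}")

    try:
        raw = base64.b64decode(encoded, validate=True)
    except (ValueError, TypeError) as exc:
        raise AdapterError("image payload is not valid base64") from exc

    # upload() is an injected stand-in that stores the bytes and returns a URL.
    url = upload(raw, ACCEPTED_MIME_TYPES[mime_type])
    return CmsAsset(url=url, mime_type=mime_type, size_bytes=len(raw))
```

Passing `upload` in as a parameter keeps the storage side swappable and lets the adapter be tested without touching a real bucket.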
The second problem is that adding modalities increases the surface area for failure without increasing the value proportionally. A model that can generate text and images has two ways to produce garbage instead of one. If your prompt is ambiguous, the model might generate an image when you wanted text, or text when you wanted an image. Now you need routing logic to decide which modality to invoke. And error handling for when the image gen fails but the text gen succeeds. And fallback paths for when the user’s context doesn’t support images. Every branch is another place the system can degrade silently.
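A rough sketch of what that routing and fallback logic ends up looking like, even in the simplest two-modality case. The names here (`Modality`, `Result`, `generate`, and the injected `text_gen` and `image_gen` callables) are hypothetical and not tied to any real SDK. The one deliberate choice is that every fallback is recorded on the result, so a degraded response is at least visible rather than silent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"


@dataclass
class Result:
    modality: Modality
    payload: object
    degraded: bool = False        # True when a fallback path was taken
    reason: Optional[str] = None  # why the fallback was taken


def generate(
    prompt: str,
    wanted: Modality,
    client_supports_images: bool,
    text_gen: Callable[[str], str],
    image_gen: Callable[[str], bytes],
) -> Result:
    """Route a request to one modality, recording any fallback explicitly."""
    if wanted is Modality.TEXT:
        return Result(Modality.TEXT, text_gen(prompt))

    # Branch 1: the caller wanted an image but the client cannot show one.
    if not client_supports_images:
        return Result(Modality.TEXT, text_gen(prompt),
                      degraded=True, reason="client cannot display images")

    # Branch 2: image generation fails while text generation still works.
    try:
        return Result(Modality.IMAGE, image_gen(prompt))
    except Exception as exc:
        return Result(Modality.TEXT, text_gen(prompt),
                      degraded=True, reason=f"image generation failed: {exc}")
```

That is three branches for two modalities, before audio is involved, and it still assumes something upstream already decided what `wanted` should be.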
The TTS model has the same issue. Granular audio tags for controlling prosody sound useful until you realise someone has to decide what values to pass for those tags in every context. If it’s user-facing, that’s a UI problem. If it’s automated, that’s a prompt design problem. Either way, it’s more surface area and more ways for the output to drift from what anyone expected.
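One way to keep that decision in a single reviewable place is a small preset table, sketched below. The context names and the `tone` and `pace` keys are placeholders invented for illustration; they are not the model's real tag vocabulary, and turning them into actual audio tags is left out on purpose.

```python
from typing import Dict

# Hypothetical contexts and values; the real tag names would come from
# the TTS model's documentation, not from this sketch.
PROSODY_PRESETS: Dict[str, Dict[str, str]] = {
    "order_confirmation": {"tone": "warm", "pace": "normal"},
    "outage_alert": {"tone": "calm", "pace": "slow"},
}


def prosody_for(context: str) -> Dict[str, str]:
    """Return a copy of the reviewed preset for a context, or refuse."""
    try:
        return dict(PROSODY_PRESETS[context])
    except KeyError:
        # No silent default: an unknown context means nobody decided yet.
        raise ValueError(
            f"no reviewed prosody preset for context {context!r}"
        ) from None
```

Refusing unknown contexts is the point of the sketch: a silent default is exactly how output drifts from what anyone expected.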
The third problem is ownership. When a text-only system breaks, it’s clear who fixes it. When a system generates text and images and audio, it’s not clear whether the bug is in the model, the adapter layer, the downstream service, or the original prompt. So the issue sits in a backlog while different teams argue about whose problem it is. Eventually someone duct-tapes a workaround and the system keeps running in a degraded state.
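Part of that argument is a tooling gap: the failure shows up at the end of the pipeline with no record of where it started. A minimal sketch of one mitigation, assuming the path can be modelled as a list of named stages (the stage names and the `run_pipeline` helper are hypothetical), is to tag every failure with the stage that raised it.

```python
from typing import Any, Callable, List, Tuple


class StageError(Exception):
    """Wraps a failure with the name of the pipeline stage that raised it."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage
        self.cause = cause


def run_pipeline(value: Any, stages: List[Tuple[str, Callable[[Any], Any]]]) -> Any:
    """Run stages in order, attributing any exception to its stage."""
    for name, step in stages:
        try:
            value = step(value)
        except Exception as exc:
            raise StageError(name, exc) from exc
    return value
```

This does not settle who owns the fix, but it removes the first round of guessing about which layer broke.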
The teams that actually get value from multimodal models don’t start by adding every modality the API supports. They pick one problem where a specific modality solves a real workflow gap. They keep the integration narrow. They make sure one person owns the whole path from model output to downstream action. They resist the urge to add image gen just because it’s available now.
Google shipping native image gen and TTS in Flash models is not the unlock. The unlock is whether anyone can resist building the same overbuilt multimodal pipelines that collapsed under their own complexity last time. My guess is that most teams won’t, and we’ll see the same demos-that-never-ship cycle play out again.