JiwaAI
Blog
โ†All posts
image-generation
ai
prompt-engineering
engineering
machine-learning

Telling the AI Which Image Is Which

Jiwa AI Team

The Silent Assumption

For months, our UGC image generation worked like this: we assembled up to nine reference images — an influencer face cutout, several product component images, and a couple of moodboard reference photos — then passed them all to the generation model alongside a detailed text prompt describing the scene we wanted.

The model performed reasonably well. It usually got the face right. It often got the product in the right general area. The moodboard sometimes influenced the background style, and sometimes it influenced the person's clothing instead.

We had been assuming the model understood the implicit roles of each reference image. It didn't. We had never told it.

What the Model Actually Sees

When a multi-reference image generation model receives nine images alongside a text prompt, it treats them as a pool of visual conditioning signals. It tries to extract relevant information from each image and blend those signals with the text instruction. But it has no inherent way of knowing that image three is a moodboard reference for the background and not a product to be held, or that image one is specifically the person's face and not merely a style guide.
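The role-free setup described above can be sketched roughly like this. Everything here is a hypothetical illustration — the `generate`-style request shape, the filenames, and the prompt wording are assumptions, not our actual API:

```python
# Hypothetical sketch of the original, role-free request. The model
# receives a flat list of images and must guess what each one is for.
reference_images = [
    "face_cutout.png",       # influencer face
    "product_front.png",     # product component
    "product_side.png",      # product component
    "moodboard_cafe.png",    # background inspiration
    "moodboard_sunset.png",  # background inspiration
]

prompt = (
    "A lifestyle photo of a young woman holding a skincare bottle "
    "on a sunlit terrace, warm tones, shallow depth of field."
)

# Nothing in the prompt says which image is the face, which is the
# product, or which is only a style guide -- the model blends all
# five conditioning signals however it sees fit.
request = {"prompt": prompt, "images": reference_images}
```

Note that the text and the image list never touch: the prompt describes a scene, the list supplies pixels, and the mapping between them is left entirely to the model.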

The model was doing its best to reconcile nine ambiguous conditioning signals with a text prompt that never clarified what any of them meant.

The missing piece was simple: explicit directives. Modern multi-reference models support inline image addressing syntax — a way to name specific references directly inside the prompt text. This feature lets the prompt say, in effect, "the person in this image should do this action, with the product from that image, in a setting inspired by those images." The model stops guessing and starts following a composition brief.
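In its simplest form, inline addressing just names references by position inside the prompt text. The "image N" convention below is an assumption — substitute whatever addressing syntax your model documents:

```python
# Minimal sketch of an addressed prompt. Each reference is named by
# its 1-based position in the image list, so roles are explicit.
prompt = (
    "The person from image 1 holds the product from image 2 at eye "
    "level. Use image 4 and image 5 only as style references for the "
    "background lighting and color palette, not as content to copy."
)
```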

From Description to Directed Composition

Before the change, our prompt structure opened with general scene-setting — theme, mood, a paragraph establishing what kind of image we wanted. The reference images sat alongside, doing their best to influence the output.

After the change, the prompt opens with a targeted composition directive. It names the face reference by number. It names the product component references by number. It identifies the moodboard references as background style guides rather than content to replicate. It specifies the interaction between the person and the product in terms of the specific references, not generic descriptions.

The directive is structured as a brief: generate a scene where this person does this action with this product, and derive the background aesthetic — not the specific objects — from these reference images. Do not copy the moodboard images. Extract their color grading, their lighting mood, their sense of place, and apply those qualities to a freshly imagined background.
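A composition brief like the one above can be assembled mechanically once each reference has an assigned role. This is a sketch under assumptions — the function name, the "image N" addressing convention, and the exact wording are illustrative, not our production template:

```python
def build_composition_brief(face_idx, product_idxs, moodboard_idxs,
                            action, product_name):
    """Assemble a directed composition brief that names each
    reference by its 1-based position in the image list.

    Hypothetical sketch: adapt the addressing syntax and phrasing
    to whatever your multi-reference model expects.
    """
    products = " and ".join(f"image {i}" for i in product_idxs)
    moods = ", ".join(f"image {i}" for i in moodboard_idxs)
    return (
        f"Generate a scene where the person from image {face_idx} "
        f"{action} the {product_name} shown in {products}. "
        f"Derive the background aesthetic from {moods}: extract "
        "their color grading, lighting mood, and sense of place, "
        "and apply those qualities to a freshly imagined background. "
        "Do not copy any object, person, or location from those images."
    )

brief = build_composition_brief(
    face_idx=1,
    product_idxs=[2, 3],
    moodboard_idxs=[4, 5],
    action="holds",
    product_name="serum bottle",
)
```

The point of the template is that every reference index appears in exactly one role clause, so no image is left for the model to interpret on its own.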

Why the Moodboard Distinction Matters

The most important clarification in the new directives is about moodboard references: the instruction to use them as inspiration for atmosphere, not as content to reproduce.

Without this distinction, the model sometimes carried people or objects from the moodboard images into the generated scene. If the moodboard image showed an outdoor café table, the model might generate the same table as the product's resting surface, even though the brand's product had nothing to do with cafés. The model was doing what it thought it was supposed to do — incorporate the reference — just at the wrong conceptual level.

The explicit directive resolves this. The model now understands that the moodboard images carry lighting and mood information, not subject matter. The output background feels like the brand's aesthetic without being a copy of any reference image in the pool.

The Kontext Follow-Through

The composition directive upgrade extends into our post-processing pipeline as well. After the initial multi-reference generation, we run a refinement pass through a separate image editing model to blend lighting and resolve compositing artifacts. We updated that refinement prompt to use the same directed editing approach: name the source image explicitly, specify exactly what to fix, and enumerate every element — face, product, pose, clothing, background — that must not change.

This specificity reduced cases where the refinement pass overcorrected. A vague instruction like "make this look more natural" gave the editing model too much latitude. An explicit instruction to blend lighting while preserving named elements gives it a clear target. The refinement becomes surgical rather than speculative.
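The same pattern applies to the refinement prompt: one named fix, plus an explicit freeze list. Again a hedged sketch — the function, wording, and preserved-element names are assumptions for illustration:

```python
def build_refinement_prompt(preserve):
    """Directed edit instruction for the refinement pass: state the
    single fix, then enumerate everything that must not change.
    Illustrative phrasing only, not a documented API."""
    frozen = ", ".join(preserve)
    return (
        "In the provided image, blend the lighting on the person and "
        "product so it matches the background's light direction and "
        f"color temperature. Do not change the {frozen}."
    )

edit_prompt = build_refinement_prompt(
    ["face", "product", "pose", "clothing", "background composition"]
)
```

The freeze list is what turns a speculative edit into a surgical one: the model is told not just what to touch, but everything it must leave alone.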

The Gap Between Capability and Usage

The explicit image addressing feature has been part of the multi-reference API specification for some time. We had access to it from the day we integrated the model. What changed was not the model's capabilities but our understanding of how to engage them.

This is a recurring pattern in working with generative AI systems at production scale. The model can do things that your prompt doesn't ask it to do. The gap between what the model can produce and what your pipeline actually generates is often a gap in instruction clarity rather than a gap in model capability.

Nine reference images with ambiguous roles produced plausible results. Nine reference images with explicit role assignments produce directed ones. The difference isn't in the data we pass — it's in how clearly we tell the model what each piece of data means.

The best prompt isn't the longest one. It's the one that leaves the fewest assumptions unaddressed.