Nine References, One Shot: How We Pushed UGC Image Fidelity Further
The Reference Bottleneck
After rearchitecting our UGC pipeline to use flux-2-pro/edit as the primary generation engine, we hit an unexpected limitation in our own implementation: we were capping reference images at four.
Four is a reasonable starting point. One face, one product, one moodboard: that's three. You'd rarely need more, or so we thought.
The reality turned out to be different. Products have multiple components. A skincare line has a pump bottle, a compact, and a serum. A food brand has the jar, the lid, and a serving platter. When we passed only the full-product photo as a reference, the model would reproduce the product's general color and shape but miss component-level details that actually matter to the brand: the embossed logo on the jar lid, the distinctive cap shape, the exact packaging texture.
At the same time, our moodboard was effectively limited to a single image: we'd built in support for multiple moodboard images but only passed the first one. Brands often have two or three reference images that together define their visual identity. Using only one left half the style signal on the table.
Filling the Slots
The fal-ai/flux-2-pro/edit model supports up to nine simultaneous reference images. We were using at most four. Closing that gap required deciding what priority order the references should follow.
We landed on a fixed order: face first, then product segments, then moodboard.
The reasoning is straightforward. Face identity is the hardest constraint to satisfy. If the model is forced to make trade-offs between references due to conflicting conditioning signals, we want it to anchor the face before anything else. Product segments come next: they carry product-level visual grounding that prevents the model from inventing packaging or hallucinating the label design. Moodboard images are last. They're softer style signals, and the model can draw from them even at lower attention weight.
For product segments specifically, we now include the SAM2-segmented component cutout first, followed by the full product image set. A business with three product images gets all three passed as references. The final cap is nine images total: face at slot zero, up to six product refs in slots one through six, and up to two moodboard images in the remaining slots.
One structural rule we enforced: product references are capped at six regardless of how many are available. This guarantees that moodboard images always get at least one slot in the final reference array. Without this cap, a business with eight product images would crowd out the moodboard entirely, and the generated scene would lose brand visual identity in favor of product accuracy, an overcorrection in the wrong direction.
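The slot-filling logic is simple enough to sketch. A minimal version, assuming references are passed as URL strings; the helper name and constants are illustrative, not our production code:

```python
MAX_PRODUCT_REFS = 6    # hard cap so moodboard images always get a slot
MAX_MOODBOARD_REFS = 2  # remaining slots after face + product refs

def build_reference_slots(face_url, product_urls, moodboard_urls):
    """Order references for flux-2-pro/edit: face, product segments, moodboard.

    product_urls is expected with the SAM2-segmented component cutout first,
    followed by the full product image set.
    """
    refs = [face_url]                             # slot 0: face identity
    refs += product_urls[:MAX_PRODUCT_REFS]       # slots 1-6: product grounding
    refs += moodboard_urls[:MAX_MOODBOARD_REFS]   # final slots: style signal
    return refs                                   # at most 1 + 6 + 2 = 9
```

For a business with eight product images and three moodboard images, this yields exactly nine references: the face, the first six product images, and two moodboard images.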
The Naturalization Problem
Multi-reference generation solves a fidelity problem. It creates a different problem: compositing artifacts.
When a model conditions simultaneously on a face, a product, and a moodboard, it's synthesizing constraints that may have come from very different photographic contexts. The reference images were taken in different lighting environments, at different times of day, with different camera settings. The model's job is to produce a coherent scene from incoherent inputs.
The output is usually plausible. But plausible isn't the same as photographed. The skin might have a slightly different color temperature than the background. The product might sit with correct geometry but wrong shadows. The moodboard style might bleed into the lighting in a way that conflicts with the scene the model tried to compose around the face. These artifacts are subtle (they don't fail a realism gate that's looking for AI tells at the broad-stroke level), but they're visible to anyone who looks closely at the images.
The fix is a second pass.
After the flux-2-pro/edit generation produces a raw composite, we pass that output image through flux-pro/kontext with a photorealism refinement prompt. Kontext is an image editor: the generated scene is the canvas it modifies in place. We ask it to unify lighting across all elements, integrate shadows naturally, and restore skin texture to authentic levels with visible pores and natural variation. Composition, people, products, and poses are explicitly preserved.
This adds one API call to the pipeline at a cost of around four cents. The output is the image we'd always wanted the first pass to produce: a scene where every element looks like it was photographed together, not assembled from separate sources.
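In code, the two-pass flow looks roughly like this. It's a hedged sketch: the model IDs come from the text above, but the `fal_client.subscribe` parameter names (`image_urls`, `image_url`) and the response shape are assumptions, not verified signatures, and the refinement prompt is a paraphrase:

```python
NATURALIZE_PROMPT = (
    "Unify lighting across all elements, integrate shadows naturally, and "
    "restore skin texture to authentic levels with visible pores and natural "
    "variation. Preserve composition, people, products, and poses exactly."
)

def generate_ugc_image(scene_prompt, reference_urls):
    """Pass 1: multi-reference composite. Pass 2: naturalization."""
    import fal_client  # assumed SDK (pip install fal-client)

    # Pass 1: condition on up to nine references at once.
    composite = fal_client.subscribe(
        "fal-ai/flux-2-pro/edit",
        arguments={"prompt": scene_prompt, "image_urls": reference_urls},
    )
    raw_url = composite["images"][0]["url"]  # assumed response shape

    # Pass 2: Kontext edits the raw composite in place.
    refined = fal_client.subscribe(
        "fal-ai/flux-pro/kontext",
        arguments={"prompt": NATURALIZE_PROMPT, "image_url": raw_url},
    )
    return refined["images"][0]["url"]
```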
Color Theme as a Prompt Constraint
The third dimension of this update is subtler. Brand color palettes extracted during onboarding have always been stored in brand memory. What we weren't doing was actively driving the image generation toward those colors.
We now inject the brand's primary and accent colors as hex codes directly into the image prompt. The instruction is explicit: these colors must appear in background and props. Not as a vague stylistic preference, but as a concrete constraint that the prompt carries into the generation call.
Diffusion models have a surprisingly nuanced understanding of color. Specifying hex values in prompt text doesn't map to exact color reproduction (the model is not a paint-by-numbers system), but it significantly shifts the probability mass toward the target palette. The result is that generated scenes feel more on-brand across a calendar of posts, not because every image uses exactly the right shade, but because the color vocabulary is consistent.
This is already handled by the brand theme system built into our prompt templates. The change here was behavioral: ensuring the hex values were surfaced from brand memory and passed through for every generation call rather than only when a brand theme had been explicitly set.
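The injection itself is a small prompt transformation. A sketch, assuming brand memory hands us hex strings; the function name is illustrative:

```python
def inject_brand_colors(base_prompt, primary_hex, accent_hexes):
    """Append the brand palette to the image prompt as an explicit constraint."""
    palette = ", ".join([primary_hex, *accent_hexes])
    return (
        f"{base_prompt} Brand colors {palette} must appear "
        "in the background and props."
    )
```

Given a scene prompt like "A cozy kitchen scene." and a palette of `#E85D2F` with accent `#1B3A4B`, the constraint rides along in the same string the generation call already receives, so no API change is needed.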
What This Costs
The new pipeline for a UGC post runs two Fal AI calls: one flux-2-pro/edit for the initial composite, and one flux-pro/kontext for the naturalization pass. Together, these cost approximately eight cents per image under normal conditions.
If the face consistency gate or product fidelity gate returns a failing score, a face refinement call using Kontext's multi-reference model adds another eight cents. The worst-case cost per image (initial composite plus naturalization plus refinement) is sixteen cents. That's the same as our previous two-pass Kontext pipeline, but for a meaningfully higher fidelity ceiling.
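The arithmetic, using the approximate per-call figures above (rounded, and subject to Fal's pricing):

```python
COMPOSITE_COST = 0.04    # flux-2-pro/edit initial composite (~4 cents)
NATURALIZE_COST = 0.04   # flux-pro/kontext naturalization pass (~4 cents)
REFINEMENT_COST = 0.08   # Kontext multi-reference face refinement (~8 cents)

def cost_per_image(needs_refinement: bool) -> float:
    """Approximate Fal AI spend for one UGC image, in dollars."""
    base = COMPOSITE_COST + NATURALIZE_COST  # every image gets both passes
    return base + (REFINEMENT_COST if needs_refinement else 0.0)
```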
Looking Ahead
The nine-reference limit is a property of the current flux-2-pro/edit model. Future models in the Flux family will likely raise or change this limit, and the architecture we've built scales naturally to accommodate more references when they become available.
More immediately, we're watching whether brands with more complex product lines (multiple SKUs, accessories, component-level customization) benefit disproportionately from the additional product reference slots. Early signals suggest they do: the more visual surface area a product has, the more value each additional reference image provides.
The direction is clear. The closer we can bring the reference inputs to the full visual reality of the brand and its products, the more accurately the generated content represents what the brand actually looks like, and the less time brand teams spend requesting revisions.