One Model, Three References, Zero Adapter Stacking
Two Separate Problems
The images looked generated for two reasons that turned out to need two separate fixes.
The first was in the prompt. The instructions we were giving the model were generic: a film stock that added warmth everywhere regardless of scene, lighting described as "natural window lighting" for every environment, camera angles passed as labels ("medium shot") rather than directives, and an expression instruction that said "authentic smile, not posed" while producing the same camera-facing smile in every post.
The second was in the model architecture. Even when the prompt was specific, the output had a ceiling imposed by how we were maintaining consistency. Three IP-Adapters (face, product, style) injected simultaneously into the cross-attention layers. They competed. The model satisfied all three constraints by reducing variance everywhere. Less creative freedom, more assembled look.
Both problems needed to be solved. We fixed the prompt first. Then we replaced the adapter stack.
Fixing the Prompt
Film stock. Every prompt specified Kodak Portra 400, a film stock beloved in portrait photography for its warm midtones and saturated shadows. Those qualities are beautiful at golden hour on a hilltop. They're wrong inside a warung, a gym, or a fashion boutique, where the actual light is fluorescent, harsh, or flat overcast. We switched to Kodak Ektar 100, built for neutral color reproduction. No warm cast. The model now reads the scene's actual light and renders it honestly.
Scene lighting. "Natural window lighting" appeared in every prompt regardless of setting. We added a lighting field to every scene template, matching the actual light of each environment. Food scenes now specify "mixed overhead fluorescent and warm incandescent accent, green-white color temperature 4000K, hard shadows under chin and arms." Fitness specifies "dramatic overhead gymnasium lights, high contrast, deep shadows in muscle definition, harsh specular highlights on sweating skin." Fashion specifies "bright overcast outdoor daylight, soft even diffusion, natural color temperature 6500K." The model gets a photographic brief instead of a default.
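A minimal sketch of what that lighting field can look like in a scene template. The `SceneTemplate` shape and the ids are illustrative assumptions, not our production schema; the lighting strings are the ones quoted above.

```typescript
// Hypothetical scene-template shape; field names are illustrative.
interface SceneTemplate {
  id: string;
  setting: string;
  lighting: string; // a photographic brief, not a default
}

const sceneTemplates: SceneTemplate[] = [
  {
    id: "food-warung",
    setting: "busy warung interior",
    lighting:
      "mixed overhead fluorescent and warm incandescent accent, " +
      "green-white color temperature 4000K, hard shadows under chin and arms",
  },
  {
    id: "fitness-gym",
    setting: "commercial gym floor",
    lighting:
      "dramatic overhead gymnasium lights, high contrast, deep shadows " +
      "in muscle definition, harsh specular highlights on sweating skin",
  },
  {
    id: "fashion-outdoor",
    setting: "outdoor fashion scene",
    lighting:
      "bright overcast outdoor daylight, soft even diffusion, " +
      "natural color temperature 6500K",
  },
];
```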
Camera angle templates. Scene planners output labels like "medium shot" or "low angle." Labels tell the model almost nothing: "medium shot" conveys roughly the same information as "take a normal photo." We built eight angle templates, each a precise framing instruction. The medium shot template specifies head-to-waist framing, subject at 50–60% of frame height, rule-of-thirds positioning, background softly blurred at f/1.8–2.8. The low angle template specifies camera below the subject's knee level, tilted upward 20–40 degrees, strong perspective distortion on vertical lines. The flat lay specifies a perfectly perpendicular camera axis, zero perspective distortion, even shadowless overhead lighting. A label that told the model almost nothing now tells it exactly where to put the camera.
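A comparable sketch for the angle templates, showing three of the eight; the map shape and keys are assumptions, the framing text mirrors the directives above.

```typescript
// Hypothetical angle-template map; keys and shape are illustrative.
const angleTemplates: Record<string, string> = {
  medium_shot:
    "head-to-waist framing, subject at 50-60% of frame height, " +
    "rule-of-thirds positioning, background softly blurred at f/1.8-2.8",
  low_angle:
    "camera below the subject's knee level, tilted upward 20-40 degrees, " +
    "strong perspective distortion on vertical lines",
  flat_lay:
    "perfectly perpendicular overhead camera axis, zero perspective " +
    "distortion, even shadowless overhead lighting",
  // ...five more templates in the same style
};

// The scene planner's label now resolves to a full framing directive.
const framing = angleTemplates["medium_shot"];
```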
Expression. Every UGC prompt instructed the model to show "an authentic asymmetrical smile, not posed." The result was the same pleasant, camera-aware expression in every post: warung scenes, gym scenes, skincare scenes, all of them. Smiling at the camera is by definition posed. We replaced it with "genuine caught-in-action expression – mid-laugh, mid-reaction, or natural focus – NOT a posed smile to camera." The model now understands that the target is a moment being captured, not a portrait being staged.
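Taken together, the four prompt fixes compose into one photographic brief. A minimal assembly sketch, reusing the template shapes from the snippets above; `buildPrompt` and its parameters are hypothetical names, and the real prompt carries more fields than this.

```typescript
const FILM_STOCK = "Kodak Ektar 100, neutral color reproduction";
const EXPRESSION =
  "genuine caught-in-action expression - mid-laugh, mid-reaction, or " +
  "natural focus - NOT a posed smile to camera";

// Hypothetical assembly of the final generation prompt.
function buildPrompt(scene: SceneTemplate, angleLabel: string): string {
  return [
    scene.setting,
    `lighting: ${scene.lighting}`,
    `framing: ${angleTemplates[angleLabel]}`,
    `expression: ${EXPRESSION}`,
    `film stock: ${FILM_STOCK}`,
  ].join(". ");
}
```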
The Stacking Problem
After fixing the prompt, a second ceiling remained. Every UGC post needs three things to look right: the influencer's face, the product, and the campaign's visual style. We were passing all three as separate IP-Adapters into a single Flux generation call, with face, product, and style adapters all running simultaneously.
When you stack three IP-Adapters, each one injects its conditioning signal into the cross-attention layers. They don't cooperate; they compete. The face adapter wants the image to look like a specific person. The product adapter wants the image to include a specific object at a specific scale. The style adapter wants the colors and mood to match a reference. The model has to satisfy all three constraints at once, and the way it does that is by reducing variance everywhere. Less creative freedom, more "produced" look. The images were consistent, but they had a ceiling: a subtle assembled quality that no amount of prompt engineering could push through.
The cost also reflected this: fal-ai/flux-general with three adapters running at $0.075 per megapixel, billing at 2 megapixels for a 1080×1080 output, came to $0.15 per image. And it took about eleven seconds.
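For reference, the old call looked roughly like this. `fal.subscribe` is the real `@fal-ai/client` entry point, but the `ip_adapters` input shape and the scale values here are assumptions sketched for illustration, not the documented flux-general schema.

```typescript
import { fal } from "@fal-ai/client";

// Illustrative inputs; URLs are placeholders.
const prompt = "..."; // the full photographic brief from the section above
const influencerFaceUrl = "https://example.com/face.jpg";
const productImageUrl = "https://example.com/product.jpg";
const moodBoardRefs = ["https://example.com/mood-1.jpg"];

// The old path: three adapters injected into one flux-general call.
// The ip_adapters field shape is an assumption, not the documented schema.
const before = await fal.subscribe("fal-ai/flux-general", {
  input: {
    prompt,
    image_size: { width: 1080, height: 1080 },
    guidance_scale: 4.0,
    ip_adapters: [
      { image_url: influencerFaceUrl, scale: 0.8 }, // face identity
      { image_url: productImageUrl, scale: 0.7 },   // product accuracy
      { image_url: moodBoardRefs[0], scale: 0.5 },  // style / palette
    ],
  },
});
```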
What Kontext Does Differently
Flux Kontext doesn't inject adapters. It takes reference images as native inputs, the same way you or I look at a photograph to remember what someone looks like before describing them in words.
The model processes the reference images alongside the text prompt in a unified attention pass. There's no separate conditioning stream competing with generation. The face reference doesn't fight the product reference. The style reference doesn't fight either of them. The model interprets all three as context and generates accordingly, with full access to its photorealistic output range, not a reduced range constrained by adapter weights.
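The replacement call, as a sketch reusing the placeholder inputs from the previous snippet: references go in as plain image URLs. The endpoint id matches the table below; the `image_urls` input field name is an assumption about fal's Kontext multi endpoint.

```typescript
// Native multi-reference conditioning: references are plain inputs.
// Input field names are an assumption, not the documented schema.
const after = await fal.subscribe("fal-ai/flux-pro/kontext/max/multi", {
  input: {
    prompt,
    image_urls: [influencerFaceUrl, productImageUrl, ...moodBoardRefs],
  },
});
```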
The result reads differently in the output. The images have the organic variation that signals a real photograph: slight differences in how light falls, natural background complexity, expressions that look mid-moment rather than positioned. The consistency is still there: you still recognize the influencer, the product is still present, the visual palette still fits the brand. But it doesn't look assembled.
Reference Order Is a Design Decision
Kontext's documentation doesn't specify how multiple references are weighted relative to each other, but ordering matters in practice. We pass references in a different order depending on what needs to be most consistent.
For UGC and carousel posts: face first, then product, then style, i.e. `[influencerFaceUrl, productImageUrl, ...moodBoardRefs]`. The influencer's face is the most important constraint. A slightly off-brand color palette is recoverable. An influencer who doesn't look like themselves breaks the content.
For product posts: product first, then style, i.e. `[productImageUrl, ...moodBoardRefs]`. There's no face to anchor; the product is the primary subject. Placing it first biases Kontext's attention toward accurate product rendering, with the moodboard providing context for the scene aesthetic.
For influencer DNA beyond the face (personality, content style, visual energy), the prompt carries it. `influencerDNA.personalityTraits` and `contentStyle` are injected into every generation prompt, so the model gets explicit text guidance on who this person is in addition to what they look like.
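A sketch of how the ordering decision can be encoded. The post-type names and the DNA fields mirror the prose; the helpers themselves are hypothetical.

```typescript
// Mirrors the DNA fields named above; shape and helpers are hypothetical.
interface InfluencerDNA {
  personalityTraits: string[];
  contentStyle: string;
}

type PostType = "ugc" | "carousel" | "product";

// Reference order encodes what must stay most consistent.
function buildReferences(
  postType: PostType,
  faceUrl: string | null,
  productUrl: string,
  moodBoardRefs: string[],
): string[] {
  if (postType === "product" || faceUrl === null) {
    // No face to anchor: product first, moodboard supplies the aesthetic.
    return [productUrl, ...moodBoardRefs];
  }
  // UGC and carousels: the face is the hardest constraint, so it leads.
  return [faceUrl, productUrl, ...moodBoardRefs];
}

// Personality and content style travel in the prompt, not the references.
function dnaPromptFragment(dna: InfluencerDNA): string {
  return `creator personality: ${dna.personalityTraits.join(", ")}; ` +
    `content style: ${dna.contentStyle}`;
}
```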
The Numbers
| Post type | Before | After | Cost | Time |
|---|---|---|---|---|
| UGC (service) | flux-general, 3 adapters | kontext/max/multi | $0.15 → $0.08 | 11s → 2s |
| UGC (physical product) | PuLID + composite | kontext/max/multi | $0.15 → $0.08 | 11s → 2s |
| Carousel cover | flux-pulid | kontext | $0.035 → $0.04 | 4s → 2s |
| Product post | IP-Adapter fallback | kontext (product-first) | $0.15 → $0.08 | 11s → 2s |
The carousel cover is marginally more expensive, $0.04 versus $0.035, but the photorealism improvement and speed gain justify it.
For physical product posts, the hybrid path (AI-generated background with real product cutout composited on top) remains the primary strategy. Kontext steps in as the fallback if compositing fails, giving a strong second option instead of the IP-Adapter path.
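The fallback wiring, as a sketch under the same endpoint assumptions as above; `compositeProductCutout` stands in for the real compositing step and is a hypothetical name, and the output shape is assumed from fal's usual `images[].url` convention.

```typescript
import { fal } from "@fal-ai/client";

// Hypothetical stand-in for the real cutout-compositing step.
declare function compositeProductCutout(
  prompt: string,
  productImageUrl: string,
): Promise<string>;

// Hybrid-first: real product cutout composited onto a generated background;
// Kontext (product-first references) runs only if compositing fails.
async function generateProductPost(
  prompt: string,
  productImageUrl: string,
  moodBoardRefs: string[],
): Promise<string> {
  try {
    return await compositeProductCutout(prompt, productImageUrl);
  } catch {
    const result = await fal.subscribe("fal-ai/flux-pro/kontext/max/multi", {
      input: { prompt, image_urls: [productImageUrl, ...moodBoardRefs] },
    });
    // Output shape assumed: fal image endpoints typically return images[].url.
    return (result.data as { images: { url: string }[] }).images[0].url;
  }
}
```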
Guidance Scale
Two numeric parameters were quietly working against us. PuLID was running at guidance scale 3.5: high fidelity, but the model followed instructions so precisely that there was no room for the organic variation that makes faces look like photographs. The multi-IP-Adapter composite was running at 4.0, even higher.
We lowered PuLID guidance to 2.8 and composite guidance to 3.0. The model now treats instructions as strong suggestions rather than rigid specifications. The influencer still looks like the reference. The product is still visible. The aesthetic is still present. But the rendering has space to breathe: to add the slight randomness in lighting, skin texture, and composition that signals a real photograph rather than a render.
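The change itself is two numbers, passed through as each endpoint's guidance input; the `guidance_scale` parameter name is assumed to match fal's conventions.

```typescript
// Before: instructions followed so precisely that variance disappeared.
// After: strong suggestions, not rigid specifications.
const GUIDANCE = {
  pulid: { before: 3.5, after: 2.8 },
  composite: { before: 4.0, after: 3.0 },
};

// e.g. { input: { guidance_scale: GUIDANCE.pulid.after, ... } }
```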
Compounding Small Fixes
None of these changes individually produces a dramatic transformation. A neutral film stock, scene-appropriate lighting, exact framing instructions, a different expression directive, lower guidance values, and native multi-reference conditioning: each sounds incremental. But photorealism is multiplicative. Every small imperfection compounds with every other. Removing six sources of artificiality simultaneously produces an image that reads as genuinely more real than any one fix would suggest.
The underlying lesson: the best results come from giving the model accurate, specific context and then getting out of the way. Not longer prompts, but specific ones. Not more adapters, but the right architecture. The model is capable. The work is in the setup.