Why Our UGC Posts Had the Same Background (And How We Fixed It)
The Symptoms
Batches of UGC posts were coming out wrong in two distinct ways.
First: every post in a batch shared the same background, or backgrounds so similar that they read as one. Coffee shop in post one. Coffee shop in post two. Coffee shop in post three. The influencer's face varied slightly between images, but the scene was nearly identical across the entire content plan.
Second: sequential Kontext passes (one to swap the face, one to add the product) were producing posts with extra arms, mismatched hand counts, and body boundaries that looked stitched from two different people. The body seam between passes was often visible.
Both problems had the same root cause.
The Root Cause: Kontext Multi Is Not an Image Editor
We were calling `fal-ai/flux-pro/kontext/max/multi` with multiple `image_urls` (the influencer face reference, the product reference, and the scene reference) and treating the first image in the list as the canvas that the model would modify.
That's not what the model does.
The distinction matters enormously:
- `fal-ai/flux-pro/kontext` (single): A true image-to-image editor. The `image_url` parameter is the canvas. The model reads the prompt and modifies the canvas image in place. Background, scene, and composition are preserved by default.
- `fal-ai/flux-pro/kontext/max/multi`: A multi-reference generator. All entries in `image_urls` are equal conditioning inputs. There is no canvas. The model generates a new image that satisfies the semantic constraints of all inputs simultaneously.
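The difference is easiest to see in the request shapes. A minimal sketch, assuming the payloads mirror the parameter names described above (the URLs and prompt wording are placeholders, not production values):

```typescript
// Kontext single: one canvas image that the prompt edits in place.
const singleEditRequest = {
  // endpoint: fal-ai/flux-pro/kontext
  prompt: "Replace the model's face with the reference face; keep everything else.",
  image_url: "https://example.com/base-scene.png", // the canvas, preserved by default
};

// Kontext multi: every entry is an equal conditioning input; there is no canvas.
const multiGenerateRequest = {
  // endpoint: fal-ai/flux-pro/kontext/max/multi
  prompt: "Influencer holding the product in a coffee shop",
  image_urls: [
    "https://example.com/face-ref.png",    // constant across the batch
    "https://example.com/product-ref.png", // constant across the batch
    "https://example.com/scene-ref.png",   // just one constraint among three
  ],
};
```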
Because the face reference and product reference were constant across every post in a batch, multi had the same set of conditioning inputs for every generation call. With nothing differentiating the scenes except the stochastic sampling seed, the model converged on similar compositions. The diverse backgrounds that Claude's scene planning generated were being thrown away: the Kontext multi call never received them as editable canvases, only as additional constraints to satisfy alongside face and product.
The sequential pass architecture made things worse. Using the output of one Kontext call as the input for a second Kontext call (first for the face, then for the product) is effectively asking the model to re-compose the scene twice. Each re-composition pass can alter body geometry. Two passes doubled the opportunity for limb artifacts to appear.
The New Architecture
The fix required routing each post type to the model that actually matches its generation semantics.
| Post type | Pipeline |
|---|---|
| Carousel | flux-2-pro base → Kontext single (caption-driven refinement) |
| Product | flux-2-pro background → real product cutout composited (zero hallucination) |
| UGC Service | flux-2-pro base → Kontext single face swap |
| UGC Physical | flux-2-pro base → Kontext multi (face + product component) |
Each path has a specific reason for its model choice.
Carousel posts use flux-2-pro for base generation: the scene is driven entirely by the prompt, so diversity comes from the prompt planner, not from reference image variation. Kontext single then refines the output using the caption as an instruction, editing the canvas image. One pass, no re-composition.
Physical product posts use the hybrid cutout path: AI-generated background with no product in the scene, then a real product photo composited in using Sharp. The product in the final image is literally the client's product photo. No generation pass can hallucinate the label or get the packaging wrong. flux-2-pro is the fallback if the cutout composite fails quality checks.
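The placement math that feeds the Sharp composite can be sketched as follows. This is illustrative: `productPlacement` and the ratios are assumptions, not the production code; the real pipeline would pass the resulting `{ left, top }` (plus the resized cutout) to Sharp's `composite()`.

```typescript
// Compute where to place the product cutout on the generated background:
// horizontally centered, sitting on the lower-third line of the frame.
function productPlacement(
  bgW: number, bgH: number,
  productW: number, productH: number,
  targetWidthRatio = 0.35, // product spans ~35% of the background width (assumed)
) {
  const scale = (bgW * targetWidthRatio) / productW;
  const width = Math.round(productW * scale);
  const height = Math.round(productH * scale);
  return {
    width,
    height,
    left: Math.round((bgW - width) / 2),         // horizontally centered
    top: Math.round(bgH * (2 / 3) - height / 2), // centered on lower-third line
  };
}
```

Because the cutout is a real photo, no diffusion step ever touches the label or packaging; only the placement is computed.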
UGC service posts (no physical product) use flux-2-pro for the base scene, with the influencer activity described in the text prompt. Kontext single then does a face swap: the flux output is the canvas, and only the face reference is provided. One Kontext pass, no product reference. The same-background bug is avoided here because (1) stage 1 is a dedicated flux-2-pro generation with a diverse scene plan per post, and (2) the Kontext face-swap prompt explicitly anchors to image 1 as the canvas, preserving whatever background flux generated.
UGC physical posts use flux-2-pro for the base scene with the product described in the text prompt, then a single Kontext multi pass with both the face reference and the product reference, three images total: canvas, face, product. The replacement prompt instructs Kontext to preserve the canvas background and body, swap only the face from image 2, and replace only the held product with image 3. One pass handles both substitutions. Limb artifact risk drops to near zero because there is no second re-composition pass to alter body geometry. Background diversity is preserved for the same reason as the service path: flux-2-pro generates the scene first, and Kontext edits within it rather than replacing it.
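The one-pass request can be sketched as a builder that enforces the ordering convention above (canvas first, then references). `buildUgcPhysicalRequest` and the prompt wording are illustrative, not the production code:

```typescript
// Build the single Kontext multi request for a UGC physical post.
function buildUgcPhysicalRequest(
  canvasUrl: string,   // flux-2-pro output: diverse scene with placeholder face/product
  faceRefUrl: string,  // constant influencer face reference
  productRefUrl: string, // constant product reference
) {
  return {
    prompt: [
      "Keep the background, body, pose, and lighting of image 1 exactly as they are.",
      "Replace only the face with the face from image 2.",
      "Replace only the held product with the product from image 3.",
    ].join(" "),
    image_urls: [canvasUrl, faceRefUrl, productRefUrl],
  };
}
```

The constant references come last; the per-post canvas always occupies slot 1, which is what preserves background diversity across the batch.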
Quality gates (text detection, realism score ≥ 3.5, face consistency score) still run after every generation path. The gates are unchanged; the inputs to them are better.
Photorealism Enforcement
All prompt builders now include camera and lighting anchors:
```
Shot on Sony A7 IV, 85mm f/1.4 lens, natural available light.
Real skin texture with visible pores, natural facial asymmetry.
NOT CGI, NOT illustrated, NOT digital art.
```
These tokens are not decorative. Diffusion models are trained on text-image pairs, and the word distribution around terms like "Sony A7 IV" or "85mm f/1.4" correlates heavily with real photographs in training data. When these tokens appear in the prompt, they pull the generation toward the statistical neighborhood of real photography rather than the neighborhood of rendered or illustrated images. The effect is measurable in realism gate pass rates, which improved after we added them.
The negation tokens (NOT CGI, NOT illustrated) serve a different function. Diffusion models don't process negation the way a logical system would, but placing these tokens in the prompt shifts attention weights away from the illustrated/rendered subspace, particularly in the early diffusion steps where global composition and material properties are established.
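Appending the anchors can be a one-line concern in the prompt builders. A minimal sketch; `withPhotorealismAnchors` is a hypothetical helper name, not the production function:

```typescript
// The camera/lighting anchors and negation tokens from the text above.
const PHOTOREALISM_ANCHORS = [
  "Shot on Sony A7 IV, 85mm f/1.4 lens, natural available light.",
  "Real skin texture with visible pores, natural facial asymmetry.",
  "NOT CGI, NOT illustrated, NOT digital art.",
];

// Append the anchors to every scene prompt so no builder can forget them.
function withPhotorealismAnchors(scenePrompt: string): string {
  return [scenePrompt.trim(), ...PHOTOREALISM_ANCHORS].join(" ");
}
```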
What the Future Looks Like
The current architecture is correct but not optimal. There are four directions worth pursuing, each with honest tradeoffs.
LoRA Fine-Tuning Per Influencer
Train a small LoRA (approximately 500 steps) on each influencer's photo set. Face identity gets baked into the model weights. No reference image overhead. No Kontext face-swap pass at all.
Cost: $2–5 per influencer, one time. Ongoing cost per image: zero additional. Face accuracy is the best achievable: the model has seen this specific person's facial geometry during training, not just at inference time.
The downside is operational. Each influencer needs a training pipeline, 20–30 quality source photos with good lighting and angle variety, and the LoRA needs to be re-trained if the influencer changes their appearance significantly. For high-volume influencers who generate hundreds of posts, the unit economics are excellent. For one-off campaigns, the setup overhead may not be worth it.
One hard constraint: LoRA fine-tuning applies only to fal-ai/flux/dev. It does not apply to flux-2-pro or to the Kontext models; both are closed API endpoints that do not accept custom weights. The current UGC pipeline uses flux-2-pro for Stage 1 and Kontext for Stage 2. A LoRA-based face pipeline would require switching Stage 1 to flux/dev with the influencer LoRA loaded, and skipping the Kontext face-swap pass entirely. That is the highest-accuracy path for face consistency at scale, but it requires a separate base model and a per-influencer training run.
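A sketch of what a LoRA-based Stage 1 request could look like, assuming a LoRA-enabled flux/dev endpoint that accepts a `loras` array of `{ path, scale }` entries (the common shape on fal's LoRA endpoints — this shape is an assumption, as are the URL and scale):

```typescript
// Stage 1 with a per-influencer LoRA: face identity comes from the
// trained weights, so no face reference image and no Kontext face-swap
// pass are needed afterwards.
function buildLoraStage1Request(scenePrompt: string, loraUrl: string) {
  return {
    prompt: scenePrompt,
    loras: [{ path: loraUrl, scale: 1.0 }], // scale 1.0 is illustrative
  };
}
```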
Flux Fill Inpainting for Face Swaps
Instead of using Kontext for face swaps, detect the face bounding box, create a tight mask, and inpaint only the face region with fal-ai/flux-pro/fill. Pixels outside the mask are mathematically untouched; there is no re-composition of the scene. Background, body, hands, and clothing are identical in the input and output images, not regenerated to match.
This eliminates limb artifacts by construction: the model cannot change the arms because they lie outside the mask. Background diversity is preserved because the fill pass never regenerates the background pixels; they pass through unchanged.
The limitation: Flux Fill inpainting can produce visible edge seams around the mask boundary, particularly if the original face lighting doesn't match the inpainted face. Getting the mask boundary and feathering right requires care. And because only the face region is regenerated, lighting integration between the new face and the existing scene relies entirely on the inpainting model's ability to match ambient conditions, something a Kontext edit across the full image handles more naturally.
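The mask geometry can be sketched as follows, assuming a face detector has already returned a bounding box. `expandAndClampBox` and the 15% margin are illustrative values, not a tuned implementation:

```typescript
interface Box { x: number; y: number; width: number; height: number }

// Pad the detected face box so the inpaint can blend hairline and jaw
// edges (feathering happens over this margin), then clamp to image bounds.
function expandAndClampBox(face: Box, imgW: number, imgH: number, marginPct = 0.15): Box {
  const mx = face.width * marginPct;
  const my = face.height * marginPct;
  const x = Math.max(0, face.x - mx);
  const y = Math.max(0, face.y - my);
  return {
    x,
    y,
    width: Math.min(imgW, face.x + face.width + mx) - x,
    height: Math.min(imgH, face.y + face.height + my) - y,
  };
}
```

Too small a margin and the seam cuts through the hairline; too large and the pass starts regenerating neck and shoulders, reintroducing the artifacts the mask was meant to prevent.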
Dedicated Face Swap Model
Purpose-built face swap models like fal-ai/inswapper run in milliseconds and cost under $0.01 per image. They are not diffusion-based. They work by extracting facial landmarks and identity embeddings from the source face and transplanting them onto the target image with lighting normalization.
The speed and cost are compelling. The quality ceiling is lower than diffusion-based approaches โ lighting integration is approximate, and faces from very different lighting conditions can look pasted rather than photographed. For high-volume social content where per-image costs compound quickly, it's worth benchmarking against the current Kontext face-swap pass.
Scene Diversity via Structured Archetypes
The current scene planner asks Claude to generate diverse scene descriptions and relies on the language model's variation to prevent repetition. This works most of the time and fails occasionally, which is exactly the wrong failure mode for batch content generation.
A deterministic alternative: define a fixed set of scene archetypes (golden hour outdoor, moody interior, flat lay, macro close-up, urban street, minimal studio, nature setting, workspace) and rotate through them deterministically when generating a batch. With at least as many archetypes as posts, no two posts in a batch share a setting archetype; with fewer, repeats are spaced as far apart as possible. Either way, the guarantee is structural and does not rely on stochastic language model behavior.
The tradeoff is inflexibility. Fixed archetypes can conflict with specific brand identities or campaign briefs. The solution is to configure which archetypes are in rotation per client rather than using a global set.
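The rotation itself is a few lines. A sketch using the archetype list from the text; `assignArchetypes` and the per-batch offset are illustrative:

```typescript
const SCENE_ARCHETYPES = [
  "golden hour outdoor", "moody interior", "flat lay", "macro close-up",
  "urban street", "minimal studio", "nature setting", "workspace",
];

// Round-robin assignment: an archetype repeats only after every other
// archetype has been used. batchOffset varies the starting point so
// consecutive batches don't always open with the same scene.
function assignArchetypes(postCount: number, batchOffset = 0): string[] {
  return Array.from({ length: postCount }, (_, i) =>
    SCENE_ARCHETYPES[(batchOffset + i) % SCENE_ARCHETYPES.length],
  );
}
```

Per-client configuration then reduces to swapping out the `SCENE_ARCHETYPES` array rather than changing the rotation logic.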
Cost Comparison
| | Old pipeline | New pipeline |
|---|---|---|
| UGC service post | Kontext multi ×2 sequential: ~$0.16 | flux-2-pro + Kontext single: ~$0.08 |
| UGC physical post | Kontext multi ×2 sequential: ~$0.16 | flux-2-pro + Kontext multi (1 pass): ~$0.08 |
| Artifact rate | Spliced bodies, multiple arms | None observed |
| Background diversity | Near-identical across batch | Varies per scene prompt |
The cost reduction is a side effect of using the right model for each task. Kontext multi is an expensive model that was being used incorrectly. A single Kontext pass costs half of two sequential passes and produces better body geometry because re-composition only happens once. For physical posts, the same single Kontext multi call now handles both substitutions (face and product) rather than requiring a second sequential pass.
The background diversity fix is not about cost; it is structural. When the scene generation model receives a diverse prompt and produces a diverse output, and that output is not subsequently discarded and replaced by a multi-reference convergence call, diversity is preserved. The system now works with the model's properties instead of against them.