Why Every AI Influencer Was Smiling Wrong
The Problem Was in the Film Stock
Here's a test. Look at two photos side by side. One is a real photograph shot inside a warung at seven in the morning: the fluorescent tubes cast a greenish-white light, hard shadows fall under the chin, a single warm bulb glows behind the counter. The other is AI-generated. Same scene, same composition. But the AI version has a warm amber glow, as if the photographer had wrapped a golden filter around everything.
Nobody chose that filter. It was embedded in the film stock.
Our image prompts specified Kodak Portra 400, a film stock beloved by portrait photographers for its characteristic warm midtones and saturated shadows. Those qualities work beautifully in a wedding shoot on a golden-hour hilltop. They're a disaster inside a warung, a gym, or a fashion boutique, where the actual light is fluorescent, harsh, or overcast grey. The model read "Portra 400" and applied warmth to everything, regardless of where the scene was set.
The fix was simple once we saw it: switch to Kodak Ektar 100, a film stock built for accurate neutral color reproduction. No warm cast, no saturated shadows, no cinematic amber glow that doesn't exist in the real environment. The model now reads the scene's actual light and renders it honestly.
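As a minimal sketch, assuming the film stock lives in a reusable prompt fragment: `FILM_STOCK` and `build_base_prompt` are illustrative names, not our pipeline's actual identifiers.

```python
# A sketch of the film-stock swap; names here are illustrative.

# Before: warm portrait stock baked into every scene.
# FILM_STOCK = "shot on Kodak Portra 400, warm midtones, saturated shadows"

# After: neutral stock that defers to the scene's own light.
FILM_STOCK = "shot on Kodak Ektar 100, accurate neutral color reproduction"

def build_base_prompt(scene_description: str) -> str:
    """Append the film-stock fragment to the scene description."""
    return f"{scene_description}, {FILM_STOCK}"
```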
Scenes Have Light. Prompts Didn't.
Changing the film stock removed the universal warm bias, but it exposed a deeper gap: our prompts weren't telling the model what kind of light the scene actually had.
"Natural window lighting" appeared in the camera configuration for every UGC post regardless of setting. Flux interpreted this as a soft, warm window light box โ the default indoor photography setup. A man eating at a warung got the same light as someone applying skincare at a vanity. An athlete mid-deadlift in a gym got the same soft fill as someone reading on a couch.
We added a lighting field to every scene template in the pipeline. The food category now specifies "mixed overhead fluorescent and warm incandescent accent, green-white color temperature 4000K, hard shadows under chin and arms." Fitness specifies "dramatic overhead gymnasium lights, high contrast, deep shadows in muscle definition, harsh specular highlights on sweating skin." Fashion specifies "bright overcast outdoor daylight, soft even diffusion, natural color temperature 6500K." Each lighting description is injected directly into the prompt alongside the scene, so the model gets a coherent picture of the environment rather than a generic camera configuration bolted onto an unrelated scene.
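In sketch form, assuming the lighting briefs are keyed by content category; the `SCENE_LIGHTING` table and `inject_lighting` helper are illustrative stand-ins for the pipeline's real structure, with the brief text taken from the examples above.

```python
# Sketch of scene-keyed lighting injection; names are illustrative.
SCENE_LIGHTING = {
    "food": (
        "mixed overhead fluorescent and warm incandescent accent, "
        "green-white color temperature 4000K, hard shadows under chin and arms"
    ),
    "fitness": (
        "dramatic overhead gymnasium lights, high contrast, deep shadows in "
        "muscle definition, harsh specular highlights on sweating skin"
    ),
    "fashion": (
        "bright overcast outdoor daylight, soft even diffusion, "
        "natural color temperature 6500K"
    ),
}

def inject_lighting(prompt: str, category: str) -> str:
    """Replace the old generic 'natural window lighting' default with a
    lighting brief that matches the scene's actual environment."""
    lighting = SCENE_LIGHTING.get(category, "ambient light true to the scene")
    return f"{prompt}, lighting: {lighting}"
```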
The gym image that prompted this work, the athlete straining under a loaded barbell, was a real photograph. What made it look real wasn't just the muscle definition or the sweat. It was the lighting: top-down, high-contrast, unforgiving. The kind of light a gym actually has. We needed our fitness scenes to have that same light, and now they do.
Framing Is a Directive, Not a Label
Every content calendar we generate includes diverse camera angles across posts: a macro close-up for one post, a wide environmental shot for another, an over-shoulder for a third. The scene planner, powered by Claude, produces labels like "medium shot" or "low angle" for each post. Those labels were being passed directly to the image generation prompt.
The problem is that "medium shot" is a label, not a directive. Flux doesn't have a photography textbook. "Medium shot" conveys roughly the same information as saying "take a normal photo." The model fills in the rest arbitrarily.
We built a library of eight camera angle templates, each one a precise photographic instruction. The medium shot template specifies head-to-waist framing with the subject occupying fifty to sixty percent of frame height, positioned at rule-of-thirds, with the background identifiable but softly blurred at f/1.8–2.8. The low angle template specifies camera below the subject's knee level, tilted upward twenty to forty degrees, with the subject appearing dominant against the sky or ceiling and strong perspective distortion on vertical lines. The flat lay template specifies camera axis perfectly perpendicular to the surface, zero perspective distortion, no horizon visible, even shadowless overhead lighting.
These templates are matched against the scene planner's output using fuzzy word matching (so "wide environmental shot" and "wide shot" both resolve to the correct template) and injected into the prompt at a position where the model is still paying close attention. A label that told the model almost nothing now tells it exactly where to put the camera, what proportion of the frame the subject should occupy, how to handle depth of field, and what relationship to establish between foreground and background.
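A sketch of both halves, the template library and the fuzzy resolution: the template text paraphrases the specs above, the "wide shot" entry is an invented example, and a simple word-overlap matcher stands in for whatever fuzzy matching the pipeline actually uses.

```python
# Four of the eight templates; 'wide shot' is an illustrative spec,
# and resolve_camera_template is a stand-in matcher.
CAMERA_TEMPLATES = {
    "medium shot": (
        "head-to-waist framing, subject occupying 50-60% of frame height at "
        "rule-of-thirds, background identifiable but softly blurred at f/1.8-2.8"
    ),
    "low angle": (
        "camera below the subject's knee level, tilted upward 20-40 degrees, "
        "subject dominant against sky or ceiling, strong perspective "
        "distortion on vertical lines"
    ),
    "wide shot": (  # illustrative spec, not quoted from our library
        "full body small in frame, surrounding environment dominant, "
        "deep focus front to back"
    ),
    "flat lay": (
        "camera axis perfectly perpendicular to the surface, zero perspective "
        "distortion, no horizon visible, even shadowless overhead lighting"
    ),
}

def resolve_camera_template(label: str) -> str:
    """Pick the template whose key shares the most words with the planner's
    label, so 'wide environmental shot' resolves to 'wide shot'."""
    label_words = set(label.lower().split())
    best = max(CAMERA_TEMPLATES, key=lambda k: len(label_words & set(k.split())))
    return CAMERA_TEMPLATES[best]
```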
The Smile Nobody Asked For
Every UGC prompt we generated instructed the model to show "an authentic asymmetrical smile, not posed." The result was that every AI influencer, in every post, was smiling directly at the camera. Warung scenes, gym scenes, skincare application scenes: all of them showed a person frozen in the same pleasant, camera-aware expression.
Real photographs don't look like that. The athlete in the gym isn't smiling; they're grimacing. The person trying food at a warung reacts with their eyes before their mouth. The person adjusting an outfit in a boutique is focused, not performing.
The instruction "authentic smile, not posed" was self-contradictory. Smiling at the camera is by definition posed. We replaced it with "genuine caught-in-action expression, mid-laugh, mid-reaction, or natural focus, NOT a posed smile to camera." The shift is semantic but consequential: the model now understands that the target is a moment being captured, not a portrait being staged.
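As a sketch, the change is a one-line swap in the expression directive; the constant name is illustrative.

```python
# Before: self-contradictory, and it produced the same camera-aware
# smile in every post.
# EXPRESSION = "an authentic asymmetrical smile, not posed"

# After: the target is a captured moment, not a staged portrait.
EXPRESSION = (
    "genuine caught-in-action expression, mid-laugh, mid-reaction, or "
    "natural focus, NOT a posed smile to camera"
)
```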
Guidance as a Dial, Not a Setting
Two numeric parameters were quietly working against us. The PuLID face consistency model was running at guidance scale 3.5, a setting that tells the model to follow instructions with high fidelity. High guidance produces crisp, faithful outputs, but it also produces outputs that look generated. The model follows instructions so precisely that there's no room for the organic variation that makes faces look like photographs.
Similarly, our multi-IP-adapter composite generation was running at guidance scale 4.0, higher still. Every composite with an influencer face plus a product reference plus a moodboard style reference was being rendered at maximum instruction-following intensity. The result was images that felt assembled rather than photographed.
We lowered PuLID guidance to 2.8 and composite guidance to 3.0. The model now treats instructions as strong suggestions rather than rigid specifications. The influencer still looks like the reference. The product is still visible. The moodboard aesthetic is still present. But the rendering has space to breathe: to add the slight randomness in lighting, skin texture, and composition that signals a real photograph rather than a render.
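As configuration, the change is two numbers; the key names below are illustrative, not the pipeline's actual parameter names, but the values are the ones described above.

```python
# Guidance scales before and after; lower values trade strict
# instruction-following for the organic variation that reads as
# photographic rather than rendered.
GUIDANCE = {
    "pulid_face": 2.8,          # was 3.5
    "multi_ip_composite": 3.0,  # was 4.0
}
```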
Compounding Small Fixes
None of these changes individually produces a dramatic transformation. A neutral film stock, scene-appropriate lighting, exact framing instructions, a different expression directive, and slightly lower guidance values: each sounds incremental. But photorealism is multiplicative. Every small imperfection compounds with every other. And conversely, removing five small sources of artificiality simultaneously produces an image that reads as genuinely more real than any one fix would suggest.
The underlying lesson is that AI image generation responds to specificity. Not longer prompts โ specific ones. The difference between "natural window lighting" and "mixed overhead fluorescent and warm incandescent accent, green-white color temperature 4000K, hard shadows under chin and arms" isn't word count. It's the difference between a generic instruction and a photographic brief. The model is capable of following the brief. The work is in writing it.
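Put together, the brief-writing reduces to assembling the pieces above. This final sketch reuses the illustrative helpers from the earlier snippets, so the same caveat applies: these are stand-ins, not our pipeline's real API.

```python
def build_prompt(scene: str, category: str, angle_label: str) -> str:
    """Assemble the full photographic brief: camera template first (where
    the model attends most), then scene, film stock, lighting, expression."""
    camera = resolve_camera_template(angle_label)
    prompt = inject_lighting(build_base_prompt(scene), category)
    return f"{camera}. {prompt}, {EXPRESSION}"

print(build_prompt(
    "a man eating at a warung counter at seven in the morning",
    "food",
    "medium shot",
))
```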