
Why We Stopped Pasting Products and Let AI Do the Holding

Jiwa AI Team

The Floating Product Problem

For months, our UGC pipeline worked like this: generate an AI scene of an influencer with their hands positioned as if holding something, then paste a product cutout on top. The product photo was real, the scene was AI-generated, and the result was... obvious.

Products floated slightly above hands. Lighting on the product didn't match the scene. Shadows fell at the wrong angles. The influencer's grip looked unnatural because they were never actually holding anything; we just placed a product image where we hoped their hands would be.

It worked well enough for product-on-background shots. But for UGC, where the entire point is an influencer authentically using a product, the artificiality undermined everything.

The Compositing Complexity Tax

The old approach wasn't just visually flawed; it was architecturally complex. We maintained a library of expression-to-placement mappings: if the influencer was "eating," place the product at one position and scale; if "pointing," shift it to another. Over thirty-five different expression types, each with hand-tuned coordinates.
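
To make that concrete, here's a minimal sketch of what such a mapping table might look like. Every name and number below is illustrative, not our actual configuration:

```python
# Hypothetical sketch of an expression-to-placement table; all
# field names and values are illustrative, not production config.
from dataclasses import dataclass

@dataclass
class Placement:
    x_frac: float    # horizontal anchor, as a fraction of image width
    y_frac: float    # vertical anchor, as a fraction of image height
    scale: float     # product height relative to scene height
    rotation: float  # degrees, roughly matching the hand angle

# Imagine over thirty-five entries like these, each hand-tuned:
EXPRESSION_PLACEMENTS = {
    "eating":   Placement(x_frac=0.48, y_frac=0.62, scale=0.22, rotation=-8.0),
    "pointing": Placement(x_frac=0.66, y_frac=0.55, scale=0.18, rotation=12.0),
    "waving":   Placement(x_frac=0.52, y_frac=0.70, scale=0.25, rotation=0.0),
    # ...one entry per supported expression type.
}
```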

We categorized influencers into scene types, extracted cutouts, matched lighting between layers, added procedural drop shadows, and blended edges. Five different composition strategies tried to handle every combination of influencer type, expression, and product placement.
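
For a flavor of that machinery, here's a heavily simplified Pillow sketch of the paste-and-shadow step. It's an illustration under assumed inputs (an RGBA product cutout plus a placement from a table like the one above), not our production pipeline, and the lighting-matching step is omitted entirely:

```python
# Simplified illustration of the old compositing step (not production code).
from PIL import Image, ImageFilter

def composite_product(scene: Image.Image, cutout: Image.Image,
                      x_frac: float, y_frac: float,
                      scale: float, rotation: float) -> Image.Image:
    """Paste a real product cutout into an AI-generated scene."""
    scene = scene.convert("RGBA")
    cutout = cutout.convert("RGBA")

    # Scale the cutout relative to scene height, per the placement table.
    target_h = max(1, int(scene.height * scale))
    ratio = target_h / cutout.height
    cutout = cutout.resize((max(1, int(cutout.width * ratio)), target_h))
    cutout = cutout.rotate(rotation, expand=True)

    # Anchor the cutout where we hope the hands are.
    x = int(scene.width * x_frac - cutout.width / 2)
    y = int(scene.height * y_frac - cutout.height / 2)

    # Procedural drop shadow: take the cutout's silhouette at half
    # opacity, offset it, and blur it so it reads as a soft shadow.
    silhouette = Image.new("RGBA", cutout.size, (0, 0, 0, 255))
    silhouette.putalpha(cutout.getchannel("A").point(lambda a: a // 2))
    shadow_layer = Image.new("RGBA", scene.size, (0, 0, 0, 0))
    shadow_layer.paste(silhouette, (x + 6, y + 10), silhouette)
    shadow_layer = shadow_layer.filter(ImageFilter.GaussianBlur(8))

    out = Image.alpha_composite(scene, shadow_layer)
    out.paste(cutout, (x, y), cutout)  # blend the "real" product on top
    return out
```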

All of this machinery existed to solve one problem: the AI couldn't render products accurately. So we kept the product real and composited it into AI scenes. It was clever engineering, but it was fighting the wrong battle.

The Insight: We Already Know What Products Look Like

The turning point came when we looked at what data we already had. During onboarding, our vision analysis system extracts detailed product DNA: what the product looks like, its shape, its packaging colors, and, crucially, how a person naturally holds it.

"Person holds the slim protein bar in one hand, packaging visible with branding facing camera." That's not a vague description โ€” it's a specific enough instruction for an AI image generator to work with.

We had been telling the AI "do NOT render any product" and then pasting one on afterward. What if we just told the AI what the product looked like and let it generate the whole scene (influencer, product, and interaction) in one shot?

Letting the AI Do What AI Does Best

The new approach is dramatically simpler. Instead of five composition strategies, we have one for UGC: tell the AI to generate the influencer naturally holding and interacting with the product, using the detailed product descriptions we already extract during onboarding.

The prompt now includes the product's visual description, its shape, and how it should be held. Our face consistency technology ensures the influencer looks right, and our scene template system provides the authentic setting.

One generation call. No compositing. No expression mappings. No cutout extraction. No lighting matching. No shadow generation.
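
In code, the whole UGC path now reduces to assembling one prompt and making one call. A minimal sketch, where build_ugc_prompt() and generate_image() are hypothetical stand-ins for our internal API:

```python
# Minimal sketch of the single-pass approach; names are illustrative.

def build_ugc_prompt(visual_description: str, shape: str,
                     holding_instruction: str,
                     influencer_desc: str, scene_desc: str) -> str:
    """Fold the extracted product DNA into one generation prompt."""
    return (
        f"{influencer_desc}, in {scene_desc}. "
        f"They are naturally holding and interacting with the product: "
        f"{visual_description}; {shape}. {holding_instruction}"
    )

prompt = build_ugc_prompt(
    visual_description="matte black wrapper with a centered gold logo",
    shape="slim rectangular bar, roughly palm-length",
    holding_instruction=("Person holds the slim protein bar in one hand, "
                         "packaging visible with branding facing camera."),
    influencer_desc="Young woman with a warm, natural smile",
    scene_desc="a bright home kitchen in soft morning light",
)
# image = generate_image(prompt)  # one call; nothing to composite afterward
```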

The Results: Less Code, Better Images

We removed over four hundred and fifty lines of compositing logic. Five composition strategies became three. The thirty-five expression-to-placement mappings? Gone entirely. The influencer categorization system that separated "scene" influencers from "clean background" influencers? Unnecessary.

But the real win is visual quality. When an AI generates an influencer holding a product in a single pass, the lighting is consistent, the grip looks natural, and the product integrates into the scene the way a real product would. There's no uncanny boundary between "real product" and "AI scene" because there's no boundary at all.

Trade-offs We Accepted

This approach isn't without compromise. The composited product cutout was pixel-perfect: the exact product photo, every logo and color accurate. AI-generated products are close but not identical. The shape and color will be right, but fine packaging details might vary.

For UGC content, this trade-off makes sense. Nobody scrutinizes packaging text in an influencer's hand. What they notice is whether the interaction looks natural, and that's where the new approach wins decisively.

Simpler Systems Scale Better

The deeper lesson is about architectural simplicity. The old system was an impressive feat of engineering: image segmentation, anchor point detection, lighting analysis, multi-layer compositing. But it was complex because it was compensating for a limitation we could work around differently.

Sometimes the best refactor isn't making the existing system better. It's stepping back and asking whether the system needs to exist at all.