Fidelity First: Why We Moved Product Truth to the Start of the Pipeline
The Blind Inspector Problem
For a while, our quality gate was doing something subtly broken: it was judging whether a generated image faithfully depicted a real product without ever being shown the real product.
The inspector had a name. It had a description. It knew the color palette. But it had never seen the actual packaging. So when it evaluated a generated post featuring a biscuit brand, it was doing the equivalent of a food critic reviewing a meal they'd only read about on a menu. Confident-sounding, but fundamentally ungrounded.
This turned out to be the wrong framing entirely. The real problem wasn't that our quality check was misconfigured. It was that we were trying to catch fidelity failures at the end of a pipeline that had been given no reason to preserve fidelity in the first place.
The Downstream Trap
The conventional instinct in AI pipeline design is to generate first and evaluate later. You build a quality gate that catches bad outputs, retries the failures, and eventually ships the ones that pass. It's a sound approach for many quality dimensions: visual artifacts, composition, caption tone. These are properties that emerge from generation and can only be assessed after the fact.
Product fidelity is different. It's not an emergent property. It's a constraint. And constraints belong at the beginning of a process, not the end.
When we asked our caption model to write about a physical product, it had access to a name, a description, and a list of keywords. What it didn't have was a clear statement of what it absolutely could not do. Without that boundary, the model did what large language models do with underspecified tasks: it extrapolated. A biscuit became a "mid-workout energy source." A skincare cream gained properties lifted from similar products in training data. The captions weren't malicious hallucinations; they were just confident completions in the absence of constraints.
Priming as a First-Order Obligation
Our fix was conceptually simple: tell the model, before anything else, what kind of task it's actually doing.
For batches that include physical products, the generation prompt now opens with a direct statement of obligation. Not "try to be accurate" buried in a list of scoring criteria, but "before you write a single word, understand that these are real physical products with real packaging, and every claim you make must be grounded in the data you've been given." It names the specific failure modes (inventing sensory attributes, making performance claims not in the product description, reframing the product's identity) and frames them as non-negotiable constraints rather than nice-to-haves.
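As a rough sketch of how such a preamble might be assembled (the helper name, field names, and exact wording here are illustrative, not our production prompt):

```python
# Hypothetical sketch: prepend a fidelity contract to the caption prompt.
FIDELITY_PREAMBLE = """\
Before you write a single word, understand: these are real physical
products with real packaging. Every claim you make must be grounded
in the product data below. You must NOT:
- invent sensory attributes (taste, scent, texture) not in the data
- make performance claims absent from the product description
- reframe the product's identity
These constraints are non-negotiable, not scoring suggestions.
"""

def build_caption_prompt(product: dict, brief: str) -> str:
    """Open with the obligation, then the grounded product data."""
    product_block = (
        f"Product name: {product['name']}\n"
        f"Documented description: {product['description']}\n"
        f"Keywords: {', '.join(product['keywords'])}"
    )
    return f"{FIDELITY_PREAMBLE}\n{product_block}\n\nTask: {brief}"
```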
The same principle applies on the image side. Product descriptions from visual analysis (actual documented appearance notes, not inferred characteristics) are now surfaced explicitly alongside the product name. The model knows what the product looks like, not just what it's called.
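The image-side surfacing is a small change; a minimal sketch, assuming a `visual_notes` field populated by the visual-analysis step (the field and function names are ours, for illustration):

```python
# Hypothetical sketch: surface documented appearance notes (from visual
# analysis of the real photo) next to the name in the image prompt.
def build_image_prompt(product: dict, scene: str) -> str:
    appearance = product.get("visual_notes")  # e.g. "red foil wrapper"
    if appearance:
        subject = f"{product['name']} ({appearance})"
    else:
        subject = product["name"]  # degrade: name only, no invented details
    return f"{scene}, featuring {subject}"
```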
Teaching the Inspector to Look
With generation primed for fidelity, we could finally make the quality inspector actually useful.
The key change was providing the reference. When our vision-based quality check evaluates a generated image featuring a physical product, it now receives two images: the generated post and the real product photograph, side by side. It's no longer comparing a generated image against a mental model assembled from a name and some keywords. It's doing a direct visual comparison.
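A minimal sketch of that two-image call, assuming an OpenAI-style vision endpoint; the model name, prompt wording, and helpers are illustrative, and our production inspector differs in detail:

```python
import base64
from openai import OpenAI  # assumption: OpenAI-style vision endpoint

client = OpenAI()

def _data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def check_fidelity(generated_path: str, reference_path: str) -> str:
    """Score the generated post against the real product photo."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Image 1 is a generated post; image 2 is the real "
                    "product photo. Compare packaging shape, dominant "
                    "colors, and branding. Report any mismatches."
                )},
                {"type": "image_url",
                 "image_url": {"url": _data_url(generated_path)}},
                {"type": "image_url",
                 "image_url": {"url": _data_url(reference_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```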
The scoring prompt reflects this shift. For physical products with a reference available, fidelity becomes the primary evaluation axis, weighted more heavily than naturalness and more heavily than caption coherence. If the generated product shows the wrong packaging shape, wrong dominant colors, or hallucinated branding, that post fails regardless of how aesthetically pleasing the overall image is. A beautiful scene featuring the wrong product is useless.
For posts where a reference image isn't available, the scoring degrades gracefully: we evaluate plausibility against the documented description rather than against ground truth. For service businesses with no physical product at all, fidelity scoring is bypassed entirely. The system adapts to what it actually knows.
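Putting the last two paragraphs together, the scoring logic might be sketched like this; the weights, thresholds, and axis names are hypothetical, not our production values:

```python
# Illustrative aggregation over per-axis scores in the 0-10 range.
def score_post(scores: dict, has_reference: bool, has_product: bool) -> float:
    if not has_product:
        # Service business: fidelity scoring bypassed entirely.
        weights = {"naturalness": 0.5, "caption_coherence": 0.5}
    elif has_reference:
        # Direct comparison against ground truth dominates.
        if scores["fidelity"] < 5:   # wrong packaging/colors/branding
            return 0.0               # hard fail, aesthetics irrelevant
        weights = {"fidelity": 0.6, "naturalness": 0.25,
                   "caption_coherence": 0.15}
    else:
        # No reference: plausibility against the documented description.
        weights = {"plausibility": 0.5, "naturalness": 0.3,
                   "caption_coherence": 0.2}
    return sum(scores[axis] * w for axis, w in weights.items())
```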
Cleaner Conditioning Signals
There was a parallel problem on the influencer side of the pipeline. We were using reference photos of influencers to condition image generation, the idea being that the AI would learn the influencer's face and style from examples and apply them to new scenes. But those reference photos came with backgrounds. A studio backdrop, a street scene, a restaurant interior.
When you feed an image with a busy background as a conditioning signal, the model picks up everything: the person and the scene. That scene leaks into generations. Subtle artifacts appear. Colors shift. The conditioning becomes about the full image rather than the subject.
The fix was to strip those backgrounds before the reference image ever reaches the generation step. We now run influencer reference photos through a portrait segmentation model that produces a clean subject-only cutout. The background is gone. The conditioning signal is purely the person.
When the primary segmentation model struggles โ sharp hair edges, unusual lighting conditions โ a secondary approach using bounding-box segmentation takes over. The results are cached so the same influencer photo is only processed once across all posts that use it.
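A minimal sketch of that cutout step, using the open-source rembg library as a stand-in for the primary segmentation model; the quality heuristic and the `bbox_segment` fallback are illustrative assumptions, not our production code:

```python
import hashlib
import io
from pathlib import Path

from PIL import Image
from rembg import remove  # stand-in for the primary segmentation model

CACHE = Path("cutout_cache")
CACHE.mkdir(exist_ok=True)

def bbox_segment(photo: bytes) -> bytes:
    """Hypothetical secondary path: segment inside a detected person box."""
    raise NotImplementedError  # placeholder for the fallback model

def looks_clean(cutout: bytes) -> bool:
    """Crude proxy for matte quality: did we keep a plausible subject area?"""
    alpha = Image.open(io.BytesIO(cutout)).convert("RGBA").getchannel("A")
    coverage = sum(1 for a in alpha.getdata() if a > 0)
    coverage /= alpha.width * alpha.height
    return 0.05 < coverage < 0.95  # neither wiped out nor untouched

def subject_cutout(photo: bytes) -> bytes:
    """Background-free cutout, cached so each photo is processed once."""
    key = hashlib.sha256(photo).hexdigest()
    cached = CACHE / f"{key}.png"
    if cached.exists():
        return cached.read_bytes()
    cutout = remove(photo)               # primary model
    if not looks_clean(cutout):
        cutout = bbox_segment(photo)     # fallback on hard cases
    cached.write_bytes(cutout)
    return cutout
```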
Fidelity as a System Property
The deeper lesson here is about where constraints belong in an AI pipeline.
It's tempting to think of quality evaluation as a separate, downstream concern, something you do after generation to filter good outputs from bad. And for some properties, that's right. But for constraints that should never be violated in the first place, evaluation after the fact is too late. By the time you're scoring outputs, the model has already inferred what kind of task it's doing and generated accordingly.
Physical product fidelity is a constraint. Influencer identity is a constraint. The right place for constraints is before generation starts โ in the prompt, in the conditioning signal, in the model's understanding of what success means. Evaluation can reinforce that, but it can't substitute for it.
When you prime the model correctly, you don't need the quality gate to do as much heavy lifting. And when the gate does find failures, it's catching genuine edge cases rather than systematic drift from an underspecified task.
Jiwa AI generates AI-powered influencer content for brands across Southeast Asia. If you're building products that require accurate, hallucination-free representation of real physical goods, we'd be glad to share what we've learned.