Tags: ai, engineering, machine-learning, content-generation

When 40% Actually Means 12%: Fixing Hidden Bias in AI Scoring

Jiwa AI Team

The Number That Looked Right

For months, our quality gate scored every generated image against a set of weighted criteria. At the top of that list, product fidelity was designated the primary evaluation axis with a 40% weight. That was intentional. For a retail brand, whether the product in a generated image actually looks like their product is non-negotiable. Everything else is secondary.

The problem was that this 40% never quite matched what we saw in practice. Images with subtle product inaccuracies were passing the gate. The fidelity criterion existed. The weight was correct in the scoring rubric. But the outputs told a different story.

Two Layers of Blending

The issue was not a bug. It was arithmetic that looked correct in isolation but compounded in a way nobody had traced end-to-end.

Our final quality score was a blend of two sub-scores: a caption score and a visual score. We used a 70/30 split; captions carry more weight because they are what drive engagement on Indonesian social media platforms, where our primary market operates.

Within the visual score, product fidelity held that 40% weight.

What our critique cycle surfaced was the multiplication: 40% of the visual score, times the 30% visual weight in the final blend, equals 12% effective influence on the final number that determined whether a post passed or failed. Not 40%. Twelve.
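To see the compounding concretely, here is a minimal sketch of the old blend. Only the 70/30 split and the 40% fidelity weight come from the rubric described above; the remaining visual criteria, their weights, and all names are illustrative stand-ins, not our production code.

```python
# Minimal sketch of the old composite score. Only the 70/30 split and the
# 40% fidelity weight come from the rubric; everything else is illustrative.
CAPTION_WEIGHT = 0.70
VISUAL_WEIGHT = 0.30

VISUAL_CRITERIA_WEIGHTS = {
    "product_fidelity": 0.40,  # the stated "primary evaluation axis"
    "composition": 0.35,       # hypothetical remaining criteria
    "brand_style": 0.25,
}

def final_score(caption_score: float, visual_scores: dict[str, float]) -> float:
    """Blend the caption sub-score and the weighted visual sub-score."""
    visual_score = sum(w * visual_scores[name]
                       for name, w in VISUAL_CRITERIA_WEIGHTS.items())
    return CAPTION_WEIGHT * caption_score + VISUAL_WEIGHT * visual_score

# Fidelity's effective influence on the final number:
# 0.40 * 0.30 = 0.12 -> 12%, not the 40% the rubric implies.
effective_fidelity_weight = VISUAL_CRITERIA_WEIGHTS["product_fidelity"] * VISUAL_WEIGHT
```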

The designation "primary evaluation axis" was aspirational. The math was doing something else entirely.

Why This Goes Unnoticed

Composite scoring systems hide this kind of drift naturally. Each layer looks reasonable. A 70/30 caption-to-visual split makes sense. A 40% fidelity weight within visual makes sense. The problem only appears when you trace the full chain and ask: what does this criterion actually need to score to move the final number?

For a physical product image with a serious fidelity failure (wrong packaging, wrong color, invented features), the caption score could be excellent, the remaining visual criteria could be average, and the post would still pass. The gate was never actually testing what it claimed to test.
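Running a hypothetical worst case through the final_score sketch above makes the gap visible. The scores and the 0.75 passing threshold are made up for illustration:

```python
# Hypothetical worst case under the old 70/30 blend: fidelity fails hard,
# everything else is merely good. All values, including the threshold,
# are illustrative.
PASS_THRESHOLD = 0.75

caption_score = 0.90           # excellent caption
visual_scores = {
    "product_fidelity": 0.20,  # wrong packaging, wrong color
    "composition": 0.75,
    "brand_style": 0.75,
}

score = final_score(caption_score, visual_scores)
# visual = 0.40*0.20 + 0.35*0.75 + 0.25*0.75 = 0.53
# final  = 0.70*0.90 + 0.30*0.53 = 0.789 -> above the 0.75 threshold
print(score >= PASS_THRESHOLD)  # True: the gate waves the bad image through
```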

This is a specific failure mode in multi-layer scoring: local correctness, systemic blindness. Every team member who reviewed the scoring rubric saw a sensible set of weights. Nobody had drawn the full dependency graph on a whiteboard.

Fixing the Blend, Not the Rubric

The fix we chose was targeted. Rather than restructuring the entire scoring system, which would have required recalibrating every criterion and updating all the thresholds, we adjusted the blend for one specific case.

For posts featuring physical products, the caption-to-visual split now uses a 55/45 ratio instead of 70/30. The fidelity criterion inside the visual score stays at 40%. The multiplication now produces an 18% effective weight (0.40 × 0.45) instead of 12%, bringing the math meaningfully closer to the stated intent.

For service businesses (cafes, salons, fitness studios), the original blend holds. Those posts have no physical product to verify, so fidelity is correctly subordinated to caption quality and aesthetic coherence.
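A sketch of the adjusted blend, reusing the illustrative weights from above; only the split and the post-type condition change:

```python
# Sketch of the targeted fix: the caption/visual split now depends on whether
# the post features a physical product. Names remain illustrative.
def blend_weights(has_physical_product: bool) -> tuple[float, float]:
    """Return (caption_weight, visual_weight) for the final blend."""
    if has_physical_product:
        return 0.55, 0.45  # fidelity's effective weight: 0.40 * 0.45 = 18%
    return 0.70, 0.30      # service businesses keep the original blend

def final_score(caption_score: float, visual_scores: dict[str, float],
                has_physical_product: bool) -> float:
    visual_score = sum(w * visual_scores[name]
                       for name, w in VISUAL_CRITERIA_WEIGHTS.items())
    caption_w, visual_w = blend_weights(has_physical_product)
    return caption_w * caption_score + visual_w * visual_score
```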

The change is small by line count. The impact is that the scoring system now actually enforces what it says it enforces for the clients where it matters most.

Enforcing What You Measure

The broader lesson this surfaced was about the relationship between stated priorities and implemented behavior. Our critique cycle identified this by asking a simple question: if I intentionally generate an image with the wrong product, what score does it receive? When the answer came back above the passing threshold, we knew the rubric and the implementation had diverged.

This kind of audit, deliberately violating a constraint and observing whether the gate catches it, is now part of how we validate scoring changes. A criterion that cannot be violated is not really a criterion.
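In practice the audit can be as small as a deliberately broken fixture plus an assertion that the gate rejects it. A pytest-style sketch, using the same illustrative scores and threshold as above:

```python
# Red-team check: score an image as if the product were wrong, then assert
# the gate fails it. Fixture values and the threshold are stand-ins.
PASS_THRESHOLD = 0.75

def test_wrong_product_fails_the_gate():
    caption_score = 0.90                  # the caption can be perfect
    visual_scores = {
        "product_fidelity": 0.20,         # intentionally wrong product
        "composition": 0.75,
        "brand_style": 0.75,
    }
    score = final_score(caption_score, visual_scores, has_physical_product=True)
    # Under the 55/45 blend: 0.55*0.90 + 0.45*0.53 = 0.7335 < 0.75 -> rejected.
    assert score < PASS_THRESHOLD, "fidelity criterion is not actually enforced"
```

With the old 70/30 blend the same fixture scores roughly 0.79 and the test fails, which is exactly the divergence the audit exposed.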

Looking Ahead

The same audit revealed two other places where rotation logic was more aspirational than enforced: content angle diversity in calendar generation, and visual format distribution when certain post types were dropped. Both have been tightened with explicit per-slot assignment rather than prose instructions that AI models interpret loosely.
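For the calendar case, "explicit per-slot assignment" amounts to deciding the rotation in code before the model is ever prompted, rather than asking for variety in prose. A minimal sketch; the angle names and the round-robin policy are illustrative assumptions:

```python
# Assign a content angle to every calendar slot up front, so diversity is
# enforced by construction rather than by prompt wording. Angle names are
# illustrative.
from itertools import cycle

CONTENT_ANGLES = ["educational", "promotional", "behind_the_scenes", "social_proof"]

def assign_angles(num_slots: int) -> list[str]:
    """Deterministically rotate content angles across calendar slots."""
    rotation = cycle(CONTENT_ANGLES)
    return [next(rotation) for _ in range(num_slots)]

# assign_angles(6) -> ['educational', 'promotional', 'behind_the_scenes',
#                      'social_proof', 'educational', 'promotional']
```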

Composite AI systems accumulate these gaps over time. Each component looks correct. The emergent behavior drifts. Regular end-to-end audits, not component tests but tests that trace intent through to outcome, are the only way to catch what the unit tests cannot see.