
Why We Score Our Own AI's Work (And Why It's Not Enough)

Jiwa AI Team

The Self-Evaluation Problem

Here is a fundamental tension in AI content pipelines: you need quality control, but the only thing fast enough to evaluate AI output at scale is another AI. Hiring humans to review every generated post before delivery would destroy the cost structure that makes the product viable. So we built an AI scorer that evaluates every post our AI generates.

Yes, this is circular: a model from the same family scoring output from its sibling. The research literature suggests self-evaluation correlates with human judgment only moderately at best. We know this. We use it anyway, because the alternative, no quality gate at all, is worse.

Three Layers of Scoring

Our quality gate isn't a single number from a single prompt. It blends three independent evaluation dimensions.

The first layer scores captions. A separate AI call evaluates each post's text on five weighted criteria: whether the product placement feels authentic rather than forced, whether the caption matches the brand's documented voice, whether it sounds like the assigned influencer, whether hashtags and calls to action are Instagram-optimized, and whether any forbidden keywords appear. This produces a zero-to-hundred score per post.
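As a sketch, a weighted five-criterion rubric like this can be collapsed into a single zero-to-hundred number. The criterion names and weights below are illustrative assumptions, not Jiwa's actual values:

```python
# Hypothetical weights for the five caption criteria; the real weights
# are not published, so these numbers are illustrative assumptions.
CAPTION_WEIGHTS = {
    "placement_authenticity": 0.30,
    "brand_voice_match": 0.25,
    "influencer_voice_match": 0.20,
    "instagram_optimization": 0.15,
    "forbidden_keywords": 0.10,  # pass/fail: 100 if clean, 0 if any appear
}

def caption_score(criteria: dict[str, float]) -> float:
    """Collapse per-criterion scores (each 0-100) into one 0-100 score."""
    return sum(CAPTION_WEIGHTS[name] * criteria[name] for name in CAPTION_WEIGHTS)
```

Keeping the forbidden-keyword check as a pass/fail criterion inside the same rubric (rather than a separate hard gate) is one of the design choices such a scorer has to make.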

The second layer scores images. Each generated image gets a vision-based review checking for AI artifacts (distorted faces, extra fingers, hallucinated text) plus brand color adherence, product visibility, and composition quality. This runs in parallel with the caption scoring, one cheap vision call per image.
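A minimal sketch of how such a vision review might be aggregated, assuming the hard artifact checks subtract a flat penalty from the softer quality scores. The penalty size and parameter names are invented for illustration:

```python
def image_score(artifacts_found: list[str],
                brand_color_adherence: float,
                product_visibility: float,
                composition: float) -> float:
    """Aggregate one image's vision review into a 0-100 score (sketch)."""
    # Soft qualities (each 0-100) are averaged equally -- an assumption.
    base = (brand_color_adherence + product_visibility + composition) / 3
    # Each detected artifact (distorted face, extra fingers, hallucinated
    # text) takes a flat penalty; 25 points is an assumed value.
    penalty = 25 * len(artifacts_found)
    return max(0.0, base - penalty)
```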

The third layer is DNA evaluation. This compares the combined image and caption against the brand's documented identity (personality traits, voice guidelines, sample phrases) and the influencer's documented style. It catches subtle misalignment: a caption that's technically good but doesn't sound like the assigned influencer, or an image that's well-composed but uses the wrong color mood.

The final score blends all three: sixty percent caption quality and thirty percent visual quality, where the caption component is itself a blend of the raw caption score and the DNA evaluation at a sixty-forty ratio.
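Read literally, that blend could look like the sketch below. Two assumptions are baked in: that "a forty-sixty ratio with the caption score" means forty percent DNA against sixty percent caption, and that the final weights are normalized so a perfect post still scores one hundred (as stated, the raw weights sum to 0.9):

```python
def final_score(caption: float, visual: float, dna: float) -> float:
    """Blend the three layer scores (each 0-100) into one 0-100 score."""
    # DNA evaluation mixed into the caption score at a 40/60 ratio
    # (an assumption about which side of the ratio is which).
    caption_component = 0.6 * caption + 0.4 * dna
    # 60% caption component, 30% visual quality, normalized so the
    # weights sum to one (an assumption; as stated they sum to 0.9).
    return (0.6 * caption_component + 0.3 * visual) / 0.9
```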

What Happens Below the Line

Posts scoring below fifty-five get their captions automatically regenerated. The system feeds back the original caption, the quality score, and the specific criticism, then asks for an improved version. The improved caption gets re-scored to prevent artificial inflation.

Posts still below fifty after caption retry get their images regenerated. The system replays the same generation strategy (same model, same reference images) but with enhanced negative prompts targeting the specific defects identified in the visual review. If the retry scores higher, it replaces the original.

Posts that remain below forty after both retries get flagged for human review. The system has admitted it can't fix this one automatically.
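Putting the three thresholds together, the retry cascade might be sketched like this; `regenerate_caption` and `regenerate_image` are hypothetical stand-ins for the real generation calls:

```python
# Sketch of the three-stage quality gate with the thresholds from the
# post (55, 50, 40). `post` is a plain dict here for illustration.
def quality_gate(post, score_fn, regenerate_caption, regenerate_image):
    score = score_fn(post)
    if score < 55:
        # Feed back the original caption plus its score, then re-score
        # the improved version to prevent artificial inflation.
        post = regenerate_caption(post, score)
        score = score_fn(post)
    if score < 50:
        # Same generation strategy, enhanced negative prompts.
        retry = regenerate_image(post, score)
        retry_score = score_fn(retry)
        if retry_score > score:  # keep the retry only if it scores higher
            post, score = retry, retry_score
    if score < 40:
        # Both retries failed: escalate to a human.
        post["needs_human_review"] = True
    return post, score
```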

Where the Scores Are Meaningless

Our scores predict engagement about as well as a coin flip. We know this because we built the correlation infrastructure โ€” tracking which quality scores correspond to which engagement rates after publishing. The correlation is effectively zero, and the reason is embarrassing: we never stored the quality scores on the post records. The correlation function was using a hardcoded placeholder value for every single post.

This is the kind of bug that reveals a deeper truth. We built a sophisticated multi-dimensional quality gate, but we never closed the feedback loop. The scores influenced which posts got regenerated, but we had no way to learn whether our scoring criteria actually predicted what performs well on Instagram.
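The bug has a clean statistical signature: Pearson correlation divides by each series' standard deviation, and a series of identical placeholder scores has zero standard deviation, so the coefficient is undefined rather than merely low. A small sketch:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation; None when either series has zero variance."""
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    if sx == 0 or sy == 0:
        return None  # a constant series carries no signal to correlate
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

# The bug described above: a hardcoded placeholder score for every post.
scores = [75.0] * 6                             # illustrative placeholder
engagement = [0.8, 2.1, 1.4, 3.0, 0.5, 1.9]     # made-up engagement rates
pearson(scores, engagement)                      # → None
```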

What Would Make It Real

The missing piece is human calibration data. When business owners approve or reject posts via WhatsApp, that's a quality signal we already capture. When posts get published and we track engagement rates, that's another signal. Connecting these signals back to the AI quality scores would tell us which scoring dimensions actually matter and which are noise.

We're also exploring cross-model evaluation: using a different AI provider to score output, breaking the self-evaluation loop. The theory is that systematic biases are model-specific, so a different model's biases won't align with the generator's biases in the same way.

For now, our quality gate catches the obvious failures: hallucinated text in images, captions that mention products the business doesn't sell, influencer voices that sound nothing like their documented style. It misses the subtle ones: posts that are technically correct but emotionally flat, images that are well-composed but forgettable.

The honest assessment is that AI quality scoring is necessary but insufficient. It's a safety net, not a quality standard. The standard still comes from human taste, and we haven't built the infrastructure to learn from it yet.