
When Your AI's Images and Captions Tell Different Stories

Jiwa AI Team

A Beautiful Image That Nobody Asked For

Imagine scrolling through an Instagram feed and seeing a gorgeous photo of someone at a rooftop bar at sunset. The colors are warm, the composition is perfect, the person looks natural. Then you read the caption: "Nothing beats a lazy Sunday morning with fresh cookies and coffee."

The image is stunning. The caption is engaging. Together, they make no sense.

This is the problem we discovered hiding in our content pipeline. Our AI was generating high-quality images and well-crafted captions, but nobody was checking whether they told the same story.

The Evaluation Blind Spot

Our quality scoring system was thorough in many ways. It checked whether captions matched the brand voice. It verified that images looked natural and contained the right product. It even detected unwanted AI-generated text in images. But every check operated in isolation.

The caption scorer evaluated text quality without seeing the image. The visual scorer evaluated image quality without reading the caption. The DNA evaluator checked brand alignment for both, but never asked the fundamental question: do these two pieces of content actually belong together?

In practice, this meant a post could score eighty out of a hundred while the image depicted an entirely different scenario from what the caption described. The individual pieces were good. The combination was incoherent.

Teaching the Scorer to Read and Look Simultaneously

The fix required rethinking how we evaluate visual quality. Instead of sending each image to our vision model with a generic quality checklist, we now inject the specific caption context alongside the image.

The scorer sees the caption, the overlay text, the product name, and the theme, then evaluates whether the image actually depicts what the words describe. If the caption talks about an afternoon snack break at the office but the image shows someone at the beach, that mismatch now produces a coherence score below forty, which triggers automatic regeneration.
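
In pseudocode, the change is small. This is an illustrative sketch, not our actual prompt template: the names `PostContext`, `build_scoring_prompt`, and `needs_regeneration` are hypothetical, but the idea is exactly this, interpolate the caption context into the vision model's instructions and gate on the resulting score.

```python
# Sketch: inject caption context into the visual scoring prompt.
# All names here are illustrative; 40 is the regeneration cutoff described above.
from dataclasses import dataclass

REGEN_THRESHOLD = 40  # coherence scores below this trigger a retry


@dataclass
class PostContext:
    caption: str
    overlay_text: str
    product_name: str
    theme: str


def build_scoring_prompt(ctx: PostContext) -> str:
    """Give the vision model the words it must check the image against."""
    return (
        "Evaluate whether this image depicts the scenario described below.\n"
        f"Caption: {ctx.caption}\n"
        f"Overlay text: {ctx.overlay_text}\n"
        f"Product: {ctx.product_name}\n"
        f"Theme: {ctx.theme}\n"
        "Score image-caption coherence from 0 to 100."
    )


def needs_regeneration(coherence_score: int) -> bool:
    """A below-threshold coherence score kicks off automatic regeneration."""
    return coherence_score < REGEN_THRESHOLD
```

The same vision call that was already scoring image quality now also answers the coherence question, which is why the marginal cost is near zero.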

We gave image-caption coherence twenty-five percent of the visual score weight, making it the second most important criterion after naturalness. This means a technically flawless image that does not match its caption will score poorly enough to be flagged and retried.
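
Concretely, the overall visual score is a weighted average over per-criterion scores. The 25% coherence weight is the real figure; the other weights and criterion names below are illustrative stand-ins for our actual rubric.

```python
# Sketch: weighted visual score. Only the 25% coherence weight is from the post;
# the remaining weights and criterion names are illustrative.
VISUAL_WEIGHTS = {
    "naturalness": 0.30,          # still the top criterion
    "coherence": 0.25,            # image-caption match, second most important
    "product_accuracy": 0.20,
    "composition": 0.15,
    "overlay_readability": 0.10,
}


def visual_score(criteria: dict[str, float]) -> float:
    """Weighted average over 0-100 per-criterion scores."""
    return sum(VISUAL_WEIGHTS[name] * criteria[name] for name in VISUAL_WEIGHTS)
```

With weights like these, an image that scores 100 on everything except a coherence score of 20 lands at 80 overall: good enough to show the penalty bites, low enough to flag against a strict pass bar.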

The Overlay Problem Nobody Talks About

While investigating evaluation gaps, we found another blind spot: text overlay readability. Our system was carefully analyzing images for safe text placement zones, applying gradient overlays for contrast, and auto-scaling font sizes. But after all that work, nobody checked whether the final result was actually readable.

The quality scorer had a line item for text readability buried at the bottom of its checklist with no explicit weight. An overlay placed over a busy background with insufficient contrast could pass scoring without anyone noticing.

We promoted overlay readability to an explicit criterion with its own weight. The scorer now specifically evaluates whether text stands out from its background, whether the gradient overlay provides sufficient contrast, and whether the font size is appropriate for the content. Images where the overlay is hard to read now score below fifty on that criterion, dragging down the overall score.
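
The flagging rule itself is simple. A minimal sketch, assuming the criterion scores arrive as a dict (the field name and helper are hypothetical; the fifty-point floor is the one described above):

```python
# Sketch: an explicit floor on the overlay-readability criterion.
# Helper and field name are illustrative; 50 is the cutoff described above.
READABILITY_FLOOR = 50


def overlay_needs_review(criteria: dict[str, float]) -> bool:
    """Flag a post whose overlay readability scored below the floor,
    even when its other criteria are strong."""
    return criteria.get("overlay_readability", 0.0) < READABILITY_FLOOR
```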

Diversity Is Not Just a Caption Problem

We also discovered that our diversity enforcement was one-sided. The system checked whether caption openings were varied across a batch (no two posts should start with the same words), but it did nothing to prevent visual repetition.

A batch of five posts could all feature the same camera angle, the same indoor setting, and the same lighting. Each image individually looked fine. Viewed as a content calendar, they looked monotonous.

We added a deterministic diversity check that runs after scene planning, before any images are generated. It flags duplicate camera angles and duplicate settings within a batch. This costs nothing: it is a simple comparison of the scene plan data we already generate. But it catches problems that would otherwise only surface when a human reviews the entire calendar side by side.
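
The whole check fits in a few lines. This sketch assumes scene plans are dicts with `camera_angle` and `setting` fields; the field names are illustrative, but the logic is the actual technique, count values per field and flag anything that repeats.

```python
# Sketch: deterministic diversity check over a batch of scene plans.
# Field names are illustrative; no model call is involved.
from collections import Counter


def diversity_flags(scene_plans: list[dict]) -> list[str]:
    """Flag duplicate camera angles and settings within a batch."""
    flags = []
    for field in ("camera_angle", "setting"):
        counts = Counter(plan[field] for plan in scene_plans)
        for value, n in counts.items():
            if n > 1:
                flags.append(f"duplicate {field}: '{value}' used {n} times")
    return flags
```

Because it runs before image generation, a flagged batch can be re-planned without wasting a single image-model call.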

Small Gates, Big Impact

None of these changes required new AI models or expensive infrastructure. They are small additions to existing evaluation prompts and lightweight deterministic checks. The total cost increase per post is effectively zero: we are asking the same vision model to check a few more things during the same call it was already making.

The lesson is that evaluation quality is not about adding more scoring calls. It is about making each call smarter by giving it the right context. A visual scorer that can see the caption catches problems that a visual scorer working in isolation never will.

We are continuing to strengthen these evaluation layers. Next on the list: automatically comparing our two caption variants and selecting the higher-scoring one, adjusting scoring weights by post type, and computing quantitative contrast ratios for text overlays rather than relying on subjective vision assessment.
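
The contrast-ratio item at least has an established definition to build on: WCAG 2.x specifies relative luminance and a contrast ratio between 1:1 and 21:1. A sketch of the deterministic core using that standard formula (sampling the actual overlay and background pixels is the harder part, not shown here):

```python
# Sketch: WCAG 2.x contrast ratio between two RGB colors.
# The formulas are from the WCAG spec; pixel sampling is out of scope.
def _srgb_to_linear(c: float) -> float:
    """sRGB channel value (0-1) to linear light, per WCAG 2.x."""
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_srgb_to_linear(v / 255) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio from 1:1 (identical) to 21:1 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)
```

A numeric ratio like this would let us set a hard floor (WCAG suggests 4.5:1 for normal text) instead of asking the vision model whether text "stands out".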

Every evaluation gap we close means fewer posts that need human review, and more content that is ready to publish the moment it is generated.