Why Six Captions in One Call Beats Twelve
The Obvious Approach Has a Hidden Cost
When you first think about generating captions for a content calendar, the logic seems straightforward: take one post, send it to the model, get a caption back. Repeat for every post in the calendar. Twelve posts, twelve calls, done.
It works. But it's slow, it's expensive, and, perhaps most surprisingly, it produces worse captions.
We learned this firsthand building Jiwa AI's caption pipeline, and fixing it taught us something useful about how language models actually behave when they can see the full picture.
What the Naive Approach Costs You
When each caption is generated in isolation, the model has no memory of what it just wrote. Every call starts with the same brand context: the influencer's voice, the product details, the brand tone, the target audience. That context is injected fresh with every single request, even though it doesn't change between posts.
For a six-post content block, you're paying for that system context six times over. The token overhead compounds quickly, and since LLM calls are network-bound, running them sequentially means your calendar generation time grows linearly with the number of posts.
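As a back-of-the-envelope illustration (the token counts below are made-up assumptions, not Jiwa AI's real numbers), the shared-context cost scales with the number of calls when generating sequentially, but is paid once when batched:

```python
# Hypothetical token counts, for illustration only.
SHARED_CONTEXT_TOKENS = 1_200   # brand voice, product details, audience (assumed)
PER_POST_TOKENS = 150           # one post scenario description (assumed)

def sequential_prompt_tokens(num_posts: int) -> int:
    # Every call re-sends the full shared context plus one scenario.
    return num_posts * (SHARED_CONTEXT_TOKENS + PER_POST_TOKENS)

def batched_prompt_tokens(num_posts: int) -> int:
    # One call: shared context once, then all scenarios together.
    return SHARED_CONTEXT_TOKENS + num_posts * PER_POST_TOKENS

print(sequential_prompt_tokens(6))  # 8100
print(batched_prompt_tokens(6))     # 2100
```

With these assumed numbers, batching cuts prompt tokens by roughly 74% for a six-post block, and the gap widens as the shared context grows relative to the per-post payload.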
In practice, we measured the caption phase taking roughly three times longer under the sequential approach than under our current batched architecture. That gap matters when a business owner is sitting on the onboarding page waiting for their content.
One Call, All Six Captions
The fix is conceptually simple: instead of sending one post scenario at a time, we send all of them together. The model receives the full set of post descriptions (influencer, product, theme, and scene) and returns a structured response containing all captions at once.
The system context is injected exactly once. The model processes all scenarios in a single forward pass. We get back a clean, parseable array of captions that maps directly onto our calendar slots.
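A minimal sketch of the pattern, assuming a model that returns a JSON array of strings (the function names and prompt schema here are illustrative, not Jiwa AI's actual code):

```python
import json

def build_batched_prompt(scenarios: list[dict]) -> str:
    """Assemble one prompt covering every post scenario in the block."""
    lines = ["Write one caption per scenario. Respond with a JSON array of strings, in order."]
    for i, s in enumerate(scenarios, start=1):
        lines.append(f"{i}. influencer={s['influencer']} product={s['product']} "
                     f"theme={s['theme']} scene={s['scene']}")
    return "\n".join(lines)

def parse_captions(raw: str, expected: int) -> list[str]:
    """Parse the structured response and check it maps onto the calendar slots."""
    captions = json.loads(raw)
    if not isinstance(captions, list) or len(captions) != expected:
        raise ValueError(f"expected {expected} captions, got {captions!r}")
    return captions

# Example with a stubbed model response instead of a real API call:
scenarios = [{"influencer": "A", "product": "tea",
              "theme": "morning", "scene": "kitchen"}] * 6
prompt = build_batched_prompt(scenarios)              # sent once, with the system context
raw = json.dumps([f"caption {i}" for i in range(6)])  # stand-in for the model's reply
captions = parse_captions(raw, expected=len(scenarios))
```

Because the response is a positional array, each caption lands in its calendar slot by index, with no fuzzy matching needed.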
This cuts the caption phase latency by around 70% and meaningfully reduces the total token spend per onboarding. At our scale, where every generation run has to be cost-efficient enough to serve small and mid-sized Southeast Asian businesses, that's not an optimization we could afford to skip.
An Unexpected Benefit: Better Diversity
Here's the part we didn't anticipate. When the model can see all six post scenarios simultaneously, it naturally avoids repeating itself.
In sequential generation, there's nothing stopping the model from opening three consecutive captions with the same energy, the same hook structure, or even near-identical phrasing. It has no awareness of what it already wrote. The result can feel formulaic: technically correct captions that collectively lack range.
Batched generation changes the dynamic. The model processes all six scenarios as a unit, and in doing so, it applies contrast organically. A playful caption for a post-workout scenario lands differently alongside a more aspirational caption for a lifestyle moment. The model finds that contrast on its own, without us having to explicitly instruct it to vary its output.
We think of this as in-context diversity: the captions are better because the model can see them all at once, the same way a copywriter would write a whole campaign rather than one ad at a time.
The Design Constraints It Introduces
Batching isn't free of tradeoffs. There are two real challenges it surfaces that sequential generation sidesteps entirely.
The first is voice consistency across six distinct posts. The model needs to write in the influencer's voice throughout the batch, even as the scenarios shift from an action-oriented product moment to a quieter, more reflective post. We handle this by including the influencer's real Instagram content in the prompt, giving the model concrete examples to calibrate against rather than relying on abstract tone descriptors.
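In practice this can look like prepending a few of the influencer's real captions to the shared context as calibration examples; a sketch with hypothetical names and placeholder data:

```python
def build_system_context(brand: dict, example_posts: list[str]) -> str:
    """Ground the model's voice in real content instead of abstract tone words."""
    parts = [
        f"Brand: {brand['name']}. Audience: {brand['audience']}.",
        "Match the voice of these real captions by the influencer:",
    ]
    parts += [f"- {post}" for post in example_posts]
    return "\n".join(parts)

# Placeholder brand and captions, for illustration only.
context = build_system_context(
    {"name": "DemoTea", "audience": "young professionals"},
    ["Slow mornings, strong brews.", "Post-run reward unlocked."],
)
```

Since this context is injected once per batch rather than once per caption, richer few-shot examples stay affordable.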
The second challenge is partial failure handling. With sequential calls, a failure is isolated: one post fails, you retry that one call. With a batched call, if the model returns malformed output, you risk losing all six captions at once. Our system validates the structured response field by field. If individual captions fail validation, we surface those failures at the post level rather than discarding the entire batch: the posts that passed are kept, and only the failed ones are flagged for retry.
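A sketch of that post-level validation (the shape is assumed; the real system validates richer fields than length alone):

```python
def validate_batch(captions: list, expected: int, max_len: int = 300):
    """Keep valid captions by slot index; flag failed slots for retry."""
    kept, retry = {}, []
    for i in range(expected):
        caption = captions[i] if i < len(captions) else None
        if isinstance(caption, str) and 0 < len(caption) <= max_len:
            kept[i] = caption   # this post's caption passed validation
        else:
            retry.append(i)     # only this slot gets re-generated
    return kept, retry

kept, retry = validate_batch(["good one", "", "also good"], expected=3)
# One empty caption fails; the other two slots survive the batch.
```

The retry list can then feed a smaller follow-up batch, so a single malformed field never costs the whole calendar.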
This is the trade-off you have to accept: batching wins on speed and cost, but it demands more careful error handling on the receiving end.
Why This Matters Beyond Captions
The batching pattern isn't unique to our caption step. We apply the same thinking to quality scoring: rather than scoring each post in a separate call, we evaluate all posts together in one structured request. The model can calibrate its scores relative to the batch, which produces more consistent and meaningful differentiation between high- and low-quality posts.
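The scoring call follows the same shape as the caption call: one request carrying every post, one structured response of scores assigned relative to the batch. A hypothetical sketch, again with a stubbed model reply:

```python
import json

def build_scoring_prompt(posts: list[str]) -> str:
    """Ask for 1-10 scores across the whole batch in one JSON object."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(posts))
    return ("Score each post 1-10 for quality, relative to the others in this batch.\n"
            'Respond as JSON: {"scores": [...]}\n' + numbered)

def parse_scores(raw: str, expected: int) -> list[int]:
    """Validate the structured score response before trusting it."""
    scores = json.loads(raw)["scores"]
    if len(scores) != expected or not all(1 <= s <= 10 for s in scores):
        raise ValueError("malformed score response")
    return scores

raw = json.dumps({"scores": [8, 5, 9]})  # stand-in for the model's reply
scores = parse_scores(raw, expected=3)
```

Because every score comes from the same forward pass, an 8 and a 5 in the same batch are directly comparable, which is exactly the calibration that isolated per-post calls lack.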
The broader principle is that LLMs work better with more context, not less. Giving the model the full picture (all six scenarios, the complete calendar) allows it to reason about the content as a coherent whole rather than as disconnected fragments. That's valuable for quality, and it happens to be cheaper and faster at the same time.
Where We're Taking This
Caption batching was one of the first places we applied this thinking. It won't be the last. As we expand the content types Jiwa AI generates, from more carousel formats to richer persona development and longer planning horizons, the opportunity to batch and contextualize grows with it.
The goal is a system that thinks about your content the way a strategic creative team would: seeing everything at once, finding the through-lines, and producing work that holds together as a whole. Getting the infrastructure right is what makes that possible.