One Image, Three Strategies, Zero Wasted Calls
The Scattered Pipeline Problem
Our image generation code had grown organically. Product posts took one path through the codebase. Influencer UGC posts took another. Carousel covers took a third. Each path had its own prompt builder, its own fallback chain, and its own way of handling failures, mostly by silently swallowing them.
The result was three parallel implementations doing roughly the same thing: construct a prompt, call Fal AI, handle errors, overlay text, persist to storage. When we wanted to improve prompt quality or add a new generation strategy, we had to change it in three places. When something failed, we had no visibility into which strategy was attempted or why it broke.
We needed a single orchestrator that understood the intent behind each post type and picked the right strategy automatically.
Routing by Context Type
The composite image orchestrator is a single function that takes a context type (Product Post, UGC Post, or Carousel Post) and runs the appropriate generation cascade.
For product posts, it tries the highest-fidelity approach first: generate an AI background, remove the product's background via BiRefNet, and composite the real product cutout on top. This produces the most accurate product representation because the actual product photo is preserved. If that fails, it falls back to IP-Adapter generation, which is cheaper but renders an approximation of the product. If that also fails, it generates a generic product scene.
For UGC posts, the cascade starts with multi-IP-Adapter generation: face and product references merged in a single Fal AI call. This is the best quality when both references are available. If the business has no product images, it skips straight to PuLID face-only generation, saving the three cents that flux-general would have cost for no benefit. Generic generation is the final fallback.
For carousel covers, it generates an influencer-led hook image via PuLID, with no product visible, to build curiosity.
Every fallback now logs what failed and why, instead of silently catching errors. When we review generation logs, we can see exactly which strategy each post used and where the cascade stopped.
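The routing logic can be sketched as an ordered list of (strategy, generator) pairs per context type, walked until one succeeds. This is a minimal sketch, not our production code; the generator function names and the `fal://` placeholder return values are hypothetical stand-ins for the real Fal AI calls.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("image_orchestrator")

# Stub generators standing in for the real Fal AI calls; names are hypothetical.
def gen_background_composite():
    # AI background + BiRefNet cutout + composite of the real product photo
    return "fal://background-composite"

def gen_ip_adapter():
    return "fal://ip-adapter"

def gen_multi_ip_adapter():
    return "fal://multi-ip-adapter"

def gen_pulid():
    return "fal://pulid"

def gen_generic():
    return "fal://generic-scene"

def build_cascade(context_type, has_product_images=True):
    """Ordered (strategy_name, generator) pairs for each post type."""
    if context_type == "product_post":
        return [("background_composite", gen_background_composite),
                ("ip_adapter", gen_ip_adapter),
                ("generic_scene", gen_generic)]
    if context_type == "ugc_post":
        steps = []
        if has_product_images:
            steps.append(("multi_ip_adapter", gen_multi_ip_adapter))
        steps.append(("pulid_face_only", gen_pulid))
        steps.append(("generic_scene", gen_generic))
        return steps
    # Carousel covers: influencer-led hook image, no product visible.
    return [("pulid_hook", gen_pulid)]

def generate(context_type, has_product_images=True):
    """Walk the cascade, logging every failure instead of swallowing it."""
    for name, fn in build_cascade(context_type, has_product_images):
        try:
            url = fn()
            log.info("strategy %s succeeded", name)
            return name, url
        except Exception as exc:
            log.warning("strategy %s failed: %s", name, exc)
    raise RuntimeError(f"all strategies failed for {context_type}")
```

The key design point is that skipping a strategy (no product images) and falling back from one (a failed call) flow through the same ordered list, so the log always shows exactly which strategies were attempted.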
The Prompt Length Problem
A subtle quality issue had been hiding in plain sight. Our Fal AI prompts include brand colors, visual style, mood board guidance, camera settings, and anti-text instructions. A fully loaded UGC prompt easily reaches eighteen hundred characters. But we had a hard limit of fifteen hundred characters, silently truncating everything past that point.
The truncated portion was almost always the brand color instructions and the camera quality directives: the parts at the end of the prompt that determine whether the generated image looks like it belongs to this brand or looks like generic stock photography.
Increasing the limit to two thousand characters costs nothing; Fal charges per image, not per prompt token. The quality improvement was immediate: brand colors started appearing consistently in generated images.
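The fix amounts to one constant plus a loud warning where there used to be silent truncation. A minimal sketch, with a hypothetical `assemble_prompt` helper:

```python
import logging

MAX_PROMPT_CHARS = 2000  # raised from 1500; Fal bills per image, not per prompt token

def assemble_prompt(*sections):
    """Join non-empty prompt sections; warn loudly instead of silently truncating."""
    prompt = ". ".join(s.strip() for s in sections if s and s.strip())
    if len(prompt) > MAX_PROMPT_CHARS:
        logging.warning("prompt is %d chars, truncating to %d",
                        len(prompt), MAX_PROMPT_CHARS)
        prompt = prompt[:MAX_PROMPT_CHARS]
    return prompt
```

The warning matters as much as the larger limit: if prompts ever creep past two thousand characters again, the logs will say so instead of the images quietly losing their brand colors.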
Knowing When Not to Analyze
Our safe zone analyzer uses Claude Haiku Vision to find the best text placement region in each generated image. At two-tenths of a cent per call, it's cheap individually. But across a six-post batch with four images needing text, it adds up to nearly a cent of pure overhead.
The insight was that most of the time, we already know where text should go. The caption generation step predicts a text position for each post. Carousels always use center positioning because the background is blurred for readability. The only case where vision analysis actually helps is when the position is marked as "auto", meaning the system couldn't predict it from context.
Now we only call the safe zone analyzer when position is genuinely unknown. For a typical batch, this eliminates three or four vision calls, saving six-tenths of a cent per batch with no quality impact.
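The gate is a few lines of branching. A minimal sketch, with hypothetical field names (`type`, `text_position`) and the per-call cost from the numbers above:

```python
VISION_CALL_COST = 0.002  # about two-tenths of a cent per safe-zone call

def needs_safe_zone_analysis(post):
    """Only spend a vision call when the text position is genuinely unknown."""
    if post["type"] == "carousel":
        return False  # carousels always center text over a blurred background
    return post.get("text_position", "auto") == "auto"

def batch_vision_cost(posts):
    """Sum the vision spend for a batch after the gate is applied."""
    return sum(VISION_CALL_COST for p in posts if needs_safe_zone_analysis(p))
```

A predicted position short-circuits the analyzer entirely; only true "auto" posts pay for vision.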
Teaching the Calendar About Ramadan
The pipeline had no awareness of Indonesian cultural events. A content calendar generated during Ramadan would miss the opportunity for iftar moments, sahur themes, and Lebaran gifting angles. Worse, it might suggest images of daytime eating, which is culturally inappropriate during the fasting month.
We added a static lookup table of major Indonesian events with hardcoded dates for 2026 through 2028. Ramadan, Lebaran, Hari Kemerdekaan, Imlek, Natal. Each entry includes a context string that gets injected into the calendar generation prompt: what themes to lean into, what to avoid, what opportunities exist.
For captions, we added guidance on casual Indonesian social media language: using "gak" instead of "tidak", and natural use of "banget", "sih", "dong". It's the kind of language real Indonesian influencers use, and the kind a language model defaults away from when it's being polite.
This is a hardcoded table, not a dynamic API call. It costs nothing. It shifts the needle on cultural relevance more than any model upgrade would.
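The table itself is just tuples and a date-range lookup. A minimal sketch; the entries and dates below are illustrative placeholders, not the verified 2026 through 2028 values in the production table:

```python
from datetime import date

# Hypothetical entries; the dates are illustrative, not the verified
# 2026-2028 values the real table hardcodes.
CULTURAL_EVENTS = [
    ("Ramadan", date(2026, 2, 18), date(2026, 3, 19),
     "Lean into iftar moments, sahur themes, and Lebaran gifting angles. "
     "Avoid imagery of daytime eating during the fasting month."),
    ("Hari Kemerdekaan", date(2026, 8, 17), date(2026, 8, 17),
     "Independence Day: red-and-white palettes, national pride themes."),
]

def active_event_context(on_date):
    """Context strings for every event active on the given date,
    ready to inject into the calendar generation prompt."""
    return [ctx for _name, start, end, ctx in CULTURAL_EVENTS
            if start <= on_date <= end]
```

Because the dates are hardcoded rather than fetched, the lookup is free, deterministic, and testable; the tradeoff is that someone has to extend the table before 2029.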
Preventing Mismatched Pairings
The calendar generator sometimes paired a fitness influencer with a dessert brand, or a beauty influencer with a sports drink. The influencer-brand matching scored well at the aggregate level, but individual product assignments didn't account for niche compatibility.
The fix is deterministic. After the calendar is generated, we check each slot: does the assigned influencer have any niche keywords that overlap with the product's content keywords? If not, and another influencer in the pool has overlap, we swap. No AI call needed โ just string matching on arrays that already exist in the data model.
We added the influencer's niche list to the match interface and injected it into the calendar prompt as well, so the AI makes better initial choices. But the post-processing swap catches what the AI misses.
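The swap logic is plain set intersection over keyword arrays. A minimal sketch, assuming hypothetical `niches` and `keywords` fields on the existing data model:

```python
def niches_overlap(influencer, product):
    """True when any influencer niche keyword matches a product content keyword."""
    return bool({k.lower() for k in influencer["niches"]}
                & {k.lower() for k in product["keywords"]})

def fix_pairings(slots, influencer_pool):
    """Deterministic post-processing: if the AI-assigned influencer has no
    niche overlap with the product, swap in the first pool member that does.
    Pure string matching, no AI call."""
    for slot in slots:
        if niches_overlap(slot["influencer"], slot["product"]):
            continue
        for candidate in influencer_pool:
            if niches_overlap(candidate, slot["product"]):
                slot["influencer"] = candidate
                break
    return slots
```

If no one in the pool overlaps either, the slot is left alone, so the fix can only improve a pairing, never degrade one.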
Verifying What the AI Generated
Quality scoring had a blind spot: it only evaluated captions. A post with a perfect caption but a terrible image (wrong colors, distorted face, product not visible) would score ninety and ship.
We added a visual verification step using Claude Haiku Vision. After images are generated, a single vision call checks the first image in the batch for product visibility, brand color presence, artifacts, and composition quality. The visual score is merged with the caption score at a seventy-thirty weighting.
This runs in parallel with caption scoring, adding no latency. The cost is two-tenths of a cent per batch. For carousels, a separate vision check evaluates the hook slide and re-overlays with stronger contrast if the text is unreadable.
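The merge itself is a weighted average. A minimal sketch; the original doesn't say which side of the seventy-thirty split the caption score sits on, so this sketch assumes captions carry the 0.7 weight:

```python
def merged_quality(caption_score, visual_score, caption_weight=0.7):
    """Blend caption and visual scores on a 0-100 scale.
    Assumption: the 70 in the 70/30 split is the caption side."""
    return round(caption_weight * caption_score
                 + (1 - caption_weight) * visual_score, 1)
```

Under this weighting, the perfect-caption, terrible-image case no longer sails through: a 90 caption over a 40 image lands at 75, low enough to be noticed.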
Retrying What Scored Poorly
Previously, a post that scored thirty-five out of a hundred would be saved and shipped alongside posts scoring eighty. The quality score existed but had no consequences.
Now, posts scoring below fifty-five trigger a caption-only retry. The retry prompt includes the original caption and the specific quality feedback: "scored 42: product feels forced into fitness context." A single Haiku call rewrites the failing captions with that feedback as guidance.
We cap this at one retry round and two posts per batch. We never regenerate images โ caption generation costs a tenth of a cent via Haiku, while image regeneration costs three cents. The economics only make sense for text.
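The selection rule is small enough to show whole. A minimal sketch with hypothetical field names (`score`, `caption`, `feedback`, `retried`):

```python
RETRY_THRESHOLD = 55
MAX_RETRIES_PER_BATCH = 2

def select_for_retry(posts):
    """Lowest scorers under the threshold, capped at two per batch.
    The `retried` flag enforces a single retry round."""
    failing = [p for p in posts
               if p["score"] < RETRY_THRESHOLD and not p.get("retried")]
    failing.sort(key=lambda p: p["score"])
    return failing[:MAX_RETRIES_PER_BATCH]

def retry_prompt(post):
    # Captions only: a Haiku rewrite is roughly 0.1 cent vs ~3 cents
    # to regenerate an image, so images are never retried.
    return (f'Rewrite this caption, keeping the post intent. '
            f'Original: "{post["caption"]}" '
            f'Quality feedback: {post["feedback"]}')
```

Sorting before capping means the two worst posts get the retry budget, not whichever two happen to come first in the batch.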
The Compound Economics
Each optimization is small. Skipping unnecessary safe zone calls saves six-tenths of a cent. The dedup cache saves three cents when it triggers. The auto-retry costs a tenth of a cent when needed. The visual verification adds two-tenths of a cent.
Together, the net cost per batch actually decreased by four-tenths of a cent while quality improved across five dimensions: prompt fidelity, cultural relevance, pairing accuracy, visual verification, and caption retry.
The budget for all new quality features was one cent per batch. We came in at six-tenths of a cent added, more than offset by the cost reductions. The pipeline generates better content and costs less to run. That's the kind of optimization worth shipping.