When Your Pipeline Outlives Its Timeout: Fixing Silent Deaths in AI Content Generation

Jiwa AI Team

The Silent Death

Users would send /daftar on WhatsApp, connect their Instagram, and see the progress messages roll in: "Analyzing your brand..." then "Creating images... almost done!" And then, nothing. No images. No error. No way to retry without waiting 10 minutes.

The pipeline had died mid-execution, killed by Cloud Run's 300-second timeout. No catch block runs when your process ceases to exist.

Why It Got Worse Over Time

Our image generation pipeline grew from "generate an image" to a multi-gate quality system. Each image now passes through text detection (is there AI-generated gibberish?), product fidelity scoring (does the generated product match the real one?), and realism gating (does this look like a photograph or a render?). Each gate can trigger up to 2 retries, each retry generating a new image and re-scoring it.

For 8 posts with physical products, the worst case was 175 seconds for image generation alone, plus 60 seconds for quality scoring and caption retries. Add 40-75 seconds for the earlier analysis waves and the worst case totals 275-310 seconds, so the pipeline was regularly hitting 280+ seconds against a 300-second wall.

Time Budgets, Not Faster Code

The fix wasn't making each step faster. It was making the pipeline aware of its own time budget.

At the entry point, we wrap the entire pipeline in a Promise.race against a 280-second hard timeout. Inside the pipeline, a timeBudgetExceeded() function checks whether less than 120 seconds remain. When time is short, the pipeline degrades gracefully: image quality gates skip their retries (accepting the first-pass result), the quality scorer skips caption retries and batch cohesion checks, and carousel generation follows the same rules.
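A minimal sketch of those two pieces, not the production code. The 280-second and 120-second figures come from the text above; the helper name withHardTimeout, the startedAt/now parameters, and the choice to measure "remaining" against the 280-second hard timeout are assumptions made for illustration.

```typescript
// Figures from the post; measuring "remaining" against the hard timeout is an assumption.
const HARD_TIMEOUT_MS = 280 * 1000;
const LOW_BUDGET_MS = 120 * 1000;

// True once less than LOW_BUDGET_MS remains before the hard timeout.
// `now` is injectable so the check can be tested without waiting.
function timeBudgetExceeded(startedAt: number, now: number = Date.now()): boolean {
  const remaining = HARD_TIMEOUT_MS - (now - startedAt);
  return remaining < LOW_BUDGET_MS;
}

// Hypothetical wrapper: races the work against a timer and rejects if the timer wins.
// The timer is cleared either way so it does not keep the process alive.
async function withHardTimeout<T>(work: Promise<T>, ms: number = HARD_TIMEOUT_MS): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`pipeline exceeded ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

One caveat with this approach: Promise.race does not cancel the losing promise. When the timer wins, the pipeline keeps running in the background until the instance is torn down, so anything it writes after that point has to be safe to ignore.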

The key insight: a slightly lower-quality image delivered to the user is infinitely better than a perfect image that never arrives.

Three Layers of Protection

Layer 1 (soft timeout, 270s): Before starting expensive image generation, the pipeline checks elapsed time. If it's already past 270 seconds, it throws immediately rather than starting work it can't finish.

Layer 2 (time-budget-aware gates): The skipRetries flag flows from the orchestrator through to every quality gate. When the budget is tight, we still generate images and score them; we just don't retry the ones that score low. They get flagged for manual review instead.

Layer 3 (hard timeout, 280s): A Promise.race at the HTTP handler level ensures we always respond before Cloud Run kills us. On timeout, the business is marked as FAILED and the user gets a clear WhatsApp message: "Sorry, this took too long. Try again with /daftar."
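The first two layers can be sketched roughly as follows. This is an illustration under stated assumptions: the names assertCanStartImageGeneration, runGate, and GateResult are hypothetical, as are the generate and passes callbacks; only the 270-second threshold, the 2-retry limit, and the skipRetries flag come from the post.

```typescript
const SOFT_TIMEOUT_MS = 270 * 1000; // from the post
const MAX_RETRIES = 2;              // "up to 2 retries" per gate

interface GateResult<T> {
  image: T;
  passed: boolean;
  attempts: number;
  needsReview: boolean;
}

// Layer 1: refuse to start expensive work that cannot finish in time.
function assertCanStartImageGeneration(elapsedMs: number): void {
  if (elapsedMs > SOFT_TIMEOUT_MS) {
    throw new Error("soft timeout: not enough time left to generate images");
  }
}

// Layer 2: a single quality gate. With skipRetries set, the first-pass
// result is accepted as-is instead of being regenerated.
async function runGate<T>(
  generate: () => Promise<T>,
  passes: (image: T) => Promise<boolean>,
  skipRetries: boolean,
): Promise<GateResult<T>> {
  const maxAttempts = skipRetries ? 1 : 1 + MAX_RETRIES;
  let image = await generate();
  let passed = await passes(image);
  let attempts = 1;
  while (!passed && attempts < maxAttempts) {
    image = await generate();
    passed = await passes(image);
    attempts++;
  }
  // Assumption: any image that still fails is flagged for manual review,
  // whether retries were skipped or simply exhausted.
  return { image, passed, attempts, needsReview: !passed };
}
```

Passing skipRetries as a plain argument keeps each gate unaware of the clock; in this sketch only the orchestrator would decide when the budget is tight.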

Delivery Resilience

We also hardened the WhatsApp delivery phase. Previously, sending 10 image previews sequentially through Fonnte could take 30-50 seconds, with no timeout per image. Now each image send has a 15-second timeout (skip if Fonnte is slow), and the entire delivery phase has a 45-second budget. If delivery takes too long, we stop sending images and just send the dashboard link.
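A rough sketch of that delivery loop, assuming hypothetical sendImage and sendText callbacks that wrap the provider calls. The 15-second and 45-second figures are from the post; the function names, the message text, and the decisions to treat a failed send like a slow one and to always finish with the dashboard link are assumptions.

```typescript
const PER_IMAGE_TIMEOUT_MS = 15 * 1000; // from the post
const DELIVERY_BUDGET_MS = 45 * 1000;   // from the post

type SendResult = "sent" | "skipped";

// Resolves to "skipped" if the send is slow or fails; it never rejects.
function sendWithTimeout(send: () => Promise<void>, ms: number): Promise<SendResult> {
  return new Promise<SendResult>((resolve) => {
    const timer = setTimeout(() => resolve("skipped"), ms);
    send().then(
      () => { clearTimeout(timer); resolve("sent"); },
      () => { clearTimeout(timer); resolve("skipped"); },
    );
  });
}

// Sends previews one by one until the phase budget runs out, then sends the link.
// Returns how many images were actually delivered.
async function deliverPreviews(
  imageUrls: string[],
  sendImage: (url: string) => Promise<void>,
  sendText: (text: string) => Promise<void>,
  dashboardUrl: string,
  perImageMs: number = PER_IMAGE_TIMEOUT_MS,
  budgetMs: number = DELIVERY_BUDGET_MS,
): Promise<number> {
  const startedAt = Date.now();
  let sent = 0;
  for (const url of imageUrls) {
    if (Date.now() - startedAt >= budgetMs) break; // phase budget spent
    const result = await sendWithTimeout(() => sendImage(url), perImageMs);
    if (result === "sent") sent++;
  }
  await sendText(`Your content is ready: ${dashboardUrl}`);
  return sent;
}
```

Because the budget is only checked between sends, the phase can overrun by up to one per-image timeout in this sketch.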

The Stale Threshold

One subtle fix: we reduced the "stuck business" detection threshold from 10 minutes to 6 minutes. Since maxDuration is 300 seconds, any business still in GENERATING status after 6 minutes is definitely dead. Users can now retry faster instead of waiting the full 10 minutes.
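The check itself is small. Here is a sketch, assuming a hypothetical Business record with a generationStartedAt timestamp in milliseconds; only the GENERATING and FAILED statuses and the 6-minute threshold appear in the post, and READY is a placeholder.

```typescript
const STALE_AFTER_MS = 6 * 60 * 1000; // 6 minutes, from the post

// READY is a placeholder status for illustration.
type Status = "GENERATING" | "FAILED" | "READY";

interface Business {
  status: Status;
  generationStartedAt: number; // epoch milliseconds (assumed field)
}

// A business is stuck if it has been GENERATING for longer than any request can live.
function isStuck(business: Business, now: number = Date.now()): boolean {
  return business.status === "GENERATING" && now - business.generationStartedAt > STALE_AFTER_MS;
}
```

The one-minute gap between the 300-second maxDuration and the 6-minute threshold leaves slack for clock skew and slow status writes.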

The Swallowed Error

We also found a secondary bug: if image generation threw an exception, the error was caught and logged, but the pipeline continued to Wave 9, creating posts with empty captions and blank images, then delivering those blank posts to WhatsApp. Now, if generation produces zero posts, we skip delivery entirely and send an error message instead.
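A sketch of that guard, with hypothetical Post and Outcome types and a placeholder error message. Filtering out posts with an empty caption or image before counting is an extra assumption here; the post only says that zero posts means no delivery.

```typescript
interface Post {
  caption: string;
  imageUrl: string;
}

type Outcome =
  | { kind: "deliver"; posts: Post[] }
  | { kind: "error"; message: string };

// Decide what to do after generation: deliver usable posts, or report a failure.
function decideDelivery(posts: Post[]): Outcome {
  // Assumption: a post with a blank caption or image is not worth delivering.
  const usable = posts.filter((p) => p.caption.trim() !== "" && p.imageUrl.trim() !== "");
  if (usable.length === 0) {
    // Placeholder wording, not the real message.
    return { kind: "error", message: "Content generation failed. Try again with /daftar." };
  }
  return { kind: "deliver", posts: usable };
}
```

Returning an explicit outcome, rather than logging and continuing, forces the caller to handle the failure case before anything is sent.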

Reliability as a Feature

This change adds zero new features to the content generation pipeline. The images are the same, the captions are the same, the quality gates are the same. What changed is that users always get a response: either their content or a clear message telling them to retry. In a WhatsApp-first product, silence is the worst possible UX.