
How We Made Our Content Pipeline More Reliable: Quality Gates, Smart Retries, and Better Diagnostics

Jiwa AI Team

Why Pipeline Reliability Matters

When you're generating AI content at scale (captions, images, and delivery via WhatsApp), every silent failure is a missed opportunity. A caption that sounds generic, an image-generation fallback that never logs which strategy worked, or a WhatsApp message that disappears into the void: these add up.

This week, we shipped a focused set of improvements targeting the reliability and observability of our content pipeline.

What We Improved

1. Caption Quality: Catching Generic Language Before It Ships

AI language models sometimes fall back to clichéd promotional phrases: "game changer", "must-have", "you need this". These phrases feel inauthentic and undermine the influencer voice we work so hard to match.

We added a zero-cost deterministic anti-pattern check that scans every generated caption for forbidden phrases before they leave the pipeline. When a generic phrase is detected, it's flagged with a warning including the exact phrase and a caption preview. This gives our quality gate an early signal to trigger a retry.
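The check itself can be sketched in a few lines. This is a minimal illustration, not our exact implementation: the phrase list, function name, and result shape are hypothetical.

```typescript
// Hypothetical sketch of the deterministic anti-pattern check.
// The phrase list here is illustrative, not our full forbidden list.
const FORBIDDEN_PHRASES = ["game changer", "must-have", "you need this"];

interface AntiPatternResult {
  flagged: boolean;
  phrase?: string;   // the exact forbidden phrase that matched
  preview?: string;  // short caption preview for the warning log
}

function checkAntiPatterns(caption: string): AntiPatternResult {
  const lower = caption.toLowerCase();
  for (const phrase of FORBIDDEN_PHRASES) {
    if (lower.includes(phrase)) {
      return { flagged: true, phrase, preview: caption.slice(0, 60) };
    }
  }
  return { flagged: false };
}
```

Because the scan is pure string matching, it costs nothing per caption and runs before any model call in the retry loop.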

We also improved malformed response diagnostics. When a batch caption generation returns incomplete data, we now log exactly which fields are missing per post (e.g., "index 3: missing variantA.caption and caption") instead of a generic "malformed" warning. This makes debugging significantly faster.
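A rough sketch of that per-post diagnostic, assuming a simplified response shape (the `variantA.caption` and `caption` field names follow the example above; the function name is illustrative):

```typescript
// Illustrative sketch: report exactly which caption fields are
// missing for each post in a batch response, instead of a single
// generic "malformed" warning.
interface RawPost {
  caption?: string;
  variantA?: { caption?: string };
}

function missingFieldReport(posts: RawPost[]): string[] {
  const problems: string[] = [];
  posts.forEach((post, index) => {
    const missing: string[] = [];
    if (!post.variantA?.caption) missing.push("variantA.caption");
    if (!post.caption) missing.push("caption");
    if (missing.length > 0) {
      problems.push(`index ${index}: missing ${missing.join(" and ")}`);
    }
  });
  return problems;
}
```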

2. Quality Gate: Configurable Thresholds

Our quality gate scores every post on caption quality (70% weight) and visual quality (30% weight), then auto-retries posts that score below threshold.

Previously, the retry limits and score thresholds were buried as magic numbers in the code. We extracted them into named constants:

  • Caption retry threshold: 55/100 (posts below this get caption rewrites)
  • Image retry threshold: 50/100 (posts below this get image regeneration)
  • Review flag threshold: 40/100 (posts still low after retry get flagged for human review)
  • Max caption retries: Increased from 2 to 3 posts per batch

This makes it trivial to tune quality standards per business tier in the future โ€” a luxury brand can set a higher bar than an SME.
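Put together, the weighted score and gate decision look roughly like this. The constant and function names are hypothetical; the weights and cutoffs are the ones listed above.

```typescript
// Sketch of the weighted quality score and named threshold constants.
const CAPTION_WEIGHT = 0.7;
const VISUAL_WEIGHT = 0.3;
const CAPTION_RETRY_THRESHOLD = 55; // below this: rewrite the caption
const IMAGE_RETRY_THRESHOLD = 50;   // below this: regenerate the image
const REVIEW_FLAG_THRESHOLD = 40;   // still below this: flag for human review
const MAX_CAPTION_RETRIES = 3;      // per batch, up from 2

function overallScore(captionScore: number, visualScore: number): number {
  return captionScore * CAPTION_WEIGHT + visualScore * VISUAL_WEIGHT;
}

type GateAction = "pass" | "retry_caption" | "retry_image" | "flag_review";

function gateAction(
  captionScore: number,
  visualScore: number,
  retriesUsed: number,
): GateAction {
  if (captionScore < CAPTION_RETRY_THRESHOLD && retriesUsed < MAX_CAPTION_RETRIES) {
    return "retry_caption";
  }
  if (visualScore < IMAGE_RETRY_THRESHOLD) return "retry_image";
  if (overallScore(captionScore, visualScore) < REVIEW_FLAG_THRESHOLD) {
    return "flag_review";
  }
  return "pass";
}
```

Because the thresholds are plain constants rather than inline numbers, swapping them for per-tier config values later is a one-line change per threshold.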

3. Image Generation: Strategy Observability

Our image orchestrator uses a multi-strategy fallback chain. For product posts: hybrid composite, IP-Adapter, then generic. For UGC posts: multi-IP-Adapter, PuLID face, then generic.

Previously, we only logged when a strategy failed. Now we log when each strategy succeeds, making it possible to answer questions like:

  • What percentage of product posts use the hybrid strategy vs. falling back to generic?
  • Is PuLID face reliability improving or degrading over time?

We also extracted the realism score threshold (minimum score to accept without retry) into a named constant, making it easy to adjust as our image models improve.
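The success logging slots naturally into the fallback loop. A minimal sketch, assuming a simplified synchronous strategy interface (our real strategies are async and carry more context; the names below come from the chains described above):

```typescript
// Minimal sketch of a strategy fallback chain that logs successes
// as well as failures, so strategy usage rates become measurable.
type Strategy = { name: string; generate: () => string };

function runWithFallback(
  strategies: Strategy[],
  log: (message: string) => void,
): string | null {
  for (const strategy of strategies) {
    try {
      const imageUrl = strategy.generate();
      // New: log the winning strategy, not just the failures.
      log(`image strategy succeeded: ${strategy.name}`);
      return imageUrl;
    } catch {
      log(`image strategy failed: ${strategy.name}`);
    }
  }
  return null; // every strategy failed, including the generic fallback
}
```

With both success and failure lines in the logs, the two questions above reduce to counting log entries per strategy name.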

4. WhatsApp Delivery: Hardened Against Network Failures

Our WhatsApp delivery via Fonnte previously had no fetch timeout; a slow API response could hang indefinitely. We added:

  • 30-second fetch timeout via AbortController: no more indefinite hangs
  • HTTP status validation: we now check res.ok before attempting the JSON parse, catching 5xx errors that would otherwise produce cryptic parse failures
  • Structured response: callers now receive the HTTP status code alongside the API response, enabling upstream retry logic
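The three changes above combine into a delivery call shaped roughly like this. The function names, payload shape, and result type are illustrative; the 30-second timeout, res.ok check, and status-bearing result are the real changes.

```typescript
const FETCH_TIMEOUT_MS = 30_000;

interface DeliveryResult {
  status: number; // HTTP status, surfaced for upstream retry logic
  ok: boolean;
  body: unknown;
}

// Validate the status before parsing: a 5xx HTML error page would
// otherwise surface as a cryptic JSON parse failure.
async function toDeliveryResult(res: Response): Promise<DeliveryResult> {
  if (!res.ok) {
    return { status: res.status, ok: false, body: await res.text() };
  }
  return { status: res.status, ok: true, body: await res.json() };
}

async function sendWhatsApp(
  url: string,
  payload: Record<string, string>,
): Promise<DeliveryResult> {
  // AbortController enforces the 30-second ceiling on the request.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    return await toDeliveryResult(res);
  } finally {
    clearTimeout(timer); // always clear, success or failure
  }
}
```

An aborted request rejects with an AbortError, which the caller can treat the same as any other transient delivery failure and retry.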

The Numbers

| Pipeline Stage | Before | After | Improvement |
| --- | --- | --- | --- |
| Caption Generation | 4.0/5 | 4.2/5 | Anti-patterns + diagnostics |
| Image Generation | 3.5/5 | 3.7/5 | Strategy logging + constants |
| Quality Scoring | 3.5/5 | 3.8/5 | Configurable thresholds |
| Content Delivery | 2.5/5 | 3.0/5 | Timeout + HTTP checks |
| Overall | 3.5/5 | 3.7/5 | |

What's Next

These improvements lay the groundwork for our next cycle of enhancements:

  • Adaptive quality thresholds per business tier
  • A/B variant tracking to learn which caption approach works best per influencer
  • Influencer data refresh to keep matching scores current
  • Integration tests for the core pipeline stages

Every critique cycle moves us closer to a pipeline that doesn't just generate content, but learns what makes content great.