Testing an AI Pipeline in 350 Milliseconds
Fifty Functions, Zero Tests
Here is an uncomfortable truth about AI-heavy applications: they are hard to test, so most teams don't test them. Our content generation pipeline, the system that turns a business URL into a full Instagram content calendar, had over fifty functions across a dozen modules. It called Claude for brand analysis, Fal AI for image generation, Supabase for storage, and Fonnte for WhatsApp delivery. None of it had unit tests.
We had end-to-end tests that verified the web interface worked. Those caught UI regressions. But the business logic underneath (phone number normalization, message template interpolation, prompt construction, quality scoring thresholds) ran untested in production. Every change was a gamble.
The Testing Paradox in AI Systems
The challenge with testing AI pipelines is that the most critical functions call external AI services. You can't assert that Claude will return the exact same JSON every time. You can't verify that Fal AI will generate the same image from the same prompt. The outputs are non-deterministic by design.
Many teams respond to this by writing no tests at all, or writing only integration tests that call the real APIs. Both approaches have problems. No tests means you discover bugs in production. Integration tests against live APIs are slow, expensive, and flaky: a single Claude call costs cents and takes seconds, and the response varies.
We took a third approach: test the boundaries, not the AI.
Testing the Seams
An AI pipeline has deterministic seams around its non-deterministic core. Phone numbers get normalized before any API call. Messages get interpolated from templates. Prompts get assembled from structured inputs. Quality scores get compared against thresholds. Retry logic follows fixed rules. These seams are where bugs actually live.
When a WhatsApp message arrives, the phone number goes through normalization: stripping spaces and dashes, converting local Indonesian format to international, fixing double country codes, validating digit count. This function doesn't call any AI. It's pure logic with edge cases. We wrote eight tests covering Indonesian numbers, international formats, already-clean inputs, and malformed strings. They run in three milliseconds.
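A normalizer like the one described might look something like this. This is an illustrative sketch, not the production function; the name `normalizePhone` and the exact validation rules are assumptions:

```typescript
// Illustrative phone normalizer: converts Indonesian numbers to
// international "62..." form. Pure logic, no I/O, trivially unit-testable.
function normalizePhone(raw: string): string | null {
  let digits = raw.replace(/[\s\-().]/g, "");              // strip spaces, dashes, punctuation
  if (digits.startsWith("+")) digits = digits.slice(1);     // "+62..." -> "62..."
  if (digits.startsWith("0")) digits = "62" + digits.slice(1); // local "08xx" -> "628xx"
  if (digits.startsWith("6262")) digits = digits.slice(2);  // fix doubled country code
  if (!/^62\d{8,12}$/.test(digits)) return null;            // validate digit count
  return digits;
}

console.assert(normalizePhone("0812-3456-7890") === "6281234567890");
console.assert(normalizePhone("+62 812 3456 7890") === "6281234567890");
console.assert(normalizePhone("not a number") === null);
```

Because the function is pure, each edge case (local format, double country code, malformed input) becomes a one-line assertion, which is why eight of them run in milliseconds.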
Our message system interpolates bilingual templates with variable substitution. A template like "Content for {name} is ready! {count} posts created." needs to handle missing variables, numeric interpolation, and language switching. Five tests cover the critical paths. Two milliseconds.
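The interpolation logic can be sketched as a pure function over a template table. The template keys, the `Lang` type, and the policy of leaving missing placeholders visible are assumptions for illustration:

```typescript
// Illustrative bilingual template interpolation (assumed module shape).
type Lang = "en" | "id";

const templates: Record<string, Record<Lang, string>> = {
  contentReady: {
    en: "Content for {name} is ready! {count} posts created.",
    id: "Konten untuk {name} sudah siap! {count} post dibuat.",
  },
};

function interpolate(
  key: string,
  lang: Lang,
  vars: Record<string, string | number>
): string {
  const template = templates[key]?.[lang] ?? "";
  // A missing variable is left as "{name}" so the bug is visible in output,
  // rather than silently replaced with an empty string.
  return template.replace(/\{(\w+)\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match
  );
}
```

Tests then cover numeric interpolation, missing variables, and language switching without any I/O at all.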
Together with our existing tests for engagement correlation math, hashtag tier enforcement, product positioning guards, and composite shadow calculations, we now have thirty-nine unit tests. The entire suite runs in under 350 milliseconds.
What We Don't Test (And Why That's Fine)
We don't test that Claude returns valid JSON from our brand analysis prompt. We don't test that Fal AI generates an image without artifacts. We don't test that the Instagram Graph API accepts our media containers. These are integration concerns that belong in end-to-end tests with mocked responses, not in unit tests.
What we do test is everything around those calls: the inputs we construct, the outputs we validate, the fallback logic we execute when calls fail, and the data transformations we apply to results. If the prompt builder produces the wrong structure, the AI call will return garbage regardless of the model's capability. If the phone normalizer strips too many digits, the WhatsApp notification will never arrive.
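The pattern can be sketched like this: assert on the prompt you would send and on the fallback you take when the call fails, never on the model's reply. All names here (`buildBrandPrompt`, `withFallback`) are hypothetical stand-ins, not the real module:

```typescript
// Sketch of testing around an AI call. The AI itself is never invoked.
type BrandInput = { url: string; industry: string };

// Hypothetical prompt builder: a pure function, so its structure is testable.
function buildBrandPrompt(input: BrandInput): string {
  return (
    `Analyze the brand at ${input.url} (industry: ${input.industry}). ` +
    `Respond with JSON: {"tone": string, "palette": string[]}`
  );
}

// Hypothetical fallback: when the AI call returns nothing usable,
// fall back to a safe default instead of propagating null downstream.
function withFallback<T>(result: T | null, fallback: T): T {
  return result ?? fallback;
}

console.assert(
  buildBrandPrompt({ url: "https://example.com", industry: "food" })
    .includes("https://example.com")
);
console.assert(withFallback(null, "casual") === "casual");
```

If the prompt builder drops the URL or the fallback path returns null, these tests fail in milliseconds; no tokens are spent finding out.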
The most common bugs in our pipeline have never been in the AI itself. They've been in the glue code: a function signature that accepted three parameters but was called with five, a database query that fetched every record instead of filtering by phone number, a message template that was hardcoded in one language when the user expected another. These are exactly the bugs that fast, cheap unit tests catch.
Making Tests a Gate, Not an Afterthought
Adding tests is only half the battle. If nobody runs them, they decay. We added test scripts to our project configuration and a dedicated unit test job in our continuous integration pipeline. The unit tests now run before the build step; if they fail, nothing else executes. This means a broken phone normalizer blocks deployment just as effectively as a type error.
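The ordering can be expressed as a simple fail-fast sequence. This is a minimal sketch assuming npm-style script names, not our actual CI configuration:

```shell
#!/bin/sh
# Fail-fast CI gate sketch: set -e aborts on the first non-zero exit,
# so a failing unit test stops the pipeline before the build runs.
set -e
npm run typecheck
npm run test:unit    # ~350 ms; a failure blocks everything below
npm run build
npm run test:e2e
```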
The entire CI pipeline (typecheck, unit tests, build, and end-to-end tests) runs in under eight minutes. The unit test step adds less than ten seconds to that total. For that cost, we get confidence that the deterministic boundaries of our AI pipeline work correctly every time we push.
The 350-Millisecond Feedback Loop
The real value isn't in the tests themselves. It's in the feedback loop. When we change how trending topics get threaded into scene planning prompts, we can verify in under a second that the prompt builder still produces valid output. When we refactor the message system, we know immediately if we broke interpolation. When we add a new country code to our phone normalizer, we confirm it doesn't break Indonesian numbers.
Fast tests encourage more testing. When your test suite takes ten minutes, you run it once before a pull request. When it takes 350 milliseconds, you run it after every change. That's the difference between catching bugs in code review and catching them while you're still writing the code.