
Schema Drift: The Invisible Production Failure No Log Will Warn You About

Jiwa AI Team

When the Error Message Lies

There is a special kind of production failure that is worse than a crash. It is the failure that looks like a different problem entirely.

Our dashboard was showing "Business not found." to users who had just created a business and were clicking on it from their own list. The business clearly existed; they had just onboarded. The dashboard showed it. But the detail page refused to load it.

The error message was wrong. The business was found. It just could not be fetched.

The Real Culprit: Schema Drift

Our infrastructure deploys through Google Cloud Run. When we merge to main, a new container image is built, the application is compiled, and a fresh Prisma client is generated from the current schema. The database, however, is not automatically updated.

This works well for most schema changes. When we add a new table or column, we run a database push as part of the release process. What broke was the gap between "the process we intended to follow" and "the process the CI pipeline actually enforces."

A recent release added two schema changes: a new JSON field on posts for storing quality evaluation results, and a new table for tracking asynchronous reel generation jobs. The Prisma client in the new container expected these to exist. The production database did not have them. Every query against the posts table failed at the database level with a column-not-found error.

That error was caught by our API's catch block. The catch block returned a 500. The client page treated any non-200 response as a missing business. The user saw "Business not found."
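A minimal sketch of how that misdirection arises. The names and shapes here are hypothetical, not our actual client code; the point is that one branch swallows three very different failures:

```typescript
// Hypothetical simplification of the original client-side logic.
// The bug: every non-200 response collapses into "not found".
type BusinessResult =
  | { status: "ok"; name: string }
  | { status: "error"; message: string };

function renderBusiness(
  httpStatus: number,
  body?: { name: string }
): BusinessResult {
  if (httpStatus === 200 && body) {
    return { status: "ok", name: body.name };
  }
  // 401, 404, and 500 all fall through to here, so the 500 caused
  // by schema drift is reported to the user as a missing business.
  return { status: "error", message: "Business not found." };
}
```

The catch block at the API layer and this fallthrough at the UI layer are each defensible in isolation; composed, they erase the real error.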

Three layers of abstraction, each doing the right thing locally, produced a deeply misleading outcome globally.

Why This Class of Failure Is Easy to Miss

Schema drift is invisible before deployment because the Prisma client is generated at build time and type-checks pass. The client believes the columns exist because it generated the types from the schema that says they should exist. TypeScript cannot verify runtime database state.

It is also invisible in local development, where developers run database pushes as a matter of course. The gap only appears in production, or in staging environments where the database is treated as a permanent fixture rather than a schema-managed artifact.

The deeper issue is that our pipeline had a manual step in a mostly automated flow. Every other part of deployment (building, compiling, pushing to the registry, updating the Cloud Run service) happens without human action. Database schema sync required someone to remember. Humans forget.

Making the Implicit Explicit

The fix was mechanical once the root cause was clear: push the schema before the release completes. But the more durable fix was to acknowledge that the deployment pipeline itself needed to represent this dependency.

A release is not complete when the container is deployed. It is complete when the container is deployed AND the database schema matches what the container expects. Until that is true, the release is in a broken intermediate state.
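One way to make that invariant explicit in code. The types and version strings below are hypothetical, a sketch of the idea rather than how our real pipeline encodes it:

```typescript
// A release is healthy only when both halves of the invariant hold:
// the container is live AND the database schema matches what it expects.
interface ReleaseState {
  containerDeployed: boolean;
  schemaVersionInDb: string;
  schemaVersionInContainer: string;
}

function releaseComplete(s: ReleaseState): boolean {
  return (
    s.containerDeployed &&
    s.schemaVersionInDb === s.schemaVersionInContainer
  );
}
```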

We also took the opportunity to improve error propagation in the client. A 500 response from the server does not mean the business is missing. It means something went wrong. These are different situations and they deserve different messages. We updated the UI to distinguish between a business that does not exist, a request that was unauthorized, and a server failure that the user should retry. Small change, significant difference in what a user understands when something goes wrong.
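A sketch of the distinction we now draw. The status codes and category names are illustrative, not our exact implementation:

```typescript
// Illustrative mapping from HTTP status to a user-facing state.
// Each category gets its own message instead of one catch-all.
type ResponseKind = "ok" | "not-found" | "unauthorized" | "server-error";

function classifyResponse(status: number): ResponseKind {
  if (status === 200) return "ok";
  if (status === 404) return "not-found"; // the business genuinely does not exist
  if (status === 401 || status === 403) return "unauthorized";
  return "server-error"; // likely transient; show a retry prompt, not "not found"
}
```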

The Broader Pattern: Build Pipeline Trust

Our pipeline for generating AI content has extensive resilience built in. Image generation has a five-level fallback chain. Post generation continues even when individual steps fail. The entire onboarding pipeline is designed to degrade gracefully rather than fail completely.

The infrastructure deployment pipeline had not received the same care. We had invested heavily in making the AI system resilient to partial failures but had a single-point dependency in the deployment process: a manual step that could be forgotten.

This is a common asymmetry. Product features get careful design. Deployment processes often grow organically and accumulate manual steps that seem harmless until they cause an incident.

Good infrastructure is designed with the same intentionality as good product architecture. Every manual step in a deployment is a future incident waiting to happen.

Looking Ahead

We are incorporating schema sync into the automated deployment pipeline so it runs as part of every release. We are also extending our approach to surface schema drift in pre-deployment checks: comparing the generated client's expectations against the live database before traffic is shifted to a new revision.
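Conceptually, that check reduces to a set difference between the columns the generated client expects and the columns the live database reports. A simplified sketch; in practice the live side would come from querying `information_schema` or from Prisma's migration tooling:

```typescript
// Simplified drift detector: which expected columns are absent
// from the live table? A non-empty result means the release would
// ship a client that queries columns the database does not have.
function missingColumns(expected: string[], live: string[]): string[] {
  const liveSet = new Set(live);
  return expected.filter((col) => !liveSet.has(col));
}

// Hypothetical example mirroring our incident: the release added a
// quality-evaluation field to posts, but production lacked it.
// missingColumns(["id", "title", "qualityEval"], ["id", "title"])
// returns ["qualityEval"], so the deploy should halt before shifting traffic.
```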

If your deployment pipeline has a step that depends on a human remembering something, that step deserves the same attention as any other failure mode in your system. Consistency comes from making the right thing the automatic thing.