
Teaching AI What Not to Say in Indonesian

Jiwa AI Team

The Language the Algorithm Loves

AI language models are optimized to produce engaging, persuasive text. Left unconstrained, that optimization tends toward urgency, scarcity, and bold claims: the same rhetorical patterns that dominate low-credibility advertising. In English, these patterns are easy to spot. In Indonesian, they are subtler, and the consequences for brand trust are just as real.

When we first built our caption quality layer, we identified eight Indonesian phrases that our pipeline should never produce: explicit urgency triggers, unverifiable health claims, and manipulative scarcity tactics. The list covered the obvious cases. It did not cover what we actually encountered in production.

From 8 to 26: What We Missed

The original eight phrases were the ones any marketer would know to avoid. But Indonesian social commerce has its own vocabulary of trust manipulation: phrases that feel native and conversational but carry the same manipulative weight as their more obvious counterparts.

We found phrases suggesting guaranteed satisfaction without conditions, clinical-sounding efficacy claims with no basis, promotional pricing language that implies artificial urgency, and stock scarcity signals that create pressure without information. None of these appeared in our original list. All of them appeared in our generated captions.
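To make those categories concrete, here is a minimal sketch of how such a list might be organized. The Indonesian phrases below are common marketing tropes chosen for illustration; they are not our production list, and the structure and names are hypothetical:

```python
# Illustrative anti-pattern list, grouped by the categories described above.
# Phrases are examples of common Indonesian marketing tropes, not our real list.
BLOCKED_PHRASES = {
    "guaranteed_satisfaction": [
        "dijamin puas",                    # "guaranteed satisfied", no conditions
        "pasti cocok",                     # "definitely suits you"
    ],
    "unverifiable_efficacy": [
        "terbukti ampuh",                  # "proven effective", with no evidence
        "100% alami tanpa efek samping",   # "100% natural, no side effects"
    ],
    "artificial_price_urgency": [
        "harga naik besok",                # "price goes up tomorrow"
        "promo berakhir hari ini",         # "promo ends today"
    ],
    "stock_scarcity_pressure": [
        "stok tinggal sedikit",            # "only a little stock left"
        "buruan sebelum kehabisan",        # "hurry before it runs out"
    ],
}

def contains_blocked_phrase(caption: str) -> str | None:
    """Return the first blocked phrase found in the caption, else None."""
    lowered = caption.lower()
    for phrases in BLOCKED_PHRASES.values():
        for phrase in phrases:
            if phrase in lowered:
                return phrase
    return None
```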

We discovered most of them not through automated testing but through a critique cycle: a systematic pass through the pipeline looking for anything that would make a skeptical Indonesian consumer distrust the brand. The gaps were humbling. We added eighteen new phrases.

Warn-Only Was the Wrong Default

Here's where the problem compounded: even for the phrases we were already detecting, our system was configured to warn rather than block and retry. If a generated caption contained a flagged pattern, the pipeline would log a warning and continue. The caption would ship.
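In simplified form, the warn-only path amounted to something like this. It's a sketch reusing the `contains_blocked_phrase` helper from above, with hypothetical names:

```python
import logging

logger = logging.getLogger("caption_quality")

def check_caption_warn_only(caption: str) -> str:
    """Warn-only behavior: log the violation, ship the caption anyway."""
    hit = contains_blocked_phrase(caption)
    if hit:
        # The warning is accurate, but nothing downstream reacts to it.
        logger.warning("anti-pattern detected: %r", hit)
    return caption  # ships regardless
```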

The reasoning behind warn-only made sense at the time. We didn't want the pipeline to become over-zealous and reject good captions on false positives. Better to observe the rate of violations, tune from data, and upgrade to blocking later.

The problem is that "later" has a way of not arriving. Warn-only logs accumulate, and without a human actively monitoring them, they become background noise. Captions with anti-patterns kept shipping. The warnings were accurate; they just weren't connected to any consequence.

Retry on Violation

The fix was to treat anti-pattern detection as a real quality failure, not an advisory signal. When a caption contains a blocked phrase, the pipeline now regenerates rather than proceeding. The regeneration prompt is more explicit about what to avoid, and it draws from the full expanded list.

We also extracted a shared regeneration helper that handles both caption variants. Previously, the A variant had retry logic and the B variant didn't, which meant variant B was systematically under-constrained. The duplication was subtle enough to survive several reviews without anyone noticing. Unifying them eliminated the asymmetry.
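Here's a sketch of what a unified retry helper can look like. The names are illustrative rather than our real code, and `generate_caption` is a stand-in for whatever model call the pipeline actually makes:

```python
class CaptionQualityError(Exception):
    """Raised when a caption can't be regenerated cleanly."""

def generate_caption(variant: str, prompt: str) -> str:
    """Stand-in for the pipeline's actual model call."""
    ...

def regenerate_until_clean(variant: str, brief: str, max_attempts: int = 3) -> str:
    """Shared retry helper covering both the A and B caption variants."""
    avoid_note = ""
    for _ in range(max_attempts):
        caption = generate_caption(variant, brief + avoid_note)
        hit = contains_blocked_phrase(caption)
        if hit is None:
            return caption
        # Regenerate with an explicit instruction, drawing on the full
        # expanded list rather than just the phrase that fired.
        all_phrases = [p for group in BLOCKED_PHRASES.values() for p in group]
        avoid_note = (
            f"\n\nDo not use '{hit}' or similar manipulative patterns "
            f"such as: {', '.join(all_phrases)}."
        )
    # Escalate instead of shipping a flagged caption.
    raise CaptionQualityError(
        f"variant {variant}: still flagged after {max_attempts} attempts"
    )
```

Both variants now call the same helper, so neither can silently drift out of sync with the anti-pattern list again.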

Why Indonesian Guardrails Are Different

Building this list is not a one-time exercise. Indonesian consumer culture, regulatory expectations around advertising claims, and the social norms of WhatsApp-first commerce are all evolving. What reads as acceptable today may read as manipulative next year as consumer literacy grows.

We treat the anti-pattern list as a living document, not a configuration value. It grows from critique cycles, from onboarded businesses flagging output they're uncomfortable with, and from monitoring what phrases correlate with lower engagement or negative responses.

The goal isn't to make captions bland; it's to make them trustworthy. In a market where distrust of online sellers is high and word-of-mouth reputation matters enormously, captions that avoid manipulation aren't just ethically better. They convert better too.

What This Taught Us About Guardrails in General

Any content guardrail that only warns is a guardrail in name only. The signal has to be connected to a consequence (regeneration, escalation, or human review) or it doesn't change the output. Logging violations you never act on is worse than not logging them, because it creates a false sense that the problem is being monitored.

For teams building AI content pipelines: audit your warn-only rules regularly. Ask which ones have been firing without triggering any response. Those are your hidden gaps.
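Assuming violation events are recorded with the rule that fired and whether any action followed, that audit can be as simple as the sketch below. The event shape here is hypothetical:

```python
from collections import Counter

def find_silent_rules(events: list[dict]) -> list[str]:
    """Return rules that fired but never triggered any response.

    Each event is assumed to look like {"rule": str, "acted_on": bool}.
    """
    fired = Counter(e["rule"] for e in events)
    acted = Counter(e["rule"] for e in events if e["acted_on"])
    return [rule for rule in fired if acted[rule] == 0]
```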