From Static Scenes to AI-Directed Content

Jiwa AI Team

The Same Pose in Every Photo

Open any small business Instagram account that uses AI-generated content and you will spot the pattern within three posts. The influencer is holding the product. The background is a generic cafe. The camera angle is straight-on. The text overlay cuts off mid-sentence. Swipe to the next post and it is the same scene with a different caption fragment pasted on top.

This was our problem too. We had two or three hardcoded scene templates per industry: a coffee shop background for F&B, a studio backdrop for beauty, a street market for fashion. Every post for a given business drew from the same shallow pool. The content was technically correct but visually monotonous. Nobody engages with a feed that looks like the same photo copied six times.

Three Problems, One Root Cause

The repetition came from the system treating every product as a single monolithic object. A skincare set with a cleanser, toner, and moisturizer was just "skincare product." A nasi goreng plate with rice, fried egg, sambal, and kerupuk was just "food item." When the AI only sees one thing, it can only generate one kind of interaction: someone holding that one thing.

The text overlay had its own version of this problem. We were truncating overlay text to six or eight words through a sanitization function, which meant carefully written phrases got cut mid-thought. "Transform your morning routine with" is not a sentence. Combined with the repetitive scenes, the result was content that felt both generic and broken.

And underneath the images themselves, AI text generation was leaking through. Despite prompt instructions, image models would occasionally render words directly into the scene: blurry, misspelled text baked into the pixels where our clean Sharp-based overlay should go.

Decomposing Products Into Their Parts

The first change was teaching the system to see products the way a photographer would. Before generating any content, we now send product images through Claude Haiku Vision with a specific task: identify the individual components. A bakery box becomes croissants, macarons, and a branded ribbon. A phone accessory kit becomes the case, the screen protector, and the cleaning cloth.

This decomposition step costs fractions of a cent per product; Haiku Vision is the lightweight model we use for structured analysis tasks. But it unlocks a fundamental shift in how scenes are planned. Instead of six posts all showing "someone with a bakery box," the system can now generate one post featuring a close-up of a croissant being pulled apart, another with macarons arranged on a marble surface, another with the branded ribbon being untied. Each post highlights a different component, and the influencer's interaction with it changes accordingly.
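As a rough sketch of the decomposition step (the function names and JSON shape here are illustrative, not our actual code), the vision model is instructed to reply with a bare JSON array of components, and the reply is parsed defensively:

```typescript
// Hypothetical shape of the component list we ask the vision model to return.
interface ProductComponent {
  name: string; // e.g. "croissant"
  role: string; // e.g. "hero item" or "accent"
}

// Parse the model's reply into components, falling back to treating the
// whole product as a single component when the JSON is missing or malformed.
function parseComponents(reply: string, productName: string): ProductComponent[] {
  try {
    const parsed = JSON.parse(reply);
    if (Array.isArray(parsed) && parsed.length > 0) {
      return parsed
        .filter((c): c is ProductComponent => typeof c?.name === "string")
        .map((c) => ({ name: c.name, role: c.role ?? "component" }));
    }
  } catch {
    // Malformed JSON: fall through to the single-component fallback.
  }
  return [{ name: productName, role: "whole product" }];
}
```

The fallback matters: a model that fails to return valid JSON should degrade to the old single-object behavior, not break the pipeline.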

The rotation is deliberate. The system tracks which components have been featured and cycles through them, ensuring variety across the content calendar without repeating the same focal element in consecutive posts.
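A minimal version of that rotation (names illustrative) just picks the component that was featured longest ago, or never:

```typescript
// Pick the least-recently-featured component so consecutive posts never
// repeat the same focal element while every component still cycles through.
// lastFeatured maps component name -> index of the post that last used it.
function nextComponent(
  components: string[],
  lastFeatured: Map<string, number>,
): string {
  let best = components[0];
  let bestIndex = Infinity;
  for (const c of components) {
    const idx = lastFeatured.get(c) ?? -1; // -1 = never featured
    if (idx < bestIndex) {
      bestIndex = idx;
      best = c;
    }
  }
  return best;
}
```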

AI Scene Planning Replaces Static Templates

With product components identified, we needed scenes that matched them. The old approach of picking from two or three hardcoded templates could not keep up with the variety the decomposition step now demanded. A croissant close-up needs different lighting than a ribbon-untying moment. A toner application needs a different camera angle than a moisturizer flat-lay.

We replaced the static templates with Claude-generated scene plans. For each post, the AI receives the product component, the brand's visual identity, and the content calendar context, then outputs a complete scene specification: background setting, influencer pose, camera angle, and lighting direction. The planning model has enough context to avoid repeating combinations across posts and to match the scene mood to the caption intent.

This is not the image generation step; it is a lightweight text call that produces structured scene metadata. The actual image generation happens downstream, using these specifications as detailed prompt parameters. The cost of planning is negligible compared to the cost of generating the image itself, but the impact on visual variety is substantial.
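In spirit, the planner's output is just structured metadata that gets folded into the downstream image prompt. A hedged sketch (the field names are illustrative; the real spec carries more context):

```typescript
// Illustrative shape of a planned scene.
interface ScenePlan {
  component: string;   // which product component this post features
  background: string;  // e.g. "morning bathroom, soft window light"
  pose: string;        // how the influencer interacts with the component
  cameraAngle: string; // e.g. "close-up, slightly above eye level"
  lighting: string;    // e.g. "golden-hour backlight"
}

// Fold the structured plan into a single prompt string for the image model.
function scenePrompt(plan: ScenePlan): string {
  return [
    `Featuring: ${plan.component}`,
    `Setting: ${plan.background}`,
    `Pose: ${plan.pose}`,
    `Camera: ${plan.cameraAngle}`,
    `Lighting: ${plan.lighting}`,
  ].join(". ");
}
```

Keeping the plan structured, rather than free text, is what lets the system diff new plans against recent posts and reject repeated combinations.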

A beauty brand that previously got six studio-lit product-holding shots now gets a morning bathroom scene with soft window light, an outdoor golden-hour application moment, a close-up vanity mirror reflection, and three other distinct setups. Same brand, same product line, six different visual stories.

Letting Text Flow

The overlay text fix was the simplest change and arguably the most overdue. The sanitizeOverlayText function had a word-count ceiling that truncated any text beyond six to eight words. The original intent was to prevent text from overflowing the image, but we already had a wrapText function that handles line breaking and text fitting. The truncation was redundant protection that created a worse problem than it solved.

We removed the word-count limit entirely. Text now flows to its natural length, with the existing wrap function handling line breaks across the available space. On the prompt side, we updated Claude's instructions to generate complete phrases rather than fragments: no more sentences that trail off with an implied ellipsis.
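We don't reproduce the actual wrapText helper here, but a greedy word wrap in the same spirit shows why truncation was redundant: breaking on word boundaries already keeps every line within the available width without dropping a single word.

```typescript
// Greedy word wrap: break on word boundaries so each line fits within
// maxChars, never truncating. A single word longer than maxChars gets
// its own line rather than being cut.
function wrapText(text: string, maxChars: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current === "" ? word : `${current} ${word}`;
    if (candidate.length <= maxChars || current === "") {
      current = candidate;
    } else {
      lines.push(current);
      current = word;
    }
  }
  if (current !== "") lines.push(current);
  return lines;
}
```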

The difference is immediately visible. "Transform your morning routine with our 3-step system" reads as a complete thought. The overlay feels intentional rather than accidentally cropped.

Guaranteeing Text-Free Base Images

The last piece was eliminating AI-generated text from base images entirely. Image generation models have a persistent tendency to render text into scenes (storefront signs, product labels, floating captions) even when prompted not to. The text is almost always illegible or misspelled, and it conflicts with our Sharp-based overlay system that adds clean, properly rendered typography afterward.

We strengthened the anti-text enforcement across every image generation function in the pipeline. Every prompt that touches Flux or any other image model now carries explicit negative instructions against text rendering, reinforced at multiple points in the prompt structure. The base image comes out clean, and all text arrives through the overlay step where we control font, size, placement, and readability.
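An illustrative version of that reinforcement (the exact wording and structure are assumptions, not our production prompts) states the constraint up front, restates it after the scene description, and repeats it in the negative prompt, since image models often ignore a single mention:

```typescript
// Illustrative anti-text prompt assembly. The constraint appears at
// multiple points because image models frequently ignore one mention.
const NO_TEXT =
  "no text, no letters, no words, no captions, no signage, no watermarks";

function buildImagePrompt(
  sceneDescription: string,
): { prompt: string; negativePrompt: string } {
  const prompt = [
    `A photorealistic scene with absolutely ${NO_TEXT}.`,
    sceneDescription,
    `Important: the image must contain ${NO_TEXT}.`,
  ].join(" ");
  return {
    prompt,
    negativePrompt: `text, letters, words, typography, ${NO_TEXT}`,
  };
}
```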

This sounds like a minor prompt engineering tweak, but it eliminated a class of quality failures that previously required manual review to catch. A misspelled word baked into an image cannot be fixed without regenerating the entire image. A clean base image with a separate text overlay can always be corrected by re-rendering just the overlay.

The Compound Effect of Variety

Each of these changes is individually small. Product decomposition is one additional Haiku Vision call. Scene planning is one additional text generation call. The text overlay fix is removing code. The anti-text enforcement is prompt adjustments.

But together they transform the output. A content calendar that previously showed the same scene six times with truncated text fragments now shows six distinct visual stories, each featuring a different product component, in a unique setting, with complete overlay text rendered cleanly on a text-free base image.

The cost increase per onboarding is under five cents: one Haiku Vision call for decomposition plus a lightweight planning call per post. The quality increase is the kind that changes whether a business owner approves their content calendar on first review or asks for revisions. And every avoided revision is time saved for them and compute saved for us.

Variety is not a luxury feature. For small businesses posting daily on Instagram, a repetitive feed signals low effort to their audience. AI-generated content needs to clear a higher bar than "technically correct" โ€” it needs to look like someone thought about each post individually. Now the system actually does.