What a Single Jiwa AI Post Actually Costs

Jiwa AI Team

The Number First

Each post Jiwa AI generates (including the brand analysis that powers it, the influencer matching, the caption, the image, and the WhatsApp delivery) costs approximately $0.12.

Across a typical seven-post onboarding, the total AI spend is under ninety cents.

That number is not a guess or a best-case scenario. It comes directly from the cost log that every onboarding writes to the database: service, model, operation, tokens consumed, pixels generated, duration. Here is how those numbers break down.
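As a minimal sketch, one such log row might look like the record below. The field names follow the list above; the `CostLogEntry` class, the `cost_usd` field, the model labels, and the token and duration values are illustrative assumptions, not the actual schema or real measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostLogEntry:
    """One row per external call (hypothetical shape, not the real schema)."""
    service: str            # e.g. "llm", "image", "whatsapp"
    model: str              # model identifier (placeholder labels below)
    operation: str          # e.g. "brand_analysis", "generate"
    tokens: Optional[int]   # None for calls that are not token-billed
    pixels: Optional[int]   # None for text-only calls
    duration_ms: int
    cost_usd: float         # assumed field: cost computed when the row is written


def total_cost(entries: list[CostLogEntry]) -> float:
    """Sum logged cost across an onboarding, rounded to a tenth of a cent."""
    return round(sum(e.cost_usd for e in entries), 3)


# Placeholder rows: only the two cost figures come from this post.
entries = [
    CostLogEntry("llm", "premium-model", "brand_analysis", 12_000, None, 30_000, 0.040),
    CostLogEntry("image", "multi-ref-model", "generate", None, 1024 * 1024, 20_000, 0.072),
]
print(total_cost(entries))  # 0.112
```

Keeping one row per call is what makes a per-post figure recoverable later: the total is a sum over rows rather than an estimate.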

The Two Cost Centres

Every onboarding has two distinct spending categories: intelligence (text AI calls for brand analysis, captions, quality scoring) and image generation (the multi-step visual pipeline). They have very different cost profiles.

Intelligence is cheap. Image generation is where money actually goes.

Intelligence: ~$0.065 per Onboarding

The intelligence pipeline runs across Waves 2 through 8 and involves roughly twelve AI calls total.

Wave 2 is the only heavyweight call: a full brand analysis that reads the scraped website content, the Instagram feed if available, and produces the brand profile that every downstream step depends on. This is the one place where we use a premium-tier model, because the quality of this single call determines the quality of everything else. Typical cost: around $0.04.

Waves 3 through 6 run entirely on a faster, cheaper model. Theme extraction, product positioning analysis, influencer matching, mood board analysis, product visual analysis: each is a focused, structured task that does not need the reasoning depth of the premium model. These calls run in parallel where possible. Total cost for all of them combined: around $0.015.

Wave 8 generates all seven post captions in a single batched call rather than seven separate requests. One call, one context payload, seven outputs. This is the most direct cost-saving architectural decision we made: batching eliminates repeated transmission of brand context and cuts latency by roughly 6x compared to sequential calls. Cost: around $0.003.
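A rough sketch of why batching saves input tokens, assuming the brand context is re-sent with every sequential request. The token sizes are placeholder values chosen for illustration, not measured figures, and `input_tokens` is a hypothetical helper.

```python
def input_tokens(context_tokens: int, per_post_tokens: int, posts: int, batched: bool) -> int:
    """Input tokens sent to produce `posts` captions.

    Sequential: the brand context is transmitted once per request.
    Batched: the brand context is transmitted once, followed by every post spec.
    """
    if batched:
        return context_tokens + per_post_tokens * posts
    return (context_tokens + per_post_tokens) * posts


# Placeholder sizes (assumptions): 2,000-token brand context, 100-token post spec.
sequential = input_tokens(2_000, 100, posts=7, batched=False)
batched = input_tokens(2_000, 100, posts=7, batched=True)
print(sequential, batched)  # 14700 2700
```

The larger the shared context is relative to each post spec, the bigger the saving from sending it once.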

Vision quality checks, which assess whether each generated image meets brand standards and whether the product is actually visible, run in parallel with image generation and add roughly $0.001 per image.

Image Generation: ~$0.113 per Image

This is where the money is. Every new image goes through two inference steps.

Step 1 is a multi-reference generation pass. The pipeline assembles up to ten reference images in priority order: the influencer photograph first (if present), then product references, then mood board images sorted by engagement. A detailed prompt, constructed from the brand DNA, the content calendar spec, and the visual format directive, is sent alongside these references. The model synthesises them into a new scene.
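The priority ordering described above can be sketched as follows. The `MoodImage` type, the `assemble_references` function, and the choice to rank higher engagement first are assumptions made for illustration; only the ordering rule and the cap of ten come from the description.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REFERENCES = 10  # cap stated in the post


@dataclass
class MoodImage:
    url: str
    engagement: float  # hypothetical engagement score


def assemble_references(
    influencer_photo: Optional[str],
    product_refs: list[str],
    mood_images: list[MoodImage],
) -> list[str]:
    """Return up to MAX_REFERENCES image URLs in priority order."""
    ordered: list[str] = []
    if influencer_photo is not None:
        ordered.append(influencer_photo)       # highest priority, if present
    ordered.extend(product_refs)               # then product references
    # Assumption: mood board images with higher engagement rank first.
    ranked = sorted(mood_images, key=lambda m: m.engagement, reverse=True)
    ordered.extend(m.url for m in ranked)
    return ordered[:MAX_REFERENCES]            # lowest-priority items drop off


refs = assemble_references(
    "influencer.jpg",
    ["product-1.jpg", "product-2.jpg"],
    [MoodImage("mood-a.jpg", 0.2), MoodImage("mood-b.jpg", 0.9)],
)
print(refs)
```

Truncating after ordering means that when there are more than ten candidates, the least-engaging mood board images are the ones dropped.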

At 1024×1024 pixels, the cost for this step is approximately $0.072. Pricing scales with megapixels: $0.07 for the first megapixel, $0.03 per additional megapixel. A 1024×1024 image is 1,048,576 pixels, just barely above one megapixel, so the cost stays close to the base rate.
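A minimal sketch of that pricing rule, assuming the overage above one megapixel is billed pro rata and that a megapixel means 1,000,000 pixels. Both are assumptions: the provider's exact rounding is not stated here, and `step1_cost` is a hypothetical helper.

```python
BASE_RATE_USD = 0.07               # first megapixel
EXTRA_RATE_USD = 0.03              # each additional megapixel
PIXELS_PER_MEGAPIXEL = 1_000_000   # assumption: decimal megapixels


def step1_cost(width: int, height: int) -> float:
    """Estimated Step 1 cost, assuming the overage is billed pro rata."""
    megapixels = width * height / PIXELS_PER_MEGAPIXEL
    extra = max(0.0, megapixels - 1.0)
    return BASE_RATE_USD + EXTRA_RATE_USD * extra


print(round(step1_cost(1024, 1024), 4))  # 0.0715
```

Under these assumptions the estimate lands slightly below the logged figure of about $0.072; the gap may come from provider-side rounding, which this sketch does not model.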

Step 2 is a naturalisation pass. The output of Step 1 is processed by a second model with a single instruction: enhance photorealism while changing nothing. Add skin pores, realistic hair strand detail, subtle lighting imperfections. Remove waxy or CGI-smooth surfaces. Do not touch faces, products, clothing, or background. This pass costs a flat $0.04 regardless of image size.

The two-step architecture exists because the best multi-reference generation models are trained for composition fidelity (getting all the references correctly into the scene), while the best naturalisation models are trained to make the result look like a real photograph. Asking one model to do both produces results that look good in reference adherence but fall short on photorealism. Using two models in sequence gets both.

Combined per-image cost: $0.072 + $0.040 + $0.001 = $0.113

For a seven-post calendar, that is $0.79 in image generation alone.

WhatsApp: $0.007

Seven messages at $0.001 per send. Not worth optimising.

Total: $0.86

| Line item | Cost |
| --- | --- |
| Premium brand analysis (×1) | $0.040 |
| Fast LLM calls: enrichment, captions, quality (×11) | $0.025 |
| Image generation Step 1 (×7) | $0.504 |
| Image generation Step 2 naturalise (×7) | $0.280 |
| Vision quality checks (×7) | $0.007 |
| WhatsApp delivery (×7) | $0.007 |
| Total | $0.863 |
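The arithmetic behind the table can be checked with a short script. It uses only the figures quoted in this post; `Decimal` is used so that sums of tenths of a cent stay exact, and the variable names are illustrative.

```python
from decimal import Decimal

POSTS = 7

# Totals per line item in USD, built from the per-unit figures quoted above.
line_items = {
    "Premium brand analysis": Decimal("0.040"),
    "Fast LLM calls": Decimal("0.025"),
    "Image generation Step 1": POSTS * Decimal("0.072"),
    "Image generation Step 2": POSTS * Decimal("0.040"),
    "Vision quality checks": POSTS * Decimal("0.001"),
    "WhatsApp delivery": POSTS * Decimal("0.001"),
}

total = sum(line_items.values(), Decimal("0"))
per_post = total / POSTS
image_share = (
    line_items["Image generation Step 1"] + line_items["Image generation Step 2"]
) / total

print(total)                          # 0.863
print(round(per_post, 3))             # 0.123
print(f"{float(image_share):.0%}")    # 91%
```

The per-post figure of roughly $0.12 is this total divided by seven, and the two image steps together make up about nine tenths of the spend.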

Time: 3–4 Minutes

Cost is one axis. Latency is the other.

The intelligence waves (1 through 7) take roughly 50–90 seconds in total. Most of that is Wave 2, the Sonnet brand analysis, with the parallel enrichment waves adding another 20–30 seconds.

Image generation is the bottleneck. Each image takes 25–40 seconds: Step 1 averages around 20 seconds, Step 2 around 15 seconds. We run three images concurrently (pLimit(3)) to avoid overwhelming the API. Seven images across three batches of parallel work take roughly 2–2.5 minutes.
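The concurrency cap can be illustrated with a Python analogue of pLimit(3) built on `asyncio.Semaphore`. This is a sketch, not the production code: `generate_image` is a stand-in that sleeps instead of calling a real API, and all function names are hypothetical.

```python
import asyncio
import math

CONCURRENCY = 3  # mirrors the pLimit(3) setting


async def generate_image(post_id: int, sem: asyncio.Semaphore, seconds: float) -> int:
    """Stand-in for Step 1 + Step 2; sleeps instead of calling a real API."""
    async with sem:  # at most CONCURRENCY coroutines run this body at once
        await asyncio.sleep(seconds)
        return post_id


async def generate_all(post_ids: list[int], seconds: float) -> list[int]:
    sem = asyncio.Semaphore(CONCURRENCY)
    tasks = (generate_image(p, sem, seconds) for p in post_ids)
    return await asyncio.gather(*tasks)  # results keep the input order


def estimated_rounds(images: int, concurrency: int = CONCURRENCY) -> int:
    """Sequential rounds needed when every image takes about the same time."""
    return math.ceil(images / concurrency)


results = asyncio.run(generate_all(list(range(1, 8)), seconds=0.01))
print(results, estimated_rounds(7))  # [1, 2, 3, 4, 5, 6, 7] 3
```

With seven images and a cap of three, the work falls into three rounds. At 25–40 seconds per image that is about 75–120 seconds of pure generation time, so it is a lower bound on the wall-clock figure quoted above.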

Wave 9, which saves posts to the database and sends WhatsApp previews, adds another 10 seconds.

End-to-end: 3–4 minutes from URL submission to WhatsApp delivery.

What Revisions Cost

When a user regenerates or revises a post, the pipeline routes differently. Revisions skip Step 1 entirely and run only the instruction-edit naturalisation model against the previous output. One call, one step.

Revision cost: $0.04, roughly 35% of the original generation cost.

This makes the economics of "generate, review, iterate" viable. A user who requests three revisions before approving a post pays $0.04 + $0.04 + $0.04 = $0.12 in revision costs on top of the original $0.113. Total for that post: $0.233. Still under twenty-five cents.
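The revision arithmetic generalises to any number of revisions. A small sketch, assuming each revision costs exactly the flat Step 2 price and does not re-run the quality check; `post_cost` is a hypothetical helper.

```python
from decimal import Decimal

ORIGINAL_IMAGE_COST = Decimal("0.113")  # Step 1 + Step 2 + quality check
REVISION_COST = Decimal("0.040")        # Step 2 only (assumed flat per revision)


def post_cost(revisions: int) -> Decimal:
    """Total image spend for one post after a number of revisions."""
    if revisions < 0:
        raise ValueError("revisions must be non-negative")
    return ORIGINAL_IMAGE_COST + REVISION_COST * revisions


print(post_cost(3))  # 0.233
```

Because the cost grows linearly at $0.04 per revision, a post would need more than three revisions before it crossed twenty-five cents.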

Why the Pipeline Is Two Steps Instead of One

The obvious question is why we pay for two API calls per image instead of finding one model that handles everything.

The answer is precision. Multi-reference generation models are optimised to synthesise content from multiple inputs: they excel at composition and reference adherence. Naturalisation models are optimised for a single task: making an image look less like a render and more like a photograph. These are different capabilities that currently live in different architectures.

The cost of running both ($0.072 + $0.040) is the cost of not having to choose between an image that correctly shows the product and influencer and an image that looks like it was taken with a camera. Both criteria matter for content that gets posted publicly under a brand.

The Ceiling

At current pricing, the hard floor for a seven-post onboarding is around $0.75: even if intelligence became free, image generation accounts for 90% of the cost. The ceiling under our architecture is around $1.05 on a bad day with retries.

That band, seventy-five cents to a dollar, is the economic reality of producing seven pieces of original, brand-consistent, photorealistic influencer content. Set against the cost of a single human-shot lifestyle photo, it is not a close contest.