Scraping Without Breaking: SSRF Protection and the SPA Fallback
The Most Dangerous Input Field in Onboarding
When a new business signs up for Jiwa AI, the first thing we ask for is a website URL. It sounds harmless: we use it to understand the brand's visual identity, tone, and product catalog. But from a security perspective, that URL field is one of the most sensitive inputs in the entire product.
Here's the problem: when our server fetches the page at that URL, it becomes a proxy. And a proxy that will fetch anything a user points it at is a textbook Server-Side Request Forgery (SSRF) vulnerability. Without the right defenses, a bad actor could submit a URL that resolves to an internal service (a database admin panel, a cloud metadata endpoint, a private API) and our scraper would happily retrieve it for them.
What SSRF Actually Looks Like in Practice
SSRF attacks are not theoretical edge cases. They are a standard category of vulnerability in any system that makes server-side HTTP requests based on user input. Cloud environments are especially exposed: the metadata services that AWS, Google Cloud, and Azure expose to their instances often sit at well-known IP addresses that are reachable from inside the network but not from the public internet.
If our scraper didn't validate submitted URLs, a user could submit something that looks like a website address but resolves to one of those internal endpoints. The server would fetch it, extract the content, and potentially surface credentials, tokens, or internal configuration that was never meant to leave the network.
The fix is URL validation โ strict, applied before any outbound request is made.
The Blocklist: What We Refuse to Fetch
Before our scraper makes a single network call, every submitted URL passes through an allowlist check. We reject anything that doesn't use HTTP or HTTPS: no file references, no custom protocol schemes, nothing that shouldn't appear in a public business website.
We then check the hostname against a set of blocked categories. Explicit local names like localhost and the IPv4 unspecified address are blocked outright. So are IPv6 loopback addresses. Internal domain suffixes, the kind used by corporate networks and service meshes, are also rejected before DNS resolution happens.
For IP addresses, we check against the full set of private and reserved ranges: the loopback range, all three RFC 1918 private blocks, and the link-local address range that cloud metadata services frequently use. If the submitted hostname resolves to any of these, the request is rejected before it leaves our server.
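The checks above can be sketched roughly as follows. This is an illustrative outline, not Jiwa AI's actual code: the constant names and the exact set of blocked suffixes are assumptions, and Python's `ipaddress` properties stand in for the explicit RFC 1918 and link-local range checks.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}                       # scheme allowlist
BLOCKED_HOSTNAMES = {"localhost", "0.0.0.0"}              # explicit local names
BLOCKED_SUFFIXES = (".local", ".internal", ".svc")        # illustrative internal suffixes

def is_private_address(host: str) -> bool:
    """True if the host resolves to a loopback, private, link-local, or reserved IP."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable: reject rather than guess
    for info in infos:
        addr = info[4][0].split("%")[0]  # strip any IPv6 zone id
        ip = ipaddress.ip_address(addr)
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved:
            return True
    return False

def validate_url(url: str) -> bool:
    """Run every check before any outbound request is made."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = (parts.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES or host.endswith(BLOCKED_SUFFIXES):
        return False
    return not is_private_address(host)
```

Note that rejection happens in cheap-to-expensive order: scheme and hostname checks are pure string operations, and DNS resolution only runs on URLs that survive them.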
Why We Check the URL, Not Just the Response
A subtler point: we validate the URL itself, not just the final destination after any redirects. A crafty SSRF attempt might submit a seemingly legitimate public URL that immediately redirects to an internal address. Our validation runs on the submitted hostname at the point of input, and we do not follow redirect chains into private space.
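One way to implement that redirect policy is to disable automatic redirect following and re-run the same SSRF check on every hop before fetching it. The sketch below assumes an injected `fetch` function (one request, redirects disabled) and an `is_allowed` predicate standing in for the validation layer; both names are hypothetical.

```python
from typing import Callable, Optional, Tuple, Dict
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # illustrative limit

def fetch_with_checked_redirects(
    url: str,
    fetch: Callable[[str], Tuple[int, Dict[str, str], bytes]],
    is_allowed: Callable[[str], bool],
) -> Optional[bytes]:
    """Follow redirects manually, re-validating every hop against the SSRF rules."""
    for _ in range(MAX_REDIRECTS + 1):
        if not is_allowed(url):
            return None  # a hop pointed into private space: abort, no request made
        status, headers, body = fetch(url)
        if status in (301, 302, 303, 307, 308) and "location" in headers:
            url = urljoin(url, headers["location"])  # resolve relative Locations
            continue
        return body
    return None  # redirect chain too long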
This means we sometimes reject valid-looking URLs that happen to share naming patterns with internal resources. That's an acceptable trade-off. A false positive on onboarding is a minor inconvenience. A false negative on SSRF is a potential security incident.
The SPA Problem
With the security layer in place, we run the scraper against the validated URL. For most business websites, especially Indonesian SMEs running standard e-commerce storefronts or WordPress-style sites, this works well. We fetch the HTML, parse the content, extract color hints, collect product images, and build a brand profile.
But a meaningful segment of modern websites doesn't deliver its content in the initial HTML response at all. Single-page applications built with frameworks like React, Vue, or Next.js in client-rendered mode send back a nearly empty shell on the first request. All the actual content is loaded by JavaScript running in the browser, which our scraper, fetching raw HTTP, never executes.
When that happens, we end up with a sparse content result: a page title, maybe a meta description, and very little else. That's not enough to build a useful brand profile.
Enter the Jina Reader Fallback
For cases where the direct fetch returns sparse content, we route the request through Jina Reader, a third-party service that renders JavaScript-heavy pages and returns their content as clean, structured text. It handles the browser execution step that our primary scraper skips, effectively solving the SPA problem without us having to run a headless browser ourselves.
The fallback is triggered automatically when the primary scrape returns below a content threshold. From the user's perspective, onboarding still completes; they don't see a retry or an error. The system detects the sparse result, routes silently to Jina Reader, and continues building the brand profile with the richer output.
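A minimal sketch of that routing logic might look like this. The 200-character threshold and the helper names are assumptions; Jina Reader's public endpoint works by prefixing the target URL with `https://r.jina.ai/`, and `fetch_text` is an injected stand-in for whichever HTTP client does the actual request.

```python
from typing import Callable

MIN_CONTENT_CHARS = 200  # illustrative sparseness threshold

def jina_reader_url(url: str) -> str:
    """Route a URL through Jina Reader, which renders JS and returns clean text."""
    return f"https://r.jina.ai/{url}"

def needs_fallback(extracted_text: str) -> bool:
    """A near-empty SPA shell yields little text beyond a title and meta tags."""
    return len(extracted_text.strip()) < MIN_CONTENT_CHARS

def scrape(url: str, fetch_text: Callable[[str], str]) -> str:
    """Try the direct fetch first; fall back silently when content is sparse."""
    text = fetch_text(url)
    if needs_fallback(text):
        text = fetch_text(jina_reader_url(url))
    return text
```

Keeping the decision in a pure `needs_fallback` function makes the threshold easy to test and tune without touching any network code.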
This design reflects a principle we try to apply broadly: fail gracefully, not loudly. The right response to a limitation in one path is a well-designed fallback, not an error screen.
Reliability and Security as a Pair
It would be easy to write a scraper that's extremely secure, one that refuses most URLs and returns minimal data. It would also be useless. It would be equally easy to write a scraper that fetches anything and returns rich content, one that exposes the server to trivial SSRF attacks.
The goal is to hold both requirements at once. The URL validation layer makes sure we only reach the public internet. The Jina Reader fallback makes sure that when the public internet talks back in a format we weren't expecting, we still get something useful.
The scraper is the first real interaction between Jiwa AI and a new business. How well it handles that moment, how reliably it extracts brand signal and how safely it does so, shapes everything that follows. We intend to keep raising the bar on both.