Guide
Getting data from Cloudflare-protected public sites (the compliant way)
You open a public catalog or listings page in your browser and the data is right there. Your script fetches the same URL and gets back an empty page. The usual conclusion — "it is blocked by a CAPTCHA" — is often wrong. More often it is a technical fingerprint check that happens before the page renders. Here is what is really going on, and how to read those public pages the compliant way.
"Empty page" is usually not a CAPTCHA
When a Cloudflare-protected site returns an empty or minimal page to a script but a full page to your browser, the most common reason is a TLS fingerprint check. Before any content is rendered, the server inspects the technical signature of the connecting client: the shape of the TLS handshake (the offered cipher suites and extensions, and their order), the negotiated parameters, and, one layer up, the order of the HTTP request headers. A real browser presents one signature; a default scripting library presents a noticeably different one. If the signature does not look like an ordinary browser, the server short-circuits and serves nothing meaningful.
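To make "signature" concrete, here is a minimal sketch that lists the cipher suites a default Python TLS context is configured to offer, in priority order. The helper names offered_cipher_names and cipher_list_digest are hypothetical, and the digest is only an illustration that the list and its order form a stable, recognizable feature. It is not a JA3 hash or any format a server actually computes.

```python
import hashlib
import ssl
from typing import List, Optional


def offered_cipher_names(context: Optional[ssl.SSLContext] = None) -> List[str]:
    """Return the cipher suite names this client is configured to offer.

    The order is the context's priority order, which is one of the
    features a server can observe in the TLS ClientHello.
    """
    ctx = context or ssl.create_default_context()
    return [cipher["name"] for cipher in ctx.get_ciphers()]


def cipher_list_digest(names: List[str]) -> str:
    """Illustrative digest of an ordered cipher list (NOT a real JA3 hash).

    Changing the order or the contents of the list changes the digest,
    which is the intuition behind handshake fingerprinting.
    """
    joined = ",".join(names)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()[:16]


if __name__ == "__main__":
    names = offered_cipher_names()
    print(f"{len(names)} cipher suites offered, first few: {names[:3]}")
    print("illustrative digest:", cipher_list_digest(names))
```

Running this on two machines with different Python or OpenSSL builds will typically print different lists, which is part of why a scripting client rarely looks like a browser at this layer.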
The important part: this happens at the connection layer, before any human-facing challenge. There is no puzzle to solve and no "are you a robot" box. The page simply never rendered for that client. That is why people see "empty pages" and wrongly assume a CAPTCHA — the actual mechanism is the handshake not matching what a browser would send.
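Before assuming either explanation, it helps to look at what the script actually received. The sketch below classifies a response from its status code, headers, and body. It relies on the cf-mitigated: challenge response header that Cloudflare documents for challenge pages; the status-code list and the 500-character threshold are assumptions chosen for illustration, and diagnose_response is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Diagnosis:
    kind: str    # one of "challenge", "blocked", "empty", "ok"
    reason: str  # short human-readable explanation


def diagnose_response(status: int, headers: Mapping[str, str], body: str,
                      min_body_chars: int = 500) -> Diagnosis:
    """Heuristically classify a response fetched by a script."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # An explicit challenge page carries this header.
    if lowered.get("cf-mitigated", "").strip().lower() == "challenge":
        return Diagnosis("challenge", "cf-mitigated: challenge header present")

    # Assumption: these statuses usually mean the request was refused
    # or rate limited rather than served.
    if status in (403, 429, 503):
        return Diagnosis("blocked", f"HTTP {status} returned")

    # Assumption: a real catalog page is longer than min_body_chars.
    if len(body.strip()) < min_body_chars:
        return Diagnosis("empty", f"body shorter than {min_body_chars} characters")

    return Diagnosis("ok", "response looks like a full page")
```

A result of "empty" with a 200 status and no challenge header is the case this section describes: nothing was asked of the client, and the page simply was not served in full.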
How a properly fingerprinted session reads the public page
The compliant approach is straightforward: connect with a real, correctly fingerprinted browser session so the public page loads the same way it does for any visitor — and do it at a respectful rate.
- A genuine browser session. Use a real browser engine whose TLS handshake and headers match what the site already expects from everyday visitors, so the public page renders as intended.
- One visitor's pace. Read pages at a human, unhurried rate — spacing requests, honoring crawl-delay — rather than flooding the source. A respectful footprint keeps access stable and is simply the right thing to do.
- Public pages only. Read what any visitor can see without logging in. Nothing gated, nothing personal — just the public, factual fields you came for.
- Hidden API where one exists. Once the session renders correctly, the page may expose the internal JSON endpoint described in our guide "What is a hidden API", which is usually the cleanest, lightest way to read the data.
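The "one visitor's pace" point above can be sketched as a small pacing helper that enforces a minimum gap between requests. PolitePacer is a hypothetical name, and the interval is whatever the site's crawl-delay or your own policy dictates; the injectable clock and sleep functions exist only so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class PolitePacer:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if min_interval < 0:
            raise ValueError("min_interval must be non-negative")
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record the moment the caller is released to make its request.
        self._last = self._clock()
        return delay
```

Call pacer.wait() immediately before each page read. The first call returns at once, and later calls sleep only for whatever part of the interval has not already passed.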
We keep a demonstration of the fingerprinting concept in the open: see the tls-fingerprint-scraper-demo repository, which shows why a default client returns an empty page where a correctly fingerprinted session reads the public content normally.
Staying on the right side of the line
Reading public data reliably and doing it responsibly are the same job. Our rules do not bend:
- Read and respect the site's Terms of Service. If the terms forbid automated access, we do not proceed.
- Honor robots.txt and rate limits, always, by design.
- Access only public, factual, non-PII data. No login-gated content, no personal information.
- You operate and own the resulting feed; we build it to behave like a courteous visitor of the source.
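As a minimal sketch of the robots.txt rule above, Python's standard urllib.robotparser can decide whether a URL may be read and what delay to apply. The helper names load_rules and access_plan, the user agent string example-feed-bot, and the 5-second fallback delay are assumptions for illustration. Checking the Terms of Service remains a human step that code does not replace.

```python
from typing import Tuple
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def access_plan(parser: RobotFileParser, user_agent: str, url: str,
                default_delay: float = 5.0) -> Tuple[bool, float]:
    """Return (allowed, delay_seconds) for one URL.

    If the URL is disallowed, the answer is (False, 0.0) and the caller
    should not fetch it. Otherwise the site's Crawl-delay is used when
    present, falling back to default_delay (an assumed conservative value).
    """
    if not parser.can_fetch(user_agent, url):
        return False, 0.0
    delay = parser.crawl_delay(user_agent)
    return True, float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    sample = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"
    rules = load_rules(sample)
    print(access_plan(rules, "example-feed-bot", "https://example.com/catalog"))
    print(access_plan(rules, "example-feed-bot", "https://example.com/private/x"))
```

Treating a disallowed URL as a hard stop, rather than something to retry another way, keeps the check aligned with the rules listed above.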
The short version
Empty pages from a Cloudflare-protected public site are usually a TLS fingerprint check, not a CAPTCHA. A correctly fingerprinted browser session reads the public page like any normal visitor, at a respectful rate, within the site's terms. That is how you turn "it works in my browser but not in my script" into a feed you can rely on.