Does this scraper respect robots.txt?

Yes. The prompt requires the fetch layer to check robots.txt before any request and to send a descriptive User-Agent. The whole prompt is framed around being a polite citizen of the target site, since that discipline is what separates a sustainable scraper from a banned IP.
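
As a rough illustration of the robots.txt check, here is a minimal sketch using the standard library's `urllib.robotparser`. The User-Agent string, the sample rules, and the helper names `build_robots` and `allowed` are assumptions made for this example; the code the prompt produces may structure the check differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical descriptive User-Agent; replace with your own contact details.
USER_AGENT = "example-scraper/0.1 (+mailto:ops@example.com)"

# Sample rules, inlined so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
""".splitlines()


def build_robots(lines):
    """Build a parser from robots.txt lines that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def allowed(parser, url, user_agent=USER_AGENT):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


robots = build_robots(SAMPLE_ROBOTS)
print(allowed(robots, "https://example.com/products"))
print(allowed(robots, "https://example.com/admin/users"))
```

Against a live site you would call `set_url()` with the site's robots.txt address and then `read()` instead of passing inline lines, and call `allowed()` before every request.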

How does it handle rate limits and server errors?

The generated scraper limits requests to your configured `[rps]` and uses tenacity for exponential backoff with jitter on 429 and 5xx responses. The advice is to get the backoff working before scaling concurrency, or you risk hammering the site into blocking you.
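
To make the backoff logic concrete, the sketch below spells it out with the standard library only. It is a stand-in for what tenacity would handle in the generated code, not tenacity's API. The function names, the base delay of 1 second, and the 60 second cap are assumptions for illustration.

```python
import random
import time

# Status codes worth retrying: 429 (rate limited) and any 5xx server error.
RETRYABLE = {429} | set(range(500, 600))


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def fetch_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable that returns an HTTP status code.
    """
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status  # still retryable after the final attempt


# Demo with canned responses and a recorded (not real) sleep.
responses = iter([503, 429, 200])
delays = []
result = fetch_with_retries(lambda: next(responses), sleep=delays.append)
print(result, len(delays))
```

The jitter matters because many clients retrying on the same fixed schedule tend to hit the server in synchronized waves; randomizing each delay spreads them out.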

Can I stop and resume a long scrape?

Yes. The prompt requires the NDJSON output to be resumable, so a re-run skips URLs that were already fetched. A crash or manual stop therefore does not force a full re-fetch, which matters for large jobs against a single `[target_site]`.
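
One way the resume behaviour could work is sketched below: read the URLs already present in the output file, then append only new records. The `url` key, the helper names, and the stubbed `fake_fetch` are assumptions for this example rather than details from the prompt.

```python
import json
import os
import tempfile


def load_done_urls(path):
    """Collect the URLs already recorded in an NDJSON file, if it exists."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a half-written last line left by a crash
    return done


def append_record(path, record):
    """Append one JSON object per line so earlier results are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def run(urls, path, fetch):
    """Fetch only URLs not yet in the output; return how many were fetched."""
    done = load_done_urls(path)
    fetched = 0
    for url in urls:
        if url in done:
            continue
        append_record(path, {"url": url, **fetch(url)})
        done.add(url)
        fetched += 1
    return fetched


# Demo: the second run skips the two URLs written by the first.
out_path = os.path.join(tempfile.mkdtemp(), "out.ndjson")
urls = ["https://example.com/p/1", "https://example.com/p/2"]
fake_fetch = lambda url: {"title": "stub"}  # stands in for a real request
first_run = run(urls, out_path, fake_fetch)
second_run = run(urls + ["https://example.com/p/3"], out_path, fake_fetch)
print(first_run, second_run)
```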

What happens to rows that fail to parse?

Malformed rows are skipped with a logged warning rather than crashing the run, and selectolax parses the rest into your `[schema]`. A final run summary reports how many URLs were fetched, skipped, and failed, plus the elapsed time.
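
The skip-and-warn pattern might look like the sketch below. It assumes the rows have already been extracted into dictionaries (the selectolax step is omitted) and uses the title, price, sku, and stock fields as the schema. The function names and the summary keys are hypothetical; fetch-level counters such as fetched and failed would be tracked the same way.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

REQUIRED_FIELDS = ("title", "price", "sku", "stock")


def parse_row(raw):
    """Validate and convert one extracted row; raise ValueError if unusable."""
    missing = [f for f in REQUIRED_FIELDS if raw.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "title": raw["title"].strip(),
        "price": float(raw["price"]),
        "sku": raw["sku"],
        "stock": int(raw["stock"]),
    }


def parse_all(raw_rows):
    """Parse every row, logging and counting the ones that fail."""
    records, skipped = [], 0
    for index, raw in enumerate(raw_rows):
        try:
            records.append(parse_row(raw))
        except (ValueError, TypeError, AttributeError) as exc:
            skipped += 1
            log.warning("skipping malformed row %d: %s", index, exc)
    return records, skipped


started = time.monotonic()
rows = [
    {"title": " Widget ", "price": "19.99", "sku": "W-1", "stock": "4"},
    {"title": "Gadget", "price": "n/a", "sku": "G-1", "stock": "2"},  # bad price
    {"title": "Gizmo", "sku": "Z-1", "stock": "1"},  # price missing
]
records, skipped = parse_all(rows)
summary = {
    "parsed": len(records),
    "skipped": skipped,
    "elapsed_s": round(time.monotonic() - started, 3),
}
log.info("run summary: %s", summary)
```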

Claude/ChatGPT Prompt to Write a Polite Python Web Scraper with Retries | AI Prompt Library

What this prompt does

This prompt casts the model as a senior Python engineer who builds data pipelines, and it asks for working code that runs as-is and behaves as a polite citizen of the target site. It specifies six deliverables: (1) a fetch layer using httpx with a descriptive User-Agent and robots.txt checking before any request; (2) rate limiting plus exponential backoff with jitter on 429 and 5xx via tenacity; (3) parsing with selectolax that skips malformed rows with a logged warning; (4) resumable NDJSON output that skips already-fetched URLs on re-run; (5) structured logging with a final run summary; and (6) a small argparse CLI. The structure works because it front-loads the boring discipline that separates a useful scraper from a banned IP.

Three variables drive it. [target_site] is the site to scrape (by default, a products page). [rps] caps requests per second and defaults to 1; the rate limiter enforces it. [schema] lists the output fields (title, price, sku, stock) that selectolax parses into. The backoff in deliverable two is the load-bearing part: getting it working before scaling up concurrency is what keeps you from hammering the site into blocking you. Robots.txt checking and an honest User-Agent are not afterthoughts here; they are required before the first request goes out.
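
To show how a requests-per-second cap can be enforced, here is a minimal sketch of a limiter that spaces calls at least 1/rps seconds apart. The `RateLimiter` class and the injectable clock are assumptions for illustration; the generated code may use a different mechanism.

```python
import time


class RateLimiter:
    """Space successive calls at least 1/rps seconds apart."""

    def __init__(self, rps, clock=time.monotonic, sleep=time.sleep):
        if rps <= 0:
            raise ValueError("rps must be positive")
        self.min_interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)


class FakeClock:
    """Test double so the demo runs instantly instead of really sleeping."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
limiter = RateLimiter(rps=2, clock=clock.now, sleep=clock.sleep)
waits = [limiter.wait() for _ in range(3)]
print(waits)
```

In a real run you would construct `RateLimiter(rps)` with the default clock and call `wait()` immediately before each request.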

When to use it

You need a web scraper that respects robots.txt and won't get your IP banned.
Rate limiting and exponential backoff with jitter on 429 and 5xx are requirements.
You want resumable output so a re-run skips already-fetched URLs.
Malformed rows should be skipped with a logged warning rather than crashing the run.
You need a final run summary (fetched, skipped, failed, elapsed) for observability.
You want a single runnable module plus a requirements list and the exact run command.

Example output

Expect a single runnable module plus a requirements list and the exact run command. The fetch layer uses httpx with a descriptive User-Agent and checks robots.txt before requesting. Rate limiting holds to [rps], with tenacity-driven exponential backoff and jitter on 429 and 5xx. selectolax parses into [schema], skipping malformed rows with logged warnings. Results persist as resumable NDJSON, written so that re-runs skip already-fetched [target_site] URLs. Structured logging ends with a fetched/skipped/failed/elapsed summary, and an argparse CLI sets site, rate, and output path. The code is meant to run as-is, but treat it as a solid starting point to adapt rather than a finished tool.
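
The CLI portion could be as small as the sketch below. The flag names `--site`, `--rps`, and `--out`, and the default output filename, are assumptions for this example; only the three settings themselves (site, rate, output path) and the rate default of 1 come from the prompt description.

```python
import argparse


def positive_float(text):
    """argparse type that rejects zero or negative rates."""
    value = float(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("rps must be greater than 0")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Polite scraper (sketch).")
    parser.add_argument("--site", required=True, help="listing page to scrape")
    parser.add_argument("--rps", type=positive_float, default=1.0,
                        help="maximum requests per second (default: 1)")
    parser.add_argument("--out", default="output.ndjson",
                        help="path of the NDJSON output file")
    return parser


# Demo: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["--site", "https://example.com/products", "--rps", "0.5"]
)
print(args.site, args.rps, args.out)
```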

Pro tips

Set [rps] conservatively to start; you can raise it once backoff is proven, but a low cap protects you from bans.
Match [schema] exactly to the fields you actually need so selectolax selectors stay focused.
Point [target_site] at the real listing page so the parser targets the correct DOM structure.
Get the backoff working before scaling concurrency, or you'll just hammer the site into blocking you.
Keep robots.txt checking on; an honest User-Agent and respect for crawl rules are what keep a scraper sustainable.
Use the resumable NDJSON design for long runs so a crash or stop doesn't force a full re-fetch.


Engr Mejba Ahmed
