What this prompt does
This prompt casts the model as a senior Python engineer who builds data pipelines, and asks for working code that runs as-is and behaves as a polite citizen of the target site. It specifies six deliverables: a fetch layer using httpx with a descriptive User-Agent and robots.txt checking before any request, rate limiting plus exponential backoff with jitter on 429 and 5xx via tenacity, parsing with selectolax that skips malformed rows with a logged warning, resumable NDJSON output that skips already-fetched URLs on re-run, structured logging with a final run summary, and a small argparse CLI. The structure works because it front-loads the boring discipline that separates a useful scraper from a banned IP.
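The robots.txt check in the first deliverable can be sketched with the standard library alone. This is a minimal illustration, not the code the prompt generates: it assumes the robots.txt body has already been downloaded (for example with httpx), and the User-Agent string and the helper names `build_robots` and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical descriptive User-Agent; use your own project name and contact.
USER_AGENT = "example-pipeline-bot/0.1 (+mailto:ops@example.com)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse an already-downloaded robots.txt body into a reusable parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the crawl rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed` before every request, and skipping the URL when it returns False, keeps the check ahead of the first request rather than bolted on afterwards.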
Three variables drive it. [target_site] is the site to scrape (by default, a products page). [rps] caps requests per second, defaulting to 1, and the rate limiter enforces it. [schema] lists the output fields that the selectolax parser populates: title, price, sku, and stock. The backoff in deliverable two is the load-bearing part: getting it working before scaling up concurrency is what keeps you from hammering the site into blocking you. Robots.txt checking and an honest User-Agent are not afterthoughts here; they are required before the first request goes out.
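To make the pacing and backoff behaviour concrete, here is a standard-library sketch of the two ideas. The prompt itself asks for tenacity; this hand-rolled loop is a stand-in used only to show the logic. All names (`RateLimiter`, `backoff_delay`, `fetch_with_backoff`) and the base and cap values are assumptions, and `fetch` is any callable returning a `(status_code, body)` pair.

```python
import random
import time


def is_retryable(status_code: int) -> bool:
    """Treat 429 and any 5xx as transient failures worth retrying."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential ceiling (base * 2**attempt, capped) with full jitter."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


class RateLimiter:
    """Spaces calls at least 1/rps seconds apart."""

    def __init__(self, rps: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rps
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        now = self.clock()
        if now < self._next_allowed:
            self.sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


def fetch_with_backoff(fetch, url, limiter, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) under the rate limit, backing off on 429 and 5xx."""
    for attempt in range(max_attempts):
        limiter.wait()
        status, body = fetch(url)
        # Return on success, on a non-retryable status, or on the final attempt.
        if not is_retryable(status) or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The clock and sleep functions are injectable so the logic can be exercised without real waiting. In a real run the defaults (`time.monotonic`, `time.sleep`) apply.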
When to use it
- You need a web scraper that respects robots.txt and won't get your IP banned.
- Rate limiting and exponential backoff with jitter on 429 and 5xx are requirements.
- You want resumable output so a re-run skips already-fetched URLs.
- Malformed rows should be skipped with a logged warning rather than crashing the run.
- You need a final run summary (fetched, skipped, failed, elapsed) for observability.
- You want a single runnable module plus a requirements list and the exact run command.
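The resumable-output requirement above might look roughly like the following sketch. It assumes each NDJSON record carries a `url` field that identifies the page it came from; that key, and the helper names `load_seen_urls` and `append_record`, are illustrative choices rather than anything the prompt fixes.

```python
import json
from pathlib import Path
from typing import Set


def load_seen_urls(path: Path) -> Set[str]:
    """Collect URLs already present in the output file so a re-run can skip them."""
    seen: Set[str] = set()
    if not path.exists():
        return seen
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash can leave a truncated final line; ignore it.
                continue
            if isinstance(record, dict) and record.get("url"):
                seen.add(record["url"])
    return seen


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

On startup the scraper would call `load_seen_urls` once, then skip any URL in the returned set and count it as skipped in the run summary.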
Example output
Expect a single runnable module plus a requirements list and the exact run command. The fetch layer uses httpx with a descriptive User-Agent and checks robots.txt before requesting; rate limiting holds to [rps] with tenacity-driven exponential backoff and jitter on 429 and 5xx; selectolax parses into [schema], skipping malformed rows with logged warnings; results persist as resumable NDJSON keyed so re-runs skip already-fetched [target_site] URLs; structured logging ends with a fetched/skipped/failed/elapsed summary; and an argparse CLI sets site, rate, and output path. It runs as-is and is a solid starting point to adapt.
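As a rough picture of the last two pieces, the sketch below shows one way the run summary and the CLI could be shaped. The flag names (`--site`, `--rps`, `--out`), the default output filename, and the JSON layout of the summary line are assumptions for illustration; the generated module may choose different ones.

```python
import argparse
import json
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunSummary:
    """Counters for the final fetched/skipped/failed/elapsed log line."""
    fetched: int = 0
    skipped: int = 0
    failed: int = 0
    started: float = field(default_factory=time.monotonic)

    def as_log_line(self, now: Optional[float] = None) -> str:
        """Render the summary as one structured (JSON) log line."""
        end = time.monotonic() if now is None else now
        return json.dumps({
            "event": "run_summary",
            "fetched": self.fetched,
            "skipped": self.skipped,
            "failed": self.failed,
            "elapsed_s": round(end - self.started, 3),
        })


def build_parser() -> argparse.ArgumentParser:
    """CLI covering the three settings named above: site, rate, output path."""
    parser = argparse.ArgumentParser(description="Polite product scraper (sketch)")
    parser.add_argument("--site", required=True, help="listing page URL to scrape")
    parser.add_argument("--rps", type=float, default=1.0, help="max requests per second")
    parser.add_argument("--out", default="products.ndjson", help="NDJSON output path")
    return parser
```

With these hypothetical flags, a run command would look like `python scraper.py --site https://example.com/products --rps 1 --out products.ndjson`.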
Pro tips
- Set [rps] conservatively to start; you can raise it once backoff is proven, but a low cap protects you from bans.
- Match [schema] exactly to the fields you actually need so selectolax selectors stay focused.
- Point [target_site] at the real listing page so the parser targets the correct DOM structure.
- Get the backoff working before scaling concurrency, or you'll just hammer the site into blocking you.
- Keep robots.txt checking on; an honest User-Agent and respect for crawl rules are what keep a scraper sustainable.
- Use the resumable NDJSON design for long runs so a crash or stop doesn't force a full re-fetch.
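The tip about matching [schema] to the fields you need pairs naturally with the skip-and-warn rule for malformed rows. The sketch below shows only that validation step and assumes the raw values have already been pulled out of the DOM (with selectolax in the prompt's design) into plain dicts. The price format handled here and the names `clean_row` and `parse_rows` are hypothetical.

```python
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("scraper.parse")

# The default schema named in the prompt.
SCHEMA = ("title", "price", "sku", "stock")


def clean_row(raw: dict) -> dict:
    """Validate one extracted row; raise ValueError if it is malformed."""
    missing = [f for f in SCHEMA if raw.get(f) is None or not str(raw[f]).strip()]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Assumed price format: optional "$" and thousands separators, e.g. "$1,299.50".
    price = float(str(raw["price"]).replace("$", "").replace(",", "").strip())
    return {
        "title": str(raw["title"]).strip(),
        "price": price,
        "sku": str(raw["sku"]).strip(),
        "stock": str(raw["stock"]).strip(),
    }


def parse_rows(raw_rows: Iterable[dict]) -> Iterator[dict]:
    """Yield clean rows; log a warning and move on when a row is malformed."""
    for index, raw in enumerate(raw_rows):
        try:
            yield clean_row(raw)
        except ValueError as exc:
            logger.warning("skipping malformed row %d: %s", index, exc)
```

Because a bad row only produces a warning, one broken listing cannot abort a long run, and the warnings show which selectors need attention.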