
Claude/ChatGPT Prompt to Write a Polite Python Web Scraper with Retries

Build a polite Python web scraper that respects robots.txt, rate-limits requests, retries with exponential backoff, and writes structured NDJSON output.

What this prompt does

This prompt casts the model as a senior Python engineer who builds data pipelines, and it asks for working code that runs as-is and behaves as a polite citizen of the target site. It specifies six deliverables: (1) a fetch layer using httpx with a descriptive User-Agent and a robots.txt check before any request; (2) rate limiting plus exponential backoff with jitter on 429 and 5xx responses via tenacity; (3) parsing with selectolax that skips malformed rows with a logged warning; (4) resumable NDJSON output that skips already-fetched URLs on re-run; (5) structured logging with a final run summary; and (6) a small argparse CLI. The structure works because it front-loads the boring discipline that separates a useful scraper from a banned IP.
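
To make the robots.txt requirement concrete, here is a minimal sketch that uses only the standard library's urllib.robotparser. It is an illustration, not the code the prompt produces: the User-Agent string, the sample rules, and the helper names build_robots and allowed are all hypothetical, and the rules are parsed from an in-memory string so the example runs without touching the network. A real fetch layer would download robots.txt first (the prompt asks for httpx) and pass the text in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent; a real one should identify you and give a contact URL.
USER_AGENT = "example-polite-scraper/0.1 (+https://example.com/bot-info)"

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


robots = build_robots(SAMPLE_ROBOTS)
products_ok = allowed(robots, "https://example.com/products")
admin_ok = allowed(robots, "https://example.com/admin/users")
print(products_ok, admin_ok)
```

The point of the design is ordering: the allowed check runs before any request is issued, so a disallowed URL never reaches the HTTP client.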

Three variables drive it. [target_site] is the site to scrape (the default is a products page). [rps] caps requests per second and defaults to 1; the rate limiter enforces it. [schema] lists the output fields (title, price, sku, stock) that selectolax parses each row into. The backoff in deliverable two is the load-bearing part: getting it working before you scale up concurrency is what keeps you from hammering the site into blocking you. Robots.txt checking and an honest User-Agent are not afterthoughts here; both are required before the first request goes out.
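
Since the backoff is described as the load-bearing part, a small sketch of the idea may help. The prompt asks for tenacity; the version below deliberately swaps in a hand-rolled, standard-library loop so the mechanics are visible and the example runs anywhere. The names backoff_delay and fetch_with_retries, the "full jitter" strategy, the base and cap values, and the fake send callable are assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional

# Status codes treated as retryable: 429 plus every 5xx.
RETRYABLE = {429} | set(range(500, 600))


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def fetch_with_retries(send: Callable[[], int], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE:
            break
        sleep(backoff_delay(attempt))
        status = send()
    return status


# Simulated server: two retryable responses, then success. Sleeps are recorded, not slept.
responses = iter([503, 429, 200])
slept = []
status = fetch_with_retries(lambda: next(responses), sleep=slept.append)
print(status, len(slept))
```

Injecting the sleep function keeps the retry loop testable without real waiting; in the generated scraper, tenacity's wait and retry strategies would take over this role.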

When to use it

  • You need a web scraper that respects robots.txt and won't get your IP banned.
  • Rate limiting and exponential backoff with jitter on 429 and 5xx are requirements.
  • You want resumable output so a re-run skips already-fetched URLs.
  • Malformed rows should be skipped with a logged warning rather than crashing the run.
  • You need a final run summary (fetched, skipped, failed, elapsed) for observability.
  • You want a single runnable module plus a requirements list and the exact run command.
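
The rate-limiting requirement in the list above can be sketched as a minimum-interval limiter: each request waits until at least 1/rps seconds have passed since the previous one. This is a single-threaded sketch under assumptions, not the prompt's guaranteed output; the RateLimiter and FakeClock classes are hypothetical, and the fake clock exists only so the example finishes instantly.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Space calls at least 1/rps seconds apart (single-threaded sketch)."""

    def __init__(self, rps: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if rps <= 0:
            raise ValueError("rps must be positive")
        self.min_interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


class FakeClock:
    """Test double: sleeping advances a counter instead of real time."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


fake = FakeClock()
limiter = RateLimiter(rps=2, clock=fake.now, sleep=fake.sleep)
waits = [limiter.wait() for _ in range(3)]
print(waits)
```

With real time.monotonic and time.sleep (the defaults), calling limiter.wait() before each request holds the scraper to the configured cap.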

Example output

Expect a single runnable module plus a requirements list and the exact run command. The fetch layer uses httpx with a descriptive User-Agent and checks robots.txt before any request. Rate limiting holds to [rps], with tenacity-driven exponential backoff and jitter on 429 and 5xx. selectolax parses into [schema], skipping malformed rows with logged warnings. Results persist as resumable NDJSON, so re-runs skip already-fetched [target_site] URLs. Structured logging ends with a fetched/skipped/failed/elapsed summary, and an argparse CLI sets site, rate, and output path. The code is meant to run as-is, and it is a solid starting point to adapt to your own site.
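
As a rough picture of the CLI deliverable, the sketch below shows an argparse parser that sets site, rate, and output path. The flag names (--site, --rps, --out) and the default output filename are assumptions; the model may choose different ones.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small CLI; flag names here are hypothetical."""
    parser = argparse.ArgumentParser(description="Polite scraper (sketch).")
    parser.add_argument("--site", required=True, help="Listing URL to scrape")
    parser.add_argument("--rps", type=float, default=1.0,
                        help="Maximum requests per second")
    parser.add_argument("--out", default="output.ndjson",
                        help="Path of the NDJSON output file")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--site", "https://example.com/products", "--rps", "0.5"]
)
print(args.site, args.rps, args.out)
```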

Pro tips

  • Set [rps] conservatively to start; you can raise it once backoff is proven, but a low cap protects you from bans.
  • Match [schema] exactly to the fields you actually need so selectolax selectors stay focused.
  • Point [target_site] at the real listing page so the parser targets the correct DOM structure.
  • Get the backoff working before scaling concurrency, or you'll just hammer the site into blocking you.
  • Keep robots.txt checking on; an honest User-Agent and respect for crawl rules are what keep a scraper sustainable.
  • Use the resumable NDJSON design for long runs so a crash or stop doesn't force a full re-fetch.
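
The resumable NDJSON tip can be sketched as follows: before fetching, read the existing output file, collect the URLs already recorded, and skip them. This assumes each record carries a "url" field, which the prompt does not spell out; the function names and the fake fetch callable are also hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set, Tuple


def load_seen_urls(path: str) -> Set[str]:
    """Collect the URLs already present in an NDJSON file (one JSON object per line)."""
    seen: Set[str] = set()
    if not os.path.exists(path):
        return seen
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                seen.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # ignore lines that are not valid records
    return seen


def append_record(path: str, record: dict) -> None:
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def run(urls: Iterable[str], path: str,
        fetch: Callable[[str], dict]) -> Tuple[int, int]:
    """Fetch only URLs not yet in the output; return (fetched, skipped)."""
    seen = load_seen_urls(path)
    fetched = skipped = 0
    for url in urls:
        if url in seen:
            skipped += 1
            continue
        append_record(path, {"url": url, **fetch(url)})
        seen.add(url)
        fetched += 1
    return fetched, skipped


def fake_fetch(url: str) -> dict:
    """Stand-in for the real fetch-and-parse step."""
    return {"title": "Widget"}


out_path = os.path.join(tempfile.mkdtemp(), "products.ndjson")
first = run(["https://example.com/p/1", "https://example.com/p/2"], out_path, fake_fetch)
second = run(["https://example.com/p/1", "https://example.com/p/2",
              "https://example.com/p/3"], out_path, fake_fetch)
print(first, second)
```

Appending one line per record is what makes the format friendly to interruption: everything written before a stop remains readable on the next run.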

Frequently Asked Questions

Does this scraper respect robots.txt?
Yes. The fetch layer is required to check robots.txt before any request and use a descriptive User-Agent. The whole prompt is framed around being a polite citizen of the target site, since that discipline is what separates a sustainable scraper from a banned IP.

How does it handle rate limits and server errors?
It rate-limits to your configured [rps] and uses tenacity for exponential backoff with jitter on 429 and 5xx responses. The advice is to get the backoff working before scaling concurrency, or you risk hammering the site into blocking you.

Can I stop and resume a long scrape?
The NDJSON output is required to be resumable so a re-run skips already-fetched URLs. This means a crash or manual stop does not force a full re-fetch, which matters for large jobs against a single [target_site].

What happens to rows that fail to parse?
Malformed rows are skipped with a logged warning rather than crashing the run, and selectolax parses the rest into your [schema]. A final run summary reports how many were fetched, skipped, failed, and how long it took.
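
The skip-and-warn behavior and the run summary might look roughly like the sketch below. It leaves out the selectolax HTML step and starts from already-extracted dictionaries, so only the validation and counting logic is shown. The RunSummary class, the clean_row and parse_rows helpers, the sample rows, and the choice to count malformed rows under "skipped" are all assumptions.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Iterable, List

logger = logging.getLogger("scraper")

# Field names taken from the default [schema]; types below are assumed.
SCHEMA = ("title", "price", "sku", "stock")


@dataclass
class RunSummary:
    fetched: int = 0
    skipped: int = 0
    failed: int = 0
    started: float = field(default_factory=time.monotonic)

    def as_dict(self) -> dict:
        """Snapshot of the counters plus elapsed seconds."""
        return {
            "fetched": self.fetched,
            "skipped": self.skipped,
            "failed": self.failed,
            "elapsed_s": round(time.monotonic() - self.started, 3),
        }


def clean_row(raw: dict) -> dict:
    """Return a validated row, or raise if a field is missing or malformed."""
    missing = [name for name in SCHEMA if raw.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "title": raw["title"].strip(),
        "price": float(raw["price"]),
        "sku": raw["sku"].strip(),
        "stock": int(raw["stock"]),
    }


def parse_rows(raw_rows: Iterable[dict], summary: RunSummary) -> List[dict]:
    """Keep valid rows; log a warning and count each malformed one."""
    rows = []
    for index, raw in enumerate(raw_rows):
        try:
            rows.append(clean_row(raw))
        except (ValueError, AttributeError, TypeError) as exc:
            summary.skipped += 1
            logger.warning("skipping malformed row %d: %s", index, exc)
    return rows


raw_rows = [
    {"title": " Widget ", "price": "9.99", "sku": "W-1", "stock": "4"},
    {"title": "Broken", "price": "n/a", "sku": "B-1", "stock": "1"},
    {"title": "No SKU", "price": "1.00", "stock": "2"},
]
summary = RunSummary()
rows = parse_rows(raw_rows, summary)
logger.info("run summary: %s", summary.as_dict())
print(rows, summary.skipped)
```

One bad row costs a warning and a counter increment, and the rest of the page still makes it into the output.
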
Engr Mejba Ahmed
