Playwright-based crawling, rotating proxies, and automated QA raised coverage 6.5× and cut latency to just 1 hour.
Overview
Goal
Live price tracking across categories with hourly refresh.
Discovery
Design
Execution
Scope sources, define schema, sample run.
Retries, human-like delays, CAPTCHA plan, change-detection.
Scheduling, monitoring/alerts, docs & handover.
Learnings
Real-world blockers we hit and the concrete fixes we shipped.
Strict rate limits from multiple hosts
Queue bucketing per-host + exponential backoff with jitter
Requests are grouped by hostname; each bucket has its own token bucket limiter. Retries follow 2^n+random(ms).
CAPTCHA walls on product pages
Token-based challenge solving (where permitted) + human-like navigation
We prefetch challenge tokens via server-to-server and use deterministic waits, scrolling and input cadence to reduce flags.
Frequent layout / DOM changes
Semantic locators + CSS/XPath fallback + DOM-hash change alerts
Primary selectors use data-* and ARIA roles; fallback chain tries CSS→XPath. A nightly diff alerts us when DOM hash changes.