Case Study · Fastscraping

Real-Time Price Insights That Keep You Ahead

Playwright-based crawling, rotating proxies, and automated QA raised store coverage 6.5× and cut refresh latency from 24 hours to 1 hour.

Refresh Latency: 24h → 1h
Coverage (stores): 120 → 780
Schema Pass: 97.8%
ROI: 6.2×

Overview

Client & Context

Client: Anonymous CE Retailer
Type: B2C
Location: EU

Goal

Live price tracking across categories with hourly refresh.

Discovery

Problem & Constraints

Problems
  • Dynamic pricing with frequent changes
  • CAPTCHA walls on product pages
  • Geo-blocking & rate limiting
Constraints
  • Respect robots.txt & terms of service
  • Do not scrape PII
  • Maintain < 1 req/s per host
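The robots.txt and per-host pacing constraints above could be enforced by a small gate that every request passes through before it is sent. The following is a minimal sketch, not the production crawler: the class name `PoliteGate`, the `demo-bot` user agent, and the 1.1-second interval (chosen so the rate stays just under 1 req/s) are illustrative assumptions.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumption: slightly more than 1 s between requests keeps a host under 1 req/s.
MIN_INTERVAL_S = 1.1


class PoliteGate:
    """Hypothetical helper: checks robots.txt rules and per-host spacing."""

    def __init__(self, user_agent, min_interval_s=MIN_INTERVAL_S, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._robots = {}     # host -> parsed robots.txt
        self._last_seen = {}  # host -> time of the last recorded request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was fetched elsewhere for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is loaded and permits the URL."""
        parser = self._robots.get(urlsplit(url).netloc)
        # Conservative choice: with no robots.txt loaded, refuse instead of guessing.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's host."""
        last = self._last_seen.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval_s - (self.clock() - last))

    def mark_request(self, url):
        """Record that a request to this URL's host was just sent."""
        self._last_seen[urlsplit(url).netloc] = self.clock()
```

A caller would check `allowed(url)`, sleep for `wait_time(url)`, send the request, and then call `mark_request(url)`. Terms-of-service and PII rules still need a manual review per source; code like this only covers the mechanical part.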

Design

Solution Architecture

Crawler (Playwright) → Queue (Redis) → Parser (Node/Python) → QA (GE) → Storage (Postgres/S3) → Dashboard (Metabase)
Playwright · Redis · Node.js · Python · PostgreSQL · S3 · Metabase

Highlights

  • Rate-aware crawling with backoff & retries
  • Rotating proxies + geo-routing
  • Schema validation & dedupe pipeline
  • DOM-change detection alerts
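To make the schema validation and dedupe highlight concrete, here is a rough sketch of how such a stage might look. The field names (`store`, `sku`, `price`, `currency`) and the dedupe key are hypothetical; the case study does not publish the actual schema or the QA tooling configuration.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical required fields; the real schema is not given in the case study.
REQUIRED = ("store", "sku", "price", "currency")


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing {field}" for field in REQUIRED if not record.get(field)]
    if record.get("price"):
        try:
            if Decimal(str(record["price"])) <= 0:
                problems.append("price must be positive")
        except InvalidOperation:
            problems.append("price is not a number")
    return problems


def dedupe(records):
    """Keep the first record seen for each (store, sku, price, currency)."""
    seen, unique = set(), []
    for record in records:
        key = (record["store"], record["sku"], str(record["price"]), record["currency"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def run_pipeline(records):
    """Validate first, then dedupe; also report the share of records that passed."""
    valid = [r for r in records if not validate(r)]
    pass_rate = len(valid) / len(records) if records else 0.0
    return dedupe(valid), pass_rate
```

The pass rate returned by `run_pipeline` is the kind of figure a "Schema Pass" metric would be built from, although how the reported number was actually computed is not stated.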

Execution

Process & Quality

  1. Discovery & PoC

    Week 1

    Scope sources, define schema, sample run.

  2. Hardening

    Weeks 2–3

    Retries, human-like delays, CAPTCHA plan, change-detection.

  3. Scale-up

    Ongoing

    Scheduling, monitoring/alerts, docs & handover.

Learnings

Challenges & How We Solved Them

Real-world blockers we hit and the concrete fixes we shipped.

  • Issue 1

    Strict rate limits from multiple hosts

    Fix

    Queue bucketing per-host + exponential backoff with jitter

    Impact · Error rate ↓ 78%, stable hourly refresh
    Technical details

    Requests are grouped by hostname, and each host bucket has its own token-bucket limiter. Retry delays follow 2^n + random(ms): exponential in the retry count n, plus random jitter in milliseconds.

    Resolved & monitored
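The per-host limiting and retry timing described in Issue 1 might be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped code: the class names, the 1-token capacity, the 60-second cap, and the choice of seconds as the unit for 2^n are all assumptions, since the write-up gives only the formula.

```python
import random
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may go out."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class HostLimiter:
    """One independent bucket per hostname, created on first use."""

    def __init__(self, rate=1.0, capacity=1, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self._buckets = {}

    def try_acquire(self, url):
        host = urlsplit(url).hostname
        if host not in self._buckets:
            self._buckets[host] = TokenBucket(self.rate, self.capacity, self.clock)
        return self._buckets[host].try_acquire()


def backoff_delay(attempt, base_s=1.0, max_jitter_ms=1000, cap_s=60.0, rng=random):
    """Seconds to wait before retry `attempt` (0-based): 2^attempt plus jitter.

    Assumption: the exponential part is in seconds and capped at `cap_s`;
    the jitter is drawn uniformly from 0..max_jitter_ms milliseconds.
    """
    jitter_s = rng.uniform(0, max_jitter_ms) / 1000.0
    return min(cap_s, base_s * 2 ** attempt) + jitter_s
```

Because each host has its own bucket, a slow or strict host does not hold back requests to the others, and the jitter keeps retries from different workers from landing at the same instant.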
  • Issue 2

    CAPTCHA walls on product pages

    Fix

    Token-based challenge solving (where permitted) + human-like navigation

    Impact · Block rate ↓ 63%, consistent crawl windows
    Technical details

    We prefetch challenge tokens via server-to-server requests and use deterministic waits, scrolling, and input cadence to reduce flags.

    Resolved & monitored
  • Issue 3

    Frequent layout / DOM changes

    Fix

    Semantic locators + CSS/XPath fallback + DOM-hash change alerts

    Impact · Parser breaks ↓ 70%, faster hotfix turnaround
    Technical details

    Primary selectors use data-* attributes and ARIA roles; the fallback chain tries CSS, then XPath. A nightly diff alerts us when the DOM hash changes.

    Resolved & monitored
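The fallback chain and DOM-hash alert from Issue 3 can be illustrated with a short standard-library sketch. How the DOM hash is actually computed is not stated, so the approach here (hashing only tag structure plus `id`, `class`, `role`, and `data-*` attribute names, so that price text changes do not trigger an alert) is an assumption. The `first_match` helper takes plain callables; in the crawler these would presumably wrap Playwright locators.

```python
import hashlib
from html.parser import HTMLParser


class _Skeleton(HTMLParser):
    """Collects the structural outline of a page and ignores its text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep id/class/role values, but only the names of data-* attributes,
        # so changing values such as data-price do not alter the hash.
        kept = sorted(
            name if name.startswith("data-") else f"{name}={value}"
            for name, value in attrs
            if name in ("id", "class", "role") or name.startswith("data-")
        )
        self.parts.append(f"<{tag} {' '.join(kept)}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def dom_hash(html):
    """SHA-256 of the page skeleton; a changed hash suggests a layout change."""
    parser = _Skeleton()
    parser.feed(html)
    parser.close()
    return hashlib.sha256("".join(parser.parts).encode("utf-8")).hexdigest()


def first_match(page, strategies):
    """Try each (name, extractor) in order; return the first truthy result.

    Returns (None, None) when every strategy fails, which is the case
    that should raise a parser-break alert.
    """
    for name, extract in strategies:
        value = extract(page)
        if value:
            return name, value
    return None, None
```

A nightly job could store `dom_hash` per page template and alert when it differs from the previous run. Logging which strategy name `first_match` returned also shows when the primary selector has silently stopped working.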