Case Study · Web Scraping & Data Aggregation

Multi-Source Product Intelligence Platform for a US Trade Research Firm

A US-based trade research firm (confidential client) needed to scrape, normalize, and analyse product data from 50 sources (manufacturer catalogues, distributor sites, marketplace listings, and regulatory databases) to support docket research and produce annual industry reports. SpiderHunts built an end-to-end product intelligence platform: scheduled multi-source scraping, semantic deduplication via vector embeddings, an analyst workbench dashboard, and automated report generation. The platform tracks 2.4 million products with an average data freshness lag under 24 hours, and it lifted the firm's docket capacity from roughly 30 per year to over 120 with the same headcount.

Confidential Engagement
This engagement is covered by a non-disclosure agreement. The client name, project URLs, and any identifying details have been withheld. All metrics, architecture details, and outcomes shown are accurate and can be verified by qualified prospective clients on request.

Project Snapshot

Industry
Trade Research & Legal Intelligence
Project Type
Multi-Source Web Scraping + Data Aggregation
Duration
14 weeks
Investment
£24,000 fixed-price + £1,200/mo operations
Sources Aggregated
50 (manufacturers, distributors, marketplaces, regulators)
Products Tracked
2.4 million

The Challenge

The firm (client identity withheld under non-disclosure) supported clients on docket cases that required deep product-level evidence: supplier lists, pricing histories, technical specifications, compliance markings, and country-of-origin data. All of this was spread across roughly 50 websites in inconsistent formats. Analysts spent weeks per case manually compiling spreadsheets, and annual industry reports took months of cross-referencing.

Before SpiderHunts

  • Analysts spent 3 weeks per docket on manual product data compilation
  • Roughly 50 source websites with inconsistent schemas, formats, and HTML structures
  • Same product described differently across sources — manual matching was a major bottleneck
  • Annual industry reports took 8–12 weeks of analyst time to produce
  • Limited capacity meant the firm could only support roughly 30 dockets per year
  • Pricing and availability data went stale fast — research was often weeks out of date by submission

After SpiderHunts

  • 50 sources — Automated scraping pipelines
  • 2.4 million — Products tracked in near-real-time
  • 3 weeks → 2 days — Time to compile docket evidence
  • 30/year → 120+/year — Dockets supported
  • 8–12 weeks → 3 days — Annual report production time
  • <24 hours — Average data freshness lag

The Solution

SpiderHunts designed and built a unified Product Intelligence Platform that scrapes all 50 sources on automated schedules, normalizes the data into a single product schema, deduplicates and matches products across sites using semantic vector embeddings, and gives analysts a single workbench to query, compare, and export. The platform also generates the annual industry reports automatically.

01

Multi-source scraping fleet

Fifty source-specific scrapers built with Scrapy, Playwright, or Beautiful Soup, depending on site complexity. Price-sensitive sources run hourly; the rest run daily or weekly. Each scraper has its own monitoring, retry logic, and schema-drift alerting.
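
As an illustration of what a schema-drift check might look like, here is a minimal sketch in Python. It is not the platform's actual code: the field names, the `check_schema_drift` helper, and the 90% fill-rate threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical set of fields one source's scraper is expected to yield.
EXPECTED_FIELDS = {"manufacturer", "model", "sku", "price"}


@dataclass
class DriftReport:
    missing: set[str]      # expected fields that are absent or empty too often
    unexpected: set[str]   # fields that appeared but are not in the expected set

    @property
    def drifted(self) -> bool:
        return bool(self.missing or self.unexpected)


def check_schema_drift(items: list[dict],
                       expected: set[str] = EXPECTED_FIELDS,
                       min_fill_rate: float = 0.9) -> DriftReport:
    """Flag fields whose fill rate across one scrape batch drops below min_fill_rate."""
    if not items:
        # An empty batch is treated as every expected field going missing.
        return DriftReport(missing=set(expected), unexpected=set())
    filled = {name: 0 for name in expected}
    seen: set[str] = set()
    for item in items:
        seen.update(item)  # collects the item's keys
        for name in expected:
            if item.get(name) not in (None, ""):
                filled[name] += 1
    missing = {name for name, count in filled.items()
               if count / len(items) < min_fill_rate}
    return DriftReport(missing=missing, unexpected=seen - expected)


# Example: the price selector silently stopped matching after a site redesign.
batch = [
    {"manufacturer": "Acme", "model": "X1", "sku": "A-1", "price": None},
    {"manufacturer": "Acme", "model": "X2", "sku": "A-2", "price": ""},
]
report = check_schema_drift(batch)
```

A check of this kind would run after each scrape batch, with a drifted report feeding whatever alerting channel the team uses.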

02

Anti-bot & compliance layer

Rotating residential proxies, randomized headers, realistic browsing patterns, robots.txt-aware crawling, and rate-limited polite scraping. A compliance review is run on every source before ingestion.
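
The robots.txt-aware, rate-limited part of this layer can be sketched with Python's standard library alone. The user agent string, the `PoliteFetcher` class, and the one-second default delay are assumptions for illustration; the production crawlers may enforce these rules differently (for example through Scrapy settings).

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"  # hypothetical user agent string


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteFetcher:
    """Gate requests on robots.txt rules and a minimum delay between requests."""

    def __init__(self, robots: RobotFileParser, default_delay: float = 1.0):
        self.robots = robots
        delay = robots.crawl_delay(USER_AGENT)  # None if no Crawl-delay is declared
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url: str) -> bool:
        return self.robots.can_fetch(USER_AGENT, url)

    def wait_turn(self) -> None:
        """Sleep until at least self.delay seconds have passed since the last request."""
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


robots = build_robots("User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n")
fetcher = PoliteFetcher(robots)
```

A caller would check `fetcher.allowed(url)` and then call `fetcher.wait_turn()` before each request to a given host.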

03

Unified product schema

Every source-specific scraper writes to a single canonical product schema: manufacturer, model, SKU, specs, pricing history, supplier, country of origin, and compliance markings. The schema lives in PostgreSQL, with JSONB columns for source-specific attributes that do not fit the canonical model.
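
To make the idea concrete, the sketch below maps one raw source record onto a canonical shape and routes anything unmapped into an `extra` dictionary, which is the part that would correspond to a JSONB column. The class, the `normalize` helper, the field map, and the sample record are all hypothetical; pricing history is left out because it is stored as a separate time series.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Canonical scalar fields taken from the list above (snake_case names are assumed).
CANONICAL_FIELDS = ("manufacturer", "model", "sku", "supplier", "country_of_origin")


@dataclass
class CanonicalProduct:
    source: str
    manufacturer: Optional[str] = None
    model: Optional[str] = None
    sku: Optional[str] = None
    supplier: Optional[str] = None
    country_of_origin: Optional[str] = None
    compliance_markings: list[str] = field(default_factory=list)
    specs: dict[str, Any] = field(default_factory=dict)
    extra: dict[str, Any] = field(default_factory=dict)  # would map to a JSONB column


def normalize(source: str, raw: dict[str, Any],
              field_map: dict[str, str]) -> CanonicalProduct:
    """Map a raw source record onto the canonical schema; keep leftovers in extra."""
    product = CanonicalProduct(source=source)
    for raw_key, value in raw.items():
        target = field_map.get(raw_key, raw_key)
        if target in CANONICAL_FIELDS:
            setattr(product, target, value.strip() if isinstance(value, str) else value)
        elif target == "compliance_markings":
            product.compliance_markings = [
                mark.strip() for mark in str(value).split(",") if mark.strip()
            ]
        elif target == "specs" and isinstance(value, dict):
            product.specs = dict(value)
        else:
            product.extra[raw_key] = value  # source-specific attribute
    return product


raw = {"Brand": " Acme ", "Part No": "A-100", "Origin": "DE",
       "Marks": "CE, UL", "Pack Qty": 12}
field_map = {"Brand": "manufacturer", "Part No": "sku",
             "Origin": "country_of_origin", "Marks": "compliance_markings"}
product = normalize("distributor-a", raw, field_map)
```

Keeping a per-source field map separate from the dataclass means a new source only needs a new mapping, not a schema change.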

04

Semantic product matching

Different sources describe the same product differently. OpenAI embeddings of product titles, descriptions, and specs are stored in pgvector, which enables fuzzy matching across sources. Candidate matches are clustered and graded by confidence, and ambiguous clusters are surfaced for analyst confirmation.
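
A minimal sketch of confidence-graded matching, using cosine similarity on toy three-dimensional vectors in place of real embeddings. The two thresholds and the `grade_match` helper are assumptions; the case study does not state the actual cut-offs or clustering method.

```python
import math

# Hypothetical thresholds; real values would be tuned against analyst decisions.
AUTO_MERGE = 0.92
NEEDS_REVIEW = 0.80


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grade_match(a: list[float], b: list[float]) -> tuple[str, float]:
    """Grade a candidate pair as 'duplicate', 'review' (analyst confirms), or 'distinct'."""
    score = cosine_similarity(a, b)
    if score >= AUTO_MERGE:
        return "duplicate", score
    if score >= NEEDS_REVIEW:
        return "review", score
    return "distinct", score


# Toy vectors standing in for embeddings of three listings.
listing_a = [1.0, 0.0, 0.0]
listing_b = [0.99, 0.05, 0.0]   # near-identical description
listing_c = [0.85, 0.5, 0.0]    # similar, but ambiguous
listing_d = [0.0, 1.0, 0.0]     # unrelated product

grades = [grade_match(listing_a, other)[0]
          for other in (listing_b, listing_c, listing_d)]
```

In pgvector the `<=>` operator returns cosine distance (one minus similarity), so the same grading could be applied to the nearest neighbours returned by a query rather than computed in Python.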

05

Pricing & availability history

Every scrape captures the current state and appends it to a time series. Analysts can plot price evolution, supplier transitions, and stock fluctuations across any date range, which is essential for retrospective docket evidence.
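
The append-only pattern can be sketched in memory as follows. In the platform this data would live in the database; the `Observation` and `PriceHistory` classes here are hypothetical and only show the shape of the idea: nothing is overwritten, and any date range can be queried afterwards.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    observed_at: datetime
    price: float
    in_stock: bool


@dataclass
class PriceHistory:
    observations: list[Observation] = field(default_factory=list)

    def append(self, obs: Observation) -> None:
        """Append only; earlier observations are never modified."""
        if self.observations and obs.observed_at < self.observations[-1].observed_at:
            raise ValueError("observations must be appended in time order")
        self.observations.append(obs)

    def between(self, start: datetime, end: datetime) -> list[Observation]:
        """Return observations with start <= observed_at <= end."""
        times = [o.observed_at for o in self.observations]
        return self.observations[bisect_left(times, start):bisect_right(times, end)]

    def price_changes(self, start: datetime, end: datetime) -> list[Observation]:
        """Return observations in the window whose price differs from the previous one."""
        window = self.between(start, end)
        return [cur for prev, cur in zip(window, window[1:]) if cur.price != prev.price]


history = PriceHistory()
history.append(Observation(datetime(2024, 1, 1), 10.0, True))
history.append(Observation(datetime(2024, 1, 2), 10.0, True))
history.append(Observation(datetime(2024, 1, 3), 12.5, False))
changes = history.price_changes(datetime(2024, 1, 1), datetime(2024, 1, 3))
```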

06

Analyst workbench

React + TypeScript dashboard with faceted search across all 2.4M products. Saved searches, watchlists, alerting on changes, side-by-side product comparisons, and one-click export to CSV/Excel.

07

Annual report generator

Templated industry reports auto-populated with current data — charts, tables, supplier directories, year-on-year trend graphs. Generated as PDF via WeasyPrint with the firm's branding, ready for analyst review and client delivery.

08

Observability & data quality

Datadog dashboards on scrape success rates, freshness lag per source, deduplication confidence distributions, and analyst usage patterns. Schema-drift alerts fire to the engineering team within minutes of any source breaking.
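
Freshness lag per source is simple to derive from the timestamp of each source's last successful scrape. The sketch below assumes a 24-hour threshold and made-up source names; how the platform actually computes and ships this metric to Datadog is not described in the case study.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(last_success: dict[str, datetime],
                  now: datetime) -> dict[str, timedelta]:
    """Time elapsed since each source's last successful scrape."""
    return {source: now - ts for source, ts in last_success.items()}


def average_lag(last_success: dict[str, datetime], now: datetime) -> timedelta:
    lags = freshness_lag(last_success, now)
    if not lags:
        return timedelta(0)
    return sum(lags.values(), timedelta(0)) / len(lags)


def stale_sources(last_success: dict[str, datetime], now: datetime,
                  max_lag: timedelta = timedelta(hours=24)) -> list[str]:
    """Sources whose lag exceeds max_lag, sorted by name for stable alert output."""
    lags = freshness_lag(last_success, now)
    return sorted(source for source, lag in lags.items() if lag > max_lag)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_success = {
    "distributor-a": now - timedelta(hours=2),    # hypothetical source names
    "marketplace-b": now - timedelta(hours=30),
}
stale = stale_sources(last_success, now)
```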

Technology

Tech Stack

Production-grade components selected for reliability at scale, observability, and the long maintenance horizon of an ongoing data platform.

  • Python 3.12
  • Scrapy
  • Playwright
  • Beautiful Soup
  • FastAPI
  • PostgreSQL
  • JSONB columns
  • pgvector
  • Full-text search
  • Redis
  • Celery
  • AWS ECS (us-east-1)
  • S3 data lake
  • React + TypeScript dashboard
  • WeasyPrint (PDF reports)
  • Datadog observability

Measurable Outcomes

Results

Measured at the 6-month mark following platform launch and full source ramp-up.

  • Automated scraping pipelines: 50 sources
  • Products tracked in near-real-time: 2.4 million
  • Time to compile docket evidence: 3 weeks → 2 days (−90%)
  • Dockets supported: 30/year → 120+/year (+300%)
  • Annual report production time: 8–12 weeks → 3 days (−96%)
  • Average data freshness lag: <24 hours
  • Analyst productivity: 5x
  • Project payback period: ~5 months

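
The percentage deltas above can be roughly reproduced with a few lines of arithmetic. The calculation assumes calendar days (7 per week), which the case study does not state; counting working days would shift the figures by a few points.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Assumption: weeks are converted to calendar days (7 per week).
docket_time = percent_change(3 * 7, 2)       # 3 weeks -> 2 days
dockets = percent_change(30, 120)            # 30/year -> 120/year
report_from_8 = percent_change(8 * 7, 3)     # 8 weeks -> 3 days
report_from_12 = percent_change(12 * 7, 3)   # 12 weeks -> 3 days

print(round(docket_time, 1))     # -90.5
print(round(dockets, 1))         # 300.0
print(round(report_from_8, 1))   # -94.6
print(round(report_from_12, 1))  # -96.4
```

On these assumptions, the −96% figure corresponds to the 12-week end of the range; starting from 8 weeks the reduction is closer to −95%.
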
We went from supporting 30 dockets a year to over 120 — same team, with better data. The platform pays for itself every quarter, and our annual reports now ship in days, not months.

— Research Director, US Trade Research Firm (confidential)

Ready for a Similar Result?

Need to aggregate data across dozens of sources, normalize it, and turn it into evidence or reports? SpiderHunts builds production scraping + data platforms with compliance and observability built in. Book a free discovery call.
