Case Study · Web Scraping & Data Aggregation

Multi-Source Product Intelligence Platform for a US Trade Research Firm

A US-based trade research firm (confidential client) needed to scrape, normalize, and analyse product data from 50 different sources — manufacturer catalogues, distributor sites, marketplace listings, and regulatory databases. The goal was to support docket research and produce annual industry reports. SpiderHunts built an end-to-end product intelligence platform: scheduled multi-source scraping, semantic deduplication via vector embeddings, an analyst workbench dashboard, and automated report generation. The platform tracks 2.4 million products in near-real-time and lifted docket coverage from 30 per year to over 120. Same headcount, better data.


Industry: Trade Research & Legal Intelligence
Project Type: Multi-Source Web Scraping + Data Aggregation
Duration: 14 weeks
Investment: £24,000 fixed-price + £1,200/mo operations

Confidential Engagement

This engagement is covered by a non-disclosure agreement. The client name, project URLs, and any identifying details have been withheld. All metrics, architecture details, and outcomes shown are accurate and can be verified by qualified prospective clients on request.

Project Snapshot

A US trade research firm (client identity withheld under non-disclosure) supported clients on docket cases that required deep product-level evidence. That evidence included supplier lists, pricing histories, technical specifications, compliance markings, and country-of-origin data. All of this lived across roughly 50 different websites in inconsistent formats. Their analysts spent weeks per case manually compiling spreadsheets, and annual industry reports took months of cross-referencing.

Industry: Trade Research & Legal Intelligence
Project Type: Multi-Source Web Scraping + Data Aggregation
Duration: 14 weeks
Investment: £24,000 fixed-price + £1,200/mo operations
Sources Aggregated: 50 (manufacturers, distributors, marketplaces, regulators)
Products Tracked: 2.4 million

The Challenge

Before SpiderHunts

Analysts spent 3 weeks per docket on manual product data compilation
50 source websites with inconsistent schemas, formats, and HTML structures
Same product described differently across sources — manual matching was a major bottleneck
Annual industry reports took 8–12 weeks of analyst time to produce
Limited capacity meant the firm could only support roughly 30 dockets per year
Pricing and availability data went stale fast — research was often weeks out of date by submission

After SpiderHunts

50 sources — Automated scraping pipelines
2.4 million — Products tracked in near-real-time
3 weeks → 2 days — Time to compile docket evidence
30/year → 120+/year — Dockets supported
8–12 weeks → 3 days — Annual report production time
<24 hours — Average data freshness lag

The Solution

SpiderHunts designed and built a unified Product Intelligence Platform that:

Handles all 50 sources on automated schedules
Normalizes data into a single product schema
Deduplicates and matches products across sites using semantic vector embeddings
Gives analysts a single workbench to query, compare, and export
Generates annual reports automatically

Multi-source scraping fleet

Fifty source-specific scrapers built with Scrapy, Playwright, or Beautiful Soup depending on site complexity. Some run hourly (price-sensitive), some daily, some weekly. Each scraper has its own monitoring, retry logic, and schema-drift alerting.
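
The retry logic and schema-drift alerting mentioned above could look roughly like the sketch below. It is a minimal illustration, not the platform's actual code: the field set EXPECTED_FIELDS, the function names, and the backoff parameters are all assumptions made for the example.

```python
import time
from typing import Callable

# Hypothetical set of fields a given scraper is expected to emit.
EXPECTED_FIELDS = {"manufacturer", "model", "sku", "price"}


def fetch_with_retry(fetch: Callable[[], dict], attempts: int = 3,
                     base_delay: float = 0.5) -> dict:
    """Call fetch(), retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let monitoring see the failure
            time.sleep(base_delay * 2 ** attempt)


def schema_drift(record: dict) -> dict:
    """Report fields that vanished or appeared relative to EXPECTED_FIELDS."""
    seen = set(record)
    return {
        "missing": sorted(EXPECTED_FIELDS - seen),
        "unexpected": sorted(seen - EXPECTED_FIELDS),
    }
```

A non-empty "missing" list is the kind of signal that would trigger a schema-drift alert, since it usually means the source changed its page layout.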

Anti-bot & compliance layer

Rotating residential proxies, randomized headers, realistic browsing patterns, robots.txt-aware crawling, and rate-limited polite scraping. Every source passes a compliance review before ingestion.
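
As an illustration of the robots.txt-aware, rate-limited part of this layer, here is a small sketch using only Python's standard library. The user agent string, the PoliteGate class, and the default delay are hypothetical; proxy rotation and header randomization are not shown.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"  # hypothetical user agent


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteGate:
    """Checks robots.txt rules and spaces out requests to one host."""

    def __init__(self, robots: RobotFileParser, default_delay: float = 1.0):
        self.robots = robots
        delay = robots.crawl_delay(USER_AGENT)
        # Honour Crawl-delay when the site declares one, else use the default.
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url: str) -> bool:
        return self.robots.can_fetch(USER_AGENT, url)

    def wait_turn(self) -> float:
        """Sleep until the delay has elapsed; return the seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_request = time.monotonic()
        return slept
```

A scraper would call allowed(url) before each request and wait_turn() immediately before sending it, skipping any URL the site disallows.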

Unified product schema

Every source-specific scraper outputs into a single canonical product schema — manufacturer, model, SKU, specs, pricing history, supplier, country of origin, compliance markings. PostgreSQL with JSONB columns for source-specific attributes that do not fit the canonical model.
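
One way to picture the canonical schema is a single record type with an overflow dictionary for everything source-specific. The sketch below is an assumption about shape only: the class name, field names, and the per-source field_map are invented for illustration, and pricing history is left out because it is handled as a time series.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Scalar canonical fields a raw source key may be mapped onto (illustrative).
CANONICAL_FIELDS = ("manufacturer", "model", "sku", "supplier", "country_of_origin")


@dataclass
class CanonicalProduct:
    source: str
    manufacturer: Optional[str] = None
    model: Optional[str] = None
    sku: Optional[str] = None
    supplier: Optional[str] = None
    country_of_origin: Optional[str] = None
    compliance_markings: list[str] = field(default_factory=list)
    specs: dict[str, Any] = field(default_factory=dict)
    # Attributes with no canonical home; would map to a JSONB column.
    extra: dict[str, Any] = field(default_factory=dict)


def normalize(source: str, raw: dict[str, Any],
              field_map: dict[str, str]) -> CanonicalProduct:
    """Map one raw scraped record onto the canonical schema."""
    product = CanonicalProduct(source=source)
    for raw_key, value in raw.items():
        target = field_map.get(raw_key)
        if target in CANONICAL_FIELDS:
            setattr(product, target, value)
        else:
            product.extra[raw_key] = value  # keep it rather than drop it
    return product
```

Keeping unmapped attributes in extra means a new field on a source site is preserved from the first scrape, even before anyone decides whether it belongs in the canonical model.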

Semantic product matching

Different sources describe the same product differently. OpenAI embeddings of product titles, descriptions, and specs are stored in pgvector and enable fuzzy matching across sources. Confidence-graded clustering surfaces ambiguous duplicates for analyst confirmation.
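
The idea of confidence-graded matching can be sketched with plain cosine similarity over embedding vectors. The two thresholds and the three grade names below are assumptions chosen for the example, not the platform's tuned values, and the toy vectors stand in for real embeddings.

```python
import math

# Hypothetical thresholds: at or above AUTO_MERGE two listings are treated
# as the same product; between the two values an analyst is asked to confirm.
AUTO_MERGE = 0.95
NEEDS_REVIEW = 0.80


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grade_match(a: list[float], b: list[float]) -> str:
    """Return 'duplicate', 'review', or 'distinct' for two embeddings."""
    score = cosine_similarity(a, b)
    if score >= AUTO_MERGE:
        return "duplicate"
    if score >= NEEDS_REVIEW:
        return "review"
    return "distinct"
```

In a database-backed setup the similarity would normally be computed inside PostgreSQL rather than in Python; pgvector provides a cosine distance operator (`<=>`) for that purpose.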

Pricing & availability history

Every scrape captures the current state and appends to a time-series. Analysts can plot price evolution, supplier transitions, and stock fluctuations across any date range. This is essential for retrospective docket evidence.
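
A minimal in-memory sketch of the append-only history described here is shown below. The Observation and PriceHistory names are hypothetical, and a production system would keep this in a database table rather than a Python list; the point is only the append-then-query-by-range pattern.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    observed_on: date
    price: float
    supplier: str
    in_stock: bool


class PriceHistory:
    """Append-only series of observations for a single product."""

    def __init__(self) -> None:
        self._observations: list[Observation] = []

    def append(self, obs: Observation) -> None:
        last = self._observations[-1] if self._observations else None
        if last is not None and obs.observed_on < last.observed_on:
            raise ValueError("observations must be appended in date order")
        self._observations.append(obs)

    def between(self, start: date, end: date) -> list[Observation]:
        """Observations with start <= observed_on <= end."""
        dates = [o.observed_on for o in self._observations]
        return self._observations[bisect_left(dates, start):bisect_right(dates, end)]

    def supplier_transitions(self) -> list[tuple[date, str, str]]:
        """(date, old supplier, new supplier) for each change of supplier."""
        pairs = zip(self._observations, self._observations[1:])
        return [(cur.observed_on, prev.supplier, cur.supplier)
                for prev, cur in pairs if prev.supplier != cur.supplier]
```

Because earlier observations are never overwritten, the state of a product on any past date can be reconstructed, which is what retrospective evidence requires.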

Analyst workbench

React + TypeScript dashboard with faceted search across all 2.4M products. Saved searches, watchlists, alerting on changes, side-by-side product comparisons, and one-click export to CSV/Excel.

Annual report generator

Templated industry reports auto-populated with current data — charts, tables, supplier directories, year-on-year trend graphs. Generated as PDF via WeasyPrint with the firm's branding, ready for analyst review and client delivery.
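
The template-population step can be illustrated with the standard library alone. The template, function name, and data below are invented for the example; charts and branding are omitted.

```python
from html import escape
from string import Template

# Hypothetical, heavily simplified report template.
REPORT_TEMPLATE = Template("""<html><body>
<h1>$title</h1>
<table>
<tr><th>Supplier</th><th>Products</th></tr>
$rows
</table>
</body></html>""")


def render_report(title: str, supplier_counts: dict[str, int]) -> str:
    """Fill the template with one table row per supplier, sorted by name."""
    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{count}</td></tr>"
        for name, count in sorted(supplier_counts.items())
    )
    return REPORT_TEMPLATE.substitute(title=escape(title), rows=rows)
```

The resulting HTML string is the kind of input WeasyPrint accepts, for example via `weasyprint.HTML(string=html).write_pdf("report.pdf")`; that call is not executed here so the sketch stays dependency-free.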

Observability & data quality

Datadog dashboards on scrape success rates, freshness lag per source, deduplication confidence distributions, and analyst usage patterns. Schema-drift alerts fire to the engineering team within minutes of any source breaking.
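
Freshness lag per source is simple to compute once each scraper records the time of its last successful run. The sketch below assumes such a mapping exists; the function names and the 24-hour staleness threshold are illustrative choices, and sending the values to Datadog is not shown.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(last_success: dict[str, datetime],
                  now: datetime) -> dict[str, timedelta]:
    """Time elapsed since each source's last successful scrape."""
    return {source: now - ts for source, ts in last_success.items()}


def average_lag(last_success: dict[str, datetime], now: datetime) -> timedelta:
    lags = freshness_lag(last_success, now)
    if not lags:
        return timedelta(0)
    return sum(lags.values(), timedelta(0)) / len(lags)


def stale_sources(last_success: dict[str, datetime], now: datetime,
                  max_lag: timedelta = timedelta(hours=24)) -> list[str]:
    """Names of sources whose lag exceeds max_lag, sorted alphabetically."""
    lags = freshness_lag(last_success, now)
    return sorted(source for source, lag in lags.items() if lag > max_lag)
```

For example, with one source last scraped 3 hours ago and another 30 hours ago, only the second is reported as stale and the average lag is 16.5 hours.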

Technology

Tech Stack

Production-grade components selected for reliability at scale, observability, and the long maintenance horizon of an ongoing data platform.

Python 3.12
Scrapy
Playwright
Beautiful Soup
FastAPI
PostgreSQL
JSONB columns
pgvector
Full-text search
Redis
Celery
AWS ECS (us-east-1)
S3 data lake
React + TypeScript dashboard
WeasyPrint (PDF reports)
Datadog observability

Measurable Outcomes

Results

Measured at the 6-month mark following platform launch and full source ramp-up.

Automated scraping pipelines: 50 sources
Products tracked in near-real-time: 2.4 million
Time to compile docket evidence: 3 weeks → 2 days (−90%)
Dockets supported: 30/year → 120+/year (+300%)
Annual report production time: 8–12 weeks → 3 days (−96%)
Average data freshness lag: <24 hours
Analyst productivity
Project payback period: ~5 months
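
The percentage figures in the results can be reproduced with a one-line formula. The conversion of weeks to calendar days below is an assumption, since the case study does not say whether calendar or working days were used.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Assumption: weeks are converted to calendar days (7 per week).
docket_time = percent_change(3 * 7, 2)        # 3 weeks -> 2 days
dockets = percent_change(30, 120)             # 30/year -> 120/year
report_upper = percent_change(12 * 7, 3)      # 12 weeks -> 3 days
report_lower = percent_change(8 * 7, 3)       # 8 weeks -> 3 days
```

Rounded, these give −90% for docket compilation time and +300% for dockets supported. For report production, the 12-week end of the range gives −96% and the 8-week end gives about −95%, so the published −96% appears to correspond to the upper end of the 8–12 week range.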

We went from supporting 30 dockets a year to over 120 — same team, with better data. The platform pays for itself every quarter, and our annual reports now ship in days, not months.

— Research Director, US Trade Research Firm (confidential)

Ready for a Similar Result?

Need to aggregate data across dozens of sources, normalize it, and turn it into evidence or reports? SpiderHunts builds production scraping + data platforms with compliance and observability built in. Book a free discovery call.
