Multi-Source Product Intelligence Platform for a US Trade Research Firm
A US-based trade research firm (confidential client) needed to scrape, normalize, and analyze product data from 50 sources (manufacturer catalogues, distributor sites, marketplace listings, and regulatory databases) to support docket research and produce annual industry reports. SpiderHunts built an end-to-end product intelligence platform: scheduled multi-source scraping, semantic deduplication via vector embeddings, an analyst workbench dashboard, and automated report generation. The platform tracks 2.4 million products with an average data freshness lag of under 24 hours, and the firm's docket coverage rose from roughly 30 per year to over 120 with the same headcount.
Project Snapshot
- Industry: Trade Research & Legal Intelligence
- Project Type: Multi-Source Web Scraping + Data Aggregation
- Duration: 14 weeks
- Investment: £24,000 fixed-price + £1,200/mo operations
- Sources Aggregated: 50 (manufacturers, distributors, marketplaces, regulators)
- Products Tracked: 2.4 million
The Challenge
A US trade research firm (client identity withheld under non-disclosure) supported clients on docket cases that required deep product-level evidence — supplier lists, pricing histories, technical specifications, compliance markings, and country-of-origin data. All of this lived across roughly 50 different websites in inconsistent formats. Their analysts spent weeks per case manually compiling spreadsheets, and annual industry reports took months of cross-referencing.
Before SpiderHunts
- Analysts spent 3 weeks per docket on manual product data compilation
- Roughly 50 source websites with inconsistent schemas, formats, and HTML structures
- Same product described differently across sources — manual matching was a major bottleneck
- Annual industry reports took 8–12 weeks of analyst time to produce
- Limited capacity meant the firm could only support roughly 30 dockets per year
- Pricing and availability data went stale fast — research was often weeks out of date by submission
After SpiderHunts
- 50 sources — Automated scraping pipelines
- 2.4 million — Products tracked in near-real-time
- 3 weeks → 2 days — Time to compile docket evidence
- 30/year → 120+/year — Dockets supported
- 8–12 weeks → 3 days — Annual report production time
- <24 hours — Average data freshness lag
The Solution
SpiderHunts designed and built a unified Product Intelligence Platform that scrapes all 50 sources on automated schedules, normalizes the data into a single product schema, and deduplicates and matches products across sites using semantic vector embeddings. Analysts get a single workbench to query, compare, and export data, alongside automated annual report generation.
01. Multi-source scraping fleet
Fifty source-specific scrapers, each built with Scrapy, Playwright, or Beautiful Soup depending on site complexity. Price-sensitive sources are scraped hourly; others run daily or weekly. Each scraper has its own monitoring, retry logic, and schema-drift alerting.
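As a rough illustration of the retry logic and schema-drift alerting described above, here is a minimal sketch using only the Python standard library. The field set, thresholds, and function names (`EXPECTED_FIELDS`, `detect_schema_drift`, `with_retries`) are assumptions made for this example, not the platform's actual code.

```python
import time
from typing import Callable, Iterable

# Hypothetical set of fields every scraper is expected to emit.
EXPECTED_FIELDS = {"manufacturer", "model", "sku", "price"}


def detect_schema_drift(records: Iterable[dict],
                        expected: set[str] = EXPECTED_FIELDS,
                        max_missing_ratio: float = 0.2) -> dict[str, float]:
    """Return the fields whose share of missing values exceeds the threshold."""
    rows = list(records)
    if not rows:
        # An empty scrape is treated as every field having drifted.
        return {name: 1.0 for name in expected}
    drifted = {}
    for name in sorted(expected):
        missing = sum(1 for row in rows if row.get(name) in (None, ""))
        ratio = missing / len(rows)
        if ratio > max_missing_ratio:
            drifted[name] = ratio
    return drifted


def with_retries(fetch: Callable[[], list[dict]],
                 attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> list[dict]:
    """Call fetch, retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to monitoring
            sleep(base_delay * 2 ** attempt)
    return []  # not reached; keeps type checkers satisfied
```

A scheduler would wrap each scraper run in `with_retries` and raise an alert whenever `detect_schema_drift` returns a non-empty result, which usually means the source changed its HTML structure.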
02. Anti-bot & compliance layer
The fleet uses rotating residential proxies, randomized headers, and realistic browsing patterns, combined with robots.txt-aware crawling and rate-limited, polite scraping. A compliance review is run on every source before ingestion.
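The robots.txt-aware, rate-limited part of this layer could look roughly like the sketch below, which relies on the standard library's `urllib.robotparser`. The `PoliteGate` class, the user agent string, and the one-second minimum delay are hypothetical choices for illustration; proxy rotation and header randomization are out of scope here.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"  # hypothetical user agent string


class PoliteGate:
    """Decides whether and when a URL on one host may be fetched."""

    def __init__(self, robots_txt: str, min_delay: float = 1.0):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        # Honour a declared Crawl-delay, but never go below our own floor.
        declared = self.parser.crawl_delay(USER_AGENT)
        self.delay = max(min_delay, float(declared)) if declared is not None else min_delay
        self._last_fetch: Optional[float] = None

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(USER_AGENT, url)

    def seconds_to_wait(self, now: float) -> float:
        """Seconds the caller should still sleep before the next request."""
        if self._last_fetch is None:
            return 0.0
        return max(0.0, self._last_fetch + self.delay - now)

    def record_fetch(self, now: float) -> None:
        self._last_fetch = now
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the gate easy to test and lets one scheduler share a clock across many hosts.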
03. Unified product schema
Every source-specific scraper writes to a single canonical product schema: manufacturer, model, SKU, specs, pricing history, supplier, country of origin, and compliance markings. The store is PostgreSQL, with JSONB columns for source-specific attributes that do not fit the canonical model.
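To make the canonical-plus-overflow idea concrete, the following sketch maps one source's raw record onto a shared dataclass and routes unmapped keys into an `extra` dictionary, which would correspond to a JSONB column. The source name `distributor_a`, its raw keys, and the `FIELD_MAP` structure are invented for this example, and only a subset of the canonical fields is shown.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CanonicalProduct:
    source: str
    manufacturer: str
    model: str
    sku: Optional[str] = None
    country_of_origin: Optional[str] = None
    compliance_markings: list[str] = field(default_factory=list)
    # Attributes that do not fit the canonical model (JSONB column analogue).
    extra: dict[str, Any] = field(default_factory=dict)


# Hypothetical per-source mapping: raw key -> canonical field name.
FIELD_MAP = {
    "distributor_a": {
        "brand": "manufacturer",
        "part_no": "model",
        "sku": "sku",
        "origin": "country_of_origin",
    },
}


def normalize(source: str, raw: dict[str, Any]) -> CanonicalProduct:
    """Map a raw scraped record onto the canonical schema."""
    mapping = FIELD_MAP[source]
    canonical: dict[str, Any] = {}
    extra: dict[str, Any] = {}
    for key, value in raw.items():
        if key in mapping:
            # Trim stray whitespace that scraped strings often carry.
            canonical[mapping[key]] = value.strip() if isinstance(value, str) else value
        else:
            extra[key] = value
    # Raises TypeError if a required field (manufacturer, model) is absent.
    return CanonicalProduct(source=source, extra=extra, **canonical)
```

For instance, `normalize("distributor_a", {"brand": " Acme ", "part_no": "X-100", "warranty_months": 24})` yields a product with `manufacturer="Acme"` and `extra={"warranty_months": 24}`.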
04. Semantic product matching
Different sources describe the same product differently. OpenAI embeddings of product titles, descriptions, and specs are stored in pgvector, which enables fuzzy matching across sources. Confidence-graded clustering surfaces ambiguous duplicates for analyst confirmation.
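One simple way to picture confidence-graded matching is to compare two embedding vectors by cosine similarity and bucket the score into three outcomes. The sketch below does this in plain Python with toy two-dimensional vectors; the thresholds `AUTO_MATCH` and `NEEDS_REVIEW` are assumed values for illustration, and the platform's actual clustering method is not described in enough detail to reproduce here. In the database, pgvector offers the related cosine distance through its `<=>` operator.

```python
import math
from typing import Sequence

# Hypothetical thresholds; real values would be tuned on analyst-labelled pairs.
AUTO_MATCH = 0.92
NEEDS_REVIEW = 0.80


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grade_match(a: Sequence[float], b: Sequence[float]) -> str:
    """Bucket a pair into 'match', 'review' (analyst confirms), or 'distinct'."""
    score = cosine_similarity(a, b)
    if score >= AUTO_MATCH:
        return "match"
    if score >= NEEDS_REVIEW:
        return "review"
    return "distinct"
```

Pairs graded `review` are the ambiguous cases that would be queued for an analyst rather than merged automatically.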
05. Pricing & availability history
Every scrape captures the current state and appends to a time-series. Analysts can plot price evolution, supplier transitions, and stock fluctuations across any date range — essential for retrospective docket evidence.
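The append-only time-series idea can be sketched as follows: each scrape adds one observation, and queries slice by date range or detect supplier changes. The `Observation` and `PriceHistory` names and fields are assumptions for this example; in the real platform this data would presumably live in PostgreSQL rather than in memory.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    observed_on: date
    price: float
    supplier: str


class PriceHistory:
    """Append-only series of observations for one product, in date order."""

    def __init__(self) -> None:
        self._rows: list[Observation] = []

    def append(self, obs: Observation) -> None:
        # Existing rows are never edited; new rows must not go back in time.
        if self._rows and obs.observed_on < self._rows[-1].observed_on:
            raise ValueError("observations must be appended in date order")
        self._rows.append(obs)

    def between(self, start: date, end: date) -> list[Observation]:
        """Observations with start <= observed_on <= end."""
        days = [row.observed_on for row in self._rows]
        return self._rows[bisect_left(days, start):bisect_right(days, end)]

    def supplier_transitions(self) -> list[tuple[date, str, str]]:
        """(date, old supplier, new supplier) for each change of supplier."""
        changes = []
        for prev, cur in zip(self._rows, self._rows[1:]):
            if prev.supplier != cur.supplier:
                changes.append((cur.observed_on, prev.supplier, cur.supplier))
        return changes
```

Because earlier rows are never overwritten, the state of a product on any past date can be reconstructed, which is what retrospective evidence needs.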
06. Analyst workbench
React + TypeScript dashboard with faceted search across all 2.4M products. Saved searches, watchlists, alerting on changes, side-by-side product comparisons, and one-click export to CSV/Excel.
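Faceted search boils down to two steps: filter the catalogue by the selected values, then count the remaining products per facet so the interface can show how many results each further filter would leave. The sketch below shows that logic in Python over an in-memory list; the `faceted_search` function and the sample records are made up for illustration, and a catalogue of 2.4M products would rely on database or search-engine indexes instead.

```python
from collections import Counter
from typing import Any

Product = dict[str, Any]


def faceted_search(products: list[Product],
                   filters: dict[str, Any],
                   facet_fields: list[str]) -> tuple[list[Product], dict[str, Counter]]:
    """Return products matching every filter, plus value counts per facet."""
    matches = [
        p for p in products
        if all(p.get(key) == value for key, value in filters.items())
    ]
    facets = {
        name: Counter(p[name] for p in matches if p.get(name) is not None)
        for name in facet_fields
    }
    return matches, facets


# Made-up sample records, for demonstration only.
catalog = [
    {"sku": "A-1", "manufacturer": "Acme", "country_of_origin": "DE"},
    {"sku": "A-2", "manufacturer": "Acme", "country_of_origin": "CN"},
    {"sku": "B-1", "manufacturer": "Borealis", "country_of_origin": "DE"},
]
hits, facets = faceted_search(catalog, {"manufacturer": "Acme"}, ["country_of_origin"])
```

Here `hits` contains the two Acme products, and `facets["country_of_origin"]` counts one product each for `DE` and `CN`.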
07. Annual report generator
Templated industry reports auto-populated with current data — charts, tables, supplier directories, year-on-year trend graphs. Generated as PDF via WeasyPrint with the firm's branding, ready for analyst review and client delivery.
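A templated report of this kind can be thought of as data in, HTML out, with the PDF conversion as a final step. The sketch below fills a small HTML template with a year-on-year table using only the standard library; the template, the function names, and the sample figures are invented for illustration. With WeasyPrint installed, the resulting string could then be converted with `HTML(string=html).write_pdf("report.pdf")`.

```python
from html import escape
from string import Template

# Hypothetical minimal template; a real one would carry branding and charts.
REPORT_TEMPLATE = Template("""<html><body>
<h1>$title</h1>
<table>
<tr><th>Category</th><th>$prev_year</th><th>$cur_year</th><th>Change</th></tr>
$rows
</table>
</body></html>""")


def yoy_change(previous: float, current: float) -> str:
    """Year-on-year change as a signed percentage, or 'n/a' without a baseline."""
    if previous == 0:
        return "n/a"
    return f"{(current - previous) / previous:+.1%}"


def render_report(title: str, prev_year: int, cur_year: int,
                  counts: dict[str, tuple[float, float]]) -> str:
    """Render the report HTML; counts maps category -> (previous, current)."""
    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{prev}</td><td>{cur}</td>"
        f"<td>{yoy_change(prev, cur)}</td></tr>"
        for name, (prev, cur) in sorted(counts.items())
    )
    return REPORT_TEMPLATE.substitute(
        title=escape(title), prev_year=prev_year, cur_year=cur_year, rows=rows
    )


# Made-up figures, for demonstration only.
html = render_report("Sample report", 2023, 2024, {"Valves & fittings": (200, 250)})
```

Keeping rendering separate from PDF conversion means the same HTML can be previewed in a browser during analyst review.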
08. Observability & data quality
Datadog dashboards on scrape success rates, freshness lag per source, deduplication confidence distributions, and analyst usage patterns. Schema-drift alerts fire to the engineering team within minutes of any source breaking.
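Freshness lag per source is simply the time since that source's last successful scrape, compared against a limit that depends on how often the source is scheduled. The sketch below computes it with the standard library; the source names, the per-source limits, and the 24-hour default are assumptions for illustration, and shipping the values to Datadog is not shown.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(last_success: dict[str, datetime],
                  now: datetime) -> dict[str, timedelta]:
    """Time elapsed since each source's last successful scrape."""
    return {source: now - ts for source, ts in last_success.items()}


def stale_sources(last_success: dict[str, datetime],
                  now: datetime,
                  max_lag_by_source: dict[str, timedelta],
                  default_max_lag: timedelta = timedelta(hours=24)) -> list[str]:
    """Sources whose lag exceeds their own limit (or the default)."""
    stale = [
        source
        for source, lag in freshness_lag(last_success, now).items()
        if lag > max_lag_by_source.get(source, default_max_lag)
    ]
    return sorted(stale)


# Made-up timestamps and limits, for demonstration only.
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
last = {
    "hourly_marketplace": now - timedelta(hours=3),
    "daily_distributor": now - timedelta(hours=30),
    "weekly_regulator": now - timedelta(days=2),
}
limits = {
    "hourly_marketplace": timedelta(hours=2),
    "weekly_regulator": timedelta(days=8),
}
overdue = stale_sources(last, now, limits)
```

In this example `overdue` lists the hourly and daily sources but not the weekly one, since a two-day gap is normal for a weekly schedule.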
Tech Stack
Production-grade components selected for reliability at scale, observability, and the long maintenance horizon of an ongoing data platform.
Results
Measured at the 6-month mark following platform launch and full source ramp-up.
We went from supporting 30 dockets a year to over 120 — same team, with better data. The platform pays for itself every quarter, and our annual reports now ship in days, not months.
— Research Director, US Trade Research Firm (confidential)
Ready for a Similar Result?
Need to aggregate data across dozens of sources, normalize it, and turn it into evidence or reports? SpiderHunts builds production scraping + data platforms with compliance and observability built in. Book a free discovery call.