How to Build an AI Agent That Browses the Web and Extracts Data

A practical guide to building a web research agent: the architecture, tools, code patterns, and the pitfalls that trip up most teams on their first build.

By SpiderHunts Technologies  ·  23 May 2026  ·  11 min read

TL;DR

  • A web-browsing agent needs 4 tools: search, browse, extract, save
  • Use Playwright (not Requests) for JavaScript-rendered pages
  • GPT-4o with structured output is a reliable choice for extraction
  • Rate limiting and retry logic are non-negotiable for production
  • Always store raw HTML alongside extracted data; you'll need it for debugging
  • The hardest part is not the code; it's defining exactly what "good data" looks like

What a Web Research Agent Actually Does

A web research agent is an AI system that can receive a research goal, autonomously search the web, browse relevant pages, extract specific structured data, and save it somewhere useful, all without a human clicking anything.
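
The flow described above can be sketched as a fixed pipeline. This is a minimal illustration, not an LLM-driven tool-calling loop: the AgentTools bundle, the callable signatures, and the max_pages parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentTools:
    """Hypothetical bundle of the four tool callables the agent needs."""
    web_search: Callable[[str], list[str]]  # query -> candidate URLs
    browse_url: Callable[[str], str]        # URL -> readable page text
    extract: Callable[[str], dict]          # page text -> structured record
    save: Callable[[dict], None]            # record -> persisted somewhere


def run_research(goal: str, tools: AgentTools, max_pages: int = 3) -> list[dict]:
    """Search, browse, extract, and save for one research goal."""
    saved = []
    for url in tools.web_search(goal)[:max_pages]:
        try:
            text = tools.browse_url(url)
        except Exception as exc:  # one bad page should not stop the run
            print(f"skipping {url}: {exc}")
            continue
        record = tools.extract(text)
        record["source_url"] = url  # keep provenance with every record
        tools.save(record)
        saved.append(record)
    return saved
```

Passing the tools in as callables keeps the loop testable with stubs before any real search or browser code exists.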

Examples of what businesses use these for:

  • Enriching a CRM with competitor pricing, company size, and decision-maker info
  • Monitoring price changes across supplier or competitor websites
  • Building lead lists from directories, LinkedIn, and industry sites
  • Tracking job postings to understand competitor hiring signals
  • Extracting product data from e-commerce sites for price comparison
  • Collecting regulatory or compliance updates from government websites

The Architecture: Four Tools You Need

Tool 1: web_search

Returns a list of URLs and snippets for a query. The agent uses this to find the right pages to browse, rather than guessing URLs.

Implementation: Serper API (£0.001/query) or Bing Search API. Returns top 10 results with title, URL, and snippet.
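
A rough sketch of this tool using only the standard library follows. The endpoint URL, the X-API-KEY header, the SERPER_API_KEY environment variable, and the response shape (an "organic" list with "title", "link", and "snippet" keys) are assumptions here; confirm them against the provider's current documentation before relying on them.

```python
import json
import os
import urllib.request
from dataclasses import dataclass

SERPER_URL = "https://google.serper.dev/search"  # assumed endpoint


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def parse_results(payload: dict, limit: int = 10) -> list[SearchResult]:
    """Convert a search response into results.

    Assumes an 'organic' list of objects with 'title', 'link', 'snippet'.
    """
    results = []
    for item in payload.get("organic", [])[:limit]:
        if not item.get("link"):
            continue  # a result without a URL is useless to browse_url
        results.append(SearchResult(
            title=item.get("title", ""),
            url=item["link"],
            snippet=item.get("snippet", ""),
        ))
    return results


def web_search(query: str, limit: int = 10) -> list[SearchResult]:
    """Call the search API and return up to `limit` results."""
    request = urllib.request.Request(
        SERPER_URL,
        data=json.dumps({"q": query}).encode("utf-8"),
        headers={
            "X-API-KEY": os.environ["SERPER_API_KEY"],  # assumed env var name
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        return parse_results(json.load(response), limit)
```

Keeping parse_results separate from the network call means the parsing logic can be checked against a saved response without spending API credits.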

Tool 2: browse_url

Fetches a URL and returns cleaned, readable text. This is the core browsing capability: it's what lets the agent "read" a web page.

Implementation: Playwright for JS-rendered pages; requests + BeautifulSoup for static HTML. Strip nav, footers, ads before returning.
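
One way to choose between the two fetch paths is to try the cheap static fetch first and fall back to Playwright only when the page looks like an empty JavaScript shell. The heuristic below is an assumption for illustration (the 200-character threshold is arbitrary and should be tuned on your own pages), and it uses the standard library's html.parser rather than BeautifulSoup to stay dependency-free.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "noscript"}


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring script/style/noscript contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a reader would see, joined with single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def needs_js_rendering(html: str, min_chars: int = 200) -> bool:
    """Guess whether a statically fetched page is an empty JS shell."""
    return len(visible_text(html)) < min_chars
```

If needs_js_rendering returns True for the static response, hand the URL to the Playwright path; otherwise the static HTML is probably enough.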

Tool 3: extract_structured_data

Takes raw page text and extracts specific fields as structured JSON. This is where GPT-4o does its work: reading unstructured content and pulling out exactly the fields you defined.

Implementation: GPT-4o with Structured Outputs, which constrains the response to a schema you supply (plain JSON mode only guarantees valid JSON, not your fields). Pydantic models work well for defining that schema.

Tool 4: save_data

Persists the extracted data to the destination: CRM, database, spreadsheet, or webhook. Includes validation against your schema before writing.

Implementation: Depends on the destination, for example the HubSpot API, PostgreSQL via SQLAlchemy, the Google Sheets API, or a webhook to n8n/Zapier.
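
As a self-contained sketch of "validate, then write", the example below swaps in SQLite from the standard library instead of the destinations listed above, and a hand-rolled type check instead of Pydantic. The table name, column names, and REQUIRED_FIELDS mapping are hypothetical. It also stores the raw page text next to the extracted record so a bad extraction can be debugged without re-crawling.

```python
import json
import sqlite3

# Hypothetical minimum schema: field name -> required Python type.
REQUIRED_FIELDS = {"company_name": str, "website": str, "free_trial_available": bool}


def validate(record: dict) -> None:
    """Raise ValueError if a required field is missing or has the wrong type."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"{name!r} must be a {expected.__name__}")


def save_data(conn: sqlite3.Connection, record: dict, raw_text: str) -> int:
    """Validate a record, then store it with the raw page text it came from."""
    validate(record)  # nothing is written if validation fails
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS extractions ("
            "id INTEGER PRIMARY KEY, company_name TEXT NOT NULL, "
            "data_json TEXT NOT NULL, raw_text TEXT NOT NULL)"
        )
        cursor = conn.execute(
            "INSERT INTO extractions (company_name, data_json, raw_text) "
            "VALUES (?, ?, ?)",
            (record["company_name"], json.dumps(record), raw_text),
        )
    return cursor.lastrowid
```

The same shape carries over to a CRM or webhook destination: validate first, write second, and keep the raw input addressable by the same ID.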

Step-by-Step Build Guide

Step 1: Define the Data Schema First

Before writing any code, define exactly what data you want. Vague goals produce vague results.

from pydantic import BaseModel
from typing import Optional

class CompanyData(BaseModel):
    company_name: str
    website: str
    pricing_tier_starter: Optional[str]
    pricing_tier_pro: Optional[str]
    pricing_tier_enterprise: Optional[str]
    free_trial_available: bool
    last_updated: str  # ISO date

Step 2: Build the browse_url Tool

Use Playwright for reliable rendering of modern JavaScript-heavy sites:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def browse_url(url: str) -> str:
    """Fetch and return readable text from a web page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    # Remove noise
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    # Cap the text at 8,000 characters to keep the LLM prompt small
    return soup.get_text(separator="\n", strip=True)[:8000]

Step 3: Build the Extraction Tool

Use GPT-4o's structured output to reliably extract your schema:

from openai import OpenAI

client = OpenAI()

def extract_structured_data(
    page_text: str,
    company_name: str
) -> CompanyData:
    """Extract structured company data from page text."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
                "Extract pricing information from the text. "
                "Use null for fields not found."},
            {"role": "user", "content":
                f"Company: {company_name}\n\n{page_text}"}
        ],
        response_format=CompanyData
    )
    parsed = response.choices[0].message.parsed
    if parsed is None:  # e.g. the model refused to answer
        raise ValueError("Extraction returned no parsed data")
    return parsed

Step 4: Add Rate Limiting and Error Handling

Production agents fail without this. Most websites block rapid requests, and API calls fail intermittently:

import time
import random
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, min=2, max=10))
def browse_url_with_retry(url: str) -> str:
    time.sleep(random.uniform(1, 3))  # polite delay
    return browse_url(url)
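
A random delay before each call slows the agent down everywhere, even when consecutive requests go to different sites. A possible refinement, sketched below under the assumption that the agent runs in a single thread, is to track the last request time per host and wait only when the same host is hit again too soon. DomainRateLimiter and its parameters are hypothetical names; the clock and sleep functions are injectable so the behaviour can be tested without real waiting.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, min_interval: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the host may be hit again; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to start.
        self._last_request[host] = now + delay
        return delay
```

Calling limiter.wait(url) at the top of browse_url_with_retry, in place of the fixed sleep, would keep the polite gap per site while letting requests to other hosts proceed immediately.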

Common Pitfalls and How to Avoid Them

Pitfall | What Happens | Fix
Using requests for JS pages | Gets empty/broken content | Use Playwright or Selenium
No rate limiting | IP gets blocked, agent fails silently | Random 1–3s delays, retry logic
Passing raw HTML to LLM | Fills context window with noise | Strip HTML, limit to 8k chars
No schema validation | Hallucinated or missing data saved to CRM | Pydantic models, GPT-4o structured output
No raw data storage | Can't debug or re-extract without re-crawling | Store raw text alongside extracted JSON
Ignoring robots.txt | Legal and reputational risk | Check robots.txt; use official APIs where available
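
For the robots.txt check, Python's standard library already includes a parser. The sketch below assumes a hypothetical user agent name, "research-agent"; load_robots makes a network call, while parse_robots works on text you have already downloaded or cached.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-agent"  # hypothetical agent name


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def load_robots(url: str) -> RobotFileParser:
    """Download and parse robots.txt for the site that hosts `url`."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # network call
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)
```

Caching one parser per host avoids re-downloading robots.txt for every page on the same site.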

Infrastructure and Cost

Component | Option | Monthly Cost
Compute (Playwright) | Cloud Run or £15/mo VPS | £10–£30
Search API | Serper (1000 queries/mo free) | £0–£25
LLM (GPT-4o) | ~£0.01 per extraction | £10–£80 (volume dependent)
Database | Supabase (free tier) | £0–£20
Monitoring (LangSmith) | Developer plan | £0–£30
Running cost for a typical web research agent processing 500 pages/month: £30–£150/month. Build time: 2–4 weeks for a developer; 4–8 weeks for a full production system with monitoring and error handling.
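
To sanity-check these figures against your own volume, the arithmetic can be written down directly. The ranges below are copied from the cost table above; the function names are illustrative, and the per-extraction price is the article's rough estimate rather than a quoted rate.

```python
# Monthly cost ranges in GBP, copied from the cost table: (low, high).
COMPONENT_COSTS = {
    "compute": (10, 30),
    "search_api": (0, 25),
    "llm": (10, 80),
    "database": (0, 20),
    "monitoring": (0, 30),
}


def total_range(costs: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every component's range."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


def llm_cost(pages: int, cost_per_extraction: float = 0.01) -> float:
    """Estimate LLM spend assuming one extraction call per page."""
    return round(pages * cost_per_extraction, 2)


print(total_range(COMPONENT_COSTS))  # (20, 185)
print(llm_cost(500))                 # 5.0
```

The summed bounds of £20–£185 bracket the typical £30–£150 figure. At one extraction per page, 500 pages come to about £5 of LLM spend, so the table's higher LLM range presumably leaves room for retries, several extractions per page, or larger volumes.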

Want Us to Build It For You?

We build production-grade web research agents with proper error handling, monitoring, and CRM integration. Tell us what data you need to collect.
