How to Build an AI Agent That Browses the Web and Extracts Data

Last updated: 2026-05-23

A practical guide to building a web research agent: the architecture, tools, code patterns, and the pitfalls that trip up most teams on their first build.

By SpiderHunts Technologies · 23 May 2026 · 11 min read

TL;DR

A web-browsing agent needs four tools: search, browse, extract, save
Use Playwright (not requests) for JavaScript-rendered pages
GPT-4o with structured output is a strong default for extraction
Rate limiting and retry logic are non-negotiable for production
Always store raw HTML alongside extracted data — you'll need it for debugging
The hardest part is not the code — it's defining exactly what "good data" looks like

What a Web Research Agent Actually Does

A web research agent is an AI system that can receive a research goal, autonomously search the web, browse relevant pages, extract specific structured data, and save it somewhere useful — all without a human clicking anything.

Examples of what businesses use these for:

Enriching a CRM with competitor pricing, company size, and decision-maker info
Monitoring price changes across supplier or competitor websites
Building lead lists from directories, LinkedIn, and industry sites
Tracking job postings to understand competitor hiring signals
Extracting product data from e-commerce sites for price comparison
Collecting regulatory or compliance updates from government websites

The Architecture: Four Tools You Need

Tool 1: web_search

Returns a list of URLs and snippets for a query. The agent uses this to find the right pages to browse, rather than guessing URLs.

Implementation: Serper API (£0.001/query) or Bing Search API. Returns top 10 results with title, URL, and snippet.
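
A minimal sketch of what this tool might look like, using only the standard library. The endpoint URL, the X-API-KEY header, the "organic" response key, and the SERPER_API_KEY environment variable name are assumptions about the Serper API rather than details from this guide, so check them against the provider's documentation before relying on them.

```python
import json
import os
import urllib.request

# Assumed endpoint for Serper; verify against the provider's documentation.
SERPER_URL = "https://google.serper.dev/search"


def parse_results(payload: dict, limit: int = 10) -> list[dict]:
    """Reduce a search response to title, URL, and snippet per result."""
    results = []
    # "organic" is the assumed key holding the ranked web results
    for item in payload.get("organic", [])[:limit]:
        results.append({
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        })
    return results


def web_search(query: str, limit: int = 10) -> list[dict]:
    """Send the query to the search API and return cleaned results."""
    request = urllib.request.Request(
        SERPER_URL,
        data=json.dumps({"q": query}).encode("utf-8"),
        headers={
            "X-API-KEY": os.environ["SERPER_API_KEY"],  # hypothetical env var name
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        payload = json.load(response)
    return parse_results(payload, limit)
```

Keeping the parsing in its own function means the result shape the agent sees stays the same if you later swap Serper for Bing or another provider.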

Tool 2: browse_url

Fetches a URL and returns cleaned, readable text. This is the core browsing capability — it's what lets the agent "read" a web page.

Implementation: Playwright for JS-rendered pages; requests + BeautifulSoup for static HTML. Strip nav, footers, ads before returning.
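
One way to combine the two fetch paths is to try the cheap static fetch first and fall back to a rendered fetch only when little text comes back. The sketch below is an illustration, not part of the build in this guide: fetch_text, the two fetcher arguments, and the 500-character threshold are all hypothetical and would need tuning for your target sites.

```python
from typing import Callable

MIN_USEFUL_CHARS = 500  # hypothetical threshold; tune per site


def fetch_text(
    url: str,
    static_fetch: Callable[[str], str],
    rendered_fetch: Callable[[str], str],
    min_chars: int = MIN_USEFUL_CHARS,
) -> str:
    """Try the cheap static fetch first; fall back to a rendered fetch."""
    try:
        text = static_fetch(url)
    except Exception:
        text = ""  # treat a failed static fetch like an empty page
    if len(text.strip()) >= min_chars:
        return text
    # Very little text often means the content is rendered by JavaScript
    return rendered_fetch(url)
```

Here static_fetch would wrap requests + BeautifulSoup and rendered_fetch would wrap Playwright. Passing them in as arguments keeps the decision logic testable without network access.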

Tool 3: extract_structured_data

Takes raw page text and extracts specific fields as structured JSON. This is where GPT-4o does its work — reading unstructured content and pulling out exactly what you defined.

Implementation: GPT-4o with Structured Outputs (stricter than plain JSON mode, which guarantees valid JSON but not adherence to your schema) and a schema defining the fields to extract. Pydantic models work well here.

Tool 4: save_data

Persists the extracted data to the destination: CRM, database, spreadsheet, or webhook. Includes validation against your schema before writing.

Implementation: Depends on destination — HubSpot API, PostgreSQL via SQLAlchemy, Google Sheets API, or a webhook to n8n/Zapier.
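
As a rough illustration of validate-then-write, the sketch below uses SQLite from the standard library as a stand-in for PostgreSQL or a CRM API, and a hand-rolled field check as a stand-in for Pydantic validation. The table name, column names, and REQUIRED_FIELDS mapping are assumptions made for the example. It also stores the raw page text next to the extracted record, so a bad extraction can be debugged without re-crawling.

```python
import json
import sqlite3

# Assumed minimum for a usable record; in practice the Pydantic schema decides this.
REQUIRED_FIELDS = {"company_name": str, "website": str, "free_trial_available": bool}


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            problems.append(f"{field} must be {expected.__name__}")
    return problems


def save_data(conn: sqlite3.Connection, record: dict, raw_text: str) -> bool:
    """Validate, then store the record and the raw page text together."""
    if validate(record):
        return False  # reject rather than write bad data
    conn.execute(
        "CREATE TABLE IF NOT EXISTS companies "
        "(website TEXT PRIMARY KEY, data TEXT NOT NULL, raw_text TEXT NOT NULL)"
    )
    # Keyed on website so a re-crawl updates the row instead of duplicating it
    conn.execute(
        "INSERT OR REPLACE INTO companies VALUES (?, ?, ?)",
        (record["website"], json.dumps(record), raw_text),
    )
    conn.commit()
    return True
```

Returning False instead of raising lets the calling agent log the rejected record and move on to the next page.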

Step-by-Step Build Guide

Step 1 — Define the Data Schema First

Before writing any code, define exactly what data you want. Vague goals produce vague results.

from pydantic import BaseModel
from typing import Optional

class CompanyData(BaseModel):
    company_name: str
    website: str
    # Optional fields may be null when a tier is not listed on the page
    pricing_tier_starter: Optional[str]
    pricing_tier_pro: Optional[str]
    pricing_tier_enterprise: Optional[str]
    free_trial_available: bool
    last_updated: str  # ISO date

Step 2 — Build the browse_url Tool

Use Playwright for reliable rendering of modern JavaScript-heavy sites:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def browse_url(url: str) -> str:
    """Fetch and return readable text from a web page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for network activity to settle so JS-rendered content is present
            page.goto(url, wait_until="networkidle", timeout=30000)
            html = page.content()
        finally:
            browser.close()  # release the browser even if navigation fails

    soup = BeautifulSoup(html, "html.parser")
    # Remove noise: scripts, styles, and page chrome
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    # Truncate to 8,000 characters to keep the LLM prompt small
    return soup.get_text(separator="\n", strip=True)[:8000]

Step 3 — Build the Extraction Tool

Use GPT-4o's structured output to reliably extract your schema:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_structured_data(
    page_text: str,
    company_name: str
) -> CompanyData:
    """Extract structured company data from page text."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
                "Extract pricing information from the text. "
                "Use null for fields not found."},
            {"role": "user", "content":
                f"Company: {company_name}\n\n{page_text}"}
        ],
        response_format=CompanyData  # Pydantic model from Step 1
    )
    parsed = response.choices[0].message.parsed
    if parsed is None:
        # The model refused or returned nothing parseable
        raise ValueError(f"No structured data returned for {company_name}")
    return parsed
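
Structured output constrains the shape of the result, not its accuracy, so a few plain checks before saving can catch obviously bad records. The sketch below is one possible set of checks on a record converted to a dict. The function name quality_issues and the specific rules are assumptions for illustration, and your own definition of "good data" will likely need more.

```python
from datetime import date
from urllib.parse import urlparse


def quality_issues(record: dict) -> list[str]:
    """Return human-readable problems found in an extracted record."""
    issues = []
    if not (record.get("company_name") or "").strip():
        issues.append("company_name is empty")
    # Require an absolute URL so downstream tools can fetch it directly
    parsed = urlparse(record.get("website") or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        issues.append("website is not an absolute http(s) URL")
    # The schema stores last_updated as a string, so verify the format here
    try:
        date.fromisoformat(record.get("last_updated") or "")
    except ValueError:
        issues.append("last_updated is not an ISO date")
    return issues
```

With Pydantic v2, the model returned by extract_structured_data can be converted with model_dump() before being passed to a check like this.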

Step 4 — Add Rate Limiting and Error Handling

Production agents fail without this. Most websites block rapid requests, and API calls fail intermittently:

import time
import random
from tenacity import retry, stop_after_attempt, wait_exponential

# Up to 3 attempts, with exponential backoff bounded to 2-10 seconds
@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, min=2, max=10))
def browse_url_with_retry(url: str) -> str:
    time.sleep(random.uniform(1, 3))  # polite delay
    return browse_url(url)
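
A random delay spaces out every request equally, even when consecutive requests go to different sites. If the agent crawls many hosts, a per-domain throttle may be a better fit. The sketch below is a hypothetical addition, not part of the build above: the DomainThrottle class and its 2-second default are assumptions, and the clock and sleep functions are injectable only so the logic can be tested without real waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, min_interval: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting url; return the seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Time still owed to this host since its previous request
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request actually goes out
        self._last_request[host] = now + delay
        return delay
```

Calling throttle.wait(url) at the top of browse_url_with_retry, in place of or alongside the random sleep, would apply the limit per site rather than globally.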

Common Pitfalls and How to Avoid Them

Pitfall	What Happens	Fix
Using requests for JS pages	Gets empty/broken content	Use Playwright or Selenium
No rate limiting	IP gets blocked, agent fails silently	Random 1–3s delays, retry logic
Passing raw HTML to LLM	Fills context window with noise	Strip HTML, limit to 8k chars
No schema validation	Hallucinated or missing data saved to CRM	Pydantic models, GPT-4o structured output
No raw data storage	Can't debug or re-extract without re-crawling	Store raw text alongside extracted JSON
Ignoring robots.txt	Legal and reputational risk	Check robots.txt; use official APIs where available
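
The robots.txt check in the last row can be done with Python's standard library. This is a minimal sketch: the "research-agent" user agent string and the helper names are placeholders, and the example parses robots.txt text you have already downloaded rather than fetching it.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-agent"  # hypothetical user agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str,
               user_agent: str = USER_AGENT) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Example rules that block one directory for every crawler
rules = load_rules("User-agent: *\nDisallow: /private/\n")
print(is_allowed(rules, "https://example.com/pricing"))
print(is_allowed(rules, "https://example.com/private/data"))
```

To fetch the file directly, RobotFileParser also offers set_url() and read(). Caching one parser per domain avoids downloading robots.txt before every page.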

Infrastructure and Cost

Component	Option	Monthly Cost
Compute (Playwright)	Cloud Run or £15/mo VPS	£10–£30
Search API	Serper (1000 queries/mo free)	£0–£25
LLM (GPT-4o)	~£0.01 per extraction	£10–£80 (volume dependent)
Database	Supabase (free tier)	£0–£20
Monitoring (LangSmith)	Developer plan	£0–£30

Running cost for a typical web research agent processing 500 pages/month: £30–£150/month. Build time: 2–4 weeks for a developer. Expect 4–8 weeks for a full production system with monitoring and error handling.

Want Us to Build It For You?

We build production-grade web research agents with proper error handling, monitoring, and CRM integration. Tell us what data you need to collect.

