How to Build an AI Agent That Browses the Web and Extracts Data
A practical guide to building a web research agent: the architecture, tools, code patterns, and the pitfalls that trip up most teams on their first build.
TL;DR
- A web-browsing agent needs 4 tools: search, browse, extract, save
- Use Playwright (not Requests) for JavaScript-rendered pages
- GPT-4o with structured output is a reliable choice for extraction
- Rate limiting and retry logic are non-negotiable for production
- Always store raw HTML alongside extracted data; you'll need it for debugging
- The hardest part is not the code; it's defining exactly what "good data" looks like
What a Web Research Agent Actually Does
A web research agent is an AI system that can receive a research goal, autonomously search the web, browse relevant pages, extract specific structured data, and save it somewhere useful, all without a human clicking anything.
Examples of what businesses use these for:
- Enriching a CRM with competitor pricing, company size, and decision-maker info
- Monitoring price changes across supplier or competitor websites
- Building lead lists from directories, LinkedIn, and industry sites
- Tracking job postings to understand competitor hiring signals
- Extracting product data from e-commerce sites for price comparison
- Collecting regulatory or compliance updates from government websites
The Architecture: Four Tools You Need
Tool 1: web_search
Returns a list of URLs and snippets for a query. The agent uses this to find the right pages to browse, rather than guessing URLs.
Implementation: Serper API (£0.001/query) or Bing Search API. Returns the top 10 results, each with title, URL, and snippet.
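A minimal sketch of this tool, using only the standard library. The endpoint URL, the `X-API-KEY` header, the `organic` response key, and the `SERPER_API_KEY` environment variable name are assumptions based on Serper's commonly documented interface; check them against the provider's current docs before relying on them.

```python
import json
import os
import urllib.request

# Assumed endpoint; confirm against the search provider's documentation.
SERPER_ENDPOINT = "https://google.serper.dev/search"


def parse_search_results(payload: dict, limit: int = 10) -> list[dict]:
    """Reduce a search API response to title/url/snippet dicts."""
    results = []
    for item in payload.get("organic", [])[:limit]:
        url = item.get("link")
        if not url:
            continue  # an entry without a URL is useless to browse_url
        results.append({
            "title": item.get("title", ""),
            "url": url,
            "snippet": item.get("snippet", ""),
        })
    return results


def web_search(query: str, limit: int = 10) -> list[dict]:
    """Send one query to the search API and return cleaned results."""
    request = urllib.request.Request(
        SERPER_ENDPOINT,
        data=json.dumps({"q": query}).encode("utf-8"),
        headers={
            "X-API-KEY": os.environ["SERPER_API_KEY"],  # hypothetical env var name
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        payload = json.load(response)
    return parse_search_results(payload, limit)
```

Keeping the parsing in its own function means the response handling can be unit-tested without a network call or an API key.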
Tool 2: browse_url
Fetches a URL and returns cleaned, readable text. This is the core browsing capability: it's what lets the agent "read" a web page.
Implementation: Playwright for JS-rendered pages; requests + BeautifulSoup for static HTML. Strip nav, footers, ads before returning.
Tool 3: extract_structured_data
Takes raw page text and extracts specific fields as structured JSON. This is where GPT-4o does its work: reading unstructured content and pulling out exactly the fields you defined.
Implementation: GPT-4o with Structured Outputs, passing a schema that defines the fields to extract. This is stricter than plain JSON mode, which guarantees valid JSON but not your field names or types. Pydantic models work well here.
Tool 4: save_data
Persists the extracted data to the destination: CRM, database, spreadsheet, or webhook. Includes validation against your schema before writing.
Implementation: Depends on the destination: HubSpot API, PostgreSQL via SQLAlchemy, Google Sheets API, or a webhook to n8n/Zapier.
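As an illustration of validate-then-write, here is a small sketch that uses SQLite from the standard library as a stand-in for whichever destination you choose. The table name, the `REQUIRED_FIELDS` tuple, and the `validate_record` helper are hypothetical; in a real pipeline the Pydantic model would normally do the validation. The sketch also stores the raw page text next to the extracted JSON, so a bad extraction can be debugged or re-run without re-crawling.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Illustrative subset of the schema's non-optional fields (an assumption).
REQUIRED_FIELDS = ("company_name", "website", "free_trial_available")


def validate_record(record: dict) -> None:
    """Raise ValueError if a required field is missing or empty."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")


def save_data(conn: sqlite3.Connection, record: dict,
              source_url: str, raw_text: str) -> int:
    """Validate, then store the extracted JSON next to the raw page text."""
    validate_record(record)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS extractions (
               id INTEGER PRIMARY KEY,
               source_url TEXT NOT NULL,
               extracted_json TEXT NOT NULL,
               raw_text TEXT NOT NULL,
               saved_at TEXT NOT NULL
           )"""
    )
    with conn:  # commits on success, rolls back if the insert fails
        cursor = conn.execute(
            "INSERT INTO extractions "
            "(source_url, extracted_json, raw_text, saved_at) "
            "VALUES (?, ?, ?, ?)",
            (source_url, json.dumps(record), raw_text,
             datetime.now(timezone.utc).isoformat()),
        )
    return cursor.lastrowid
```

Because validation runs before any SQL, a record that fails the check never reaches the table.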
Step-by-Step Build Guide
Step 1: Define the Data Schema First
Before writing any code, define exactly what data you want. Vague goals produce vague results.
from pydantic import BaseModel
from typing import Optional

class CompanyData(BaseModel):
    company_name: str
    website: str
    pricing_tier_starter: Optional[str]
    pricing_tier_pro: Optional[str]
    pricing_tier_enterprise: Optional[str]
    free_trial_available: bool
    last_updated: str  # ISO date
Step 2: Build the browse_url Tool
Use Playwright for reliable rendering of modern JavaScript-heavy sites:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def browse_url(url: str) -> str:
    """Fetch and return readable text from a web page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        html = page.content()
        browser.close()
    soup = BeautifulSoup(html, "html.parser")
    # Remove noise
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    # Cap the text at 8,000 characters to keep the LLM prompt small
    return soup.get_text(separator="\n", strip=True)[:8000]
Step 3: Build the Extraction Tool
Use GPT-4o's structured output to reliably extract your schema:
from openai import OpenAI

client = OpenAI()

def extract_structured_data(
    page_text: str,
    company_name: str
) -> CompanyData:
    """Extract structured company data from page text."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
                "Extract pricing information from the text. "
                "Use null for fields not found."},
            {"role": "user", "content":
                f"Company: {company_name}\n\n{page_text}"}
        ],
        response_format=CompanyData
    )
    return response.choices[0].message.parsed
Step 4: Add Rate Limiting and Error Handling
Production agents fail without this. Most websites block rapid requests, and API calls fail intermittently:
import time
import random
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, min=2, max=10))
def browse_url_with_retry(url: str) -> str:
    time.sleep(random.uniform(1, 3))  # polite delay
    return browse_url(url)
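With the four tools in place, the loop that ties them together might look like the sketch below. The function name `run_research`, its parameters, and the shape of the returned summary are illustrative assumptions, not part of any library. The tools are passed in as callables so the loop can be exercised with stubs before any real browsing or API calls happen.

```python
from typing import Callable


def run_research(
    query: str,
    search: Callable[[str], list[dict]],
    browse: Callable[[str], str],
    extract: Callable[[str], dict],
    save: Callable[[dict, str, str], None],
    max_pages: int = 5,
) -> dict:
    """Search, then browse, extract, and save each result; collect failures."""
    summary = {"saved": [], "failed": []}
    for result in search(query)[:max_pages]:
        url = result["url"]
        try:
            page_text = browse(url)
            record = extract(page_text)
            save(record, url, page_text)  # raw text is kept with the record
            summary["saved"].append(url)
        except Exception as exc:  # one bad page should not stop the run
            summary["failed"].append((url, str(exc)))
    return summary
```

In a real run, `browse` would be `browse_url_with_retry`, and `extract` could be a small lambda that supplies the company name to `extract_structured_data`. Recording failures in the summary, rather than swallowing them, is one way to keep a blocked or broken page from failing silently.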
Common Pitfalls and How to Avoid Them
| Pitfall | What Happens | Fix |
|---|---|---|
| Using requests for JS pages | Gets empty/broken content | Use Playwright or Selenium |
| No rate limiting | IP gets blocked, agent fails silently | Random 1–3s delays, retry logic |
| Passing raw HTML to LLM | Fills context window with noise | Strip HTML, limit to 8k chars |
| No schema validation | Hallucinated or missing data saved to CRM | Pydantic models, GPT-4o structured output |
| No raw data storage | Can't debug or re-extract without re-crawling | Store raw text alongside extracted JSON |
| Ignoring robots.txt | Legal and reputational risk | Check robots.txt; use official APIs where available |
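For the robots.txt row, Python's standard library already includes a parser. The sketch below checks a URL against robots.txt text that has already been fetched; the `is_allowed` helper and the `research-agent` user-agent string are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-agent"  # hypothetical user-agent string


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/pricing"))
print(is_allowed(rules, "https://example.com/private/data"))
```

In a live agent, `RobotFileParser.set_url()` followed by `read()` can fetch the file directly. Caching the parsed rules per domain avoids re-downloading robots.txt for every page.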
Infrastructure and Cost
| Component | Option | Monthly Cost |
|---|---|---|
| Compute (Playwright) | Cloud Run or £15/mo VPS | £10–£30 |
| Search API | Serper (1000 queries/mo free) | £0–£25 |
| LLM (GPT-4o) | ~£0.01 per extraction | £10–£80 (volume dependent) |
| Database | Supabase (free tier) | £0–£20 |
| Monitoring (LangSmith) | Developer plan | £0–£30 |
Running cost for a typical web research agent processing 500 pages/month: £30–£150/month. Build time: 2–4 weeks for a developer; 4–8 weeks for a full production system with monitoring and error handling.