How do I scrape my website content for a RAG chatbot?

The simplest approach is to use Python with BeautifulSoup to crawl your sitemap, extract page text, and clean the HTML. For JavaScript-heavy sites (SPAs built in React/Vue), use Playwright to render pages before scraping. Export the raw text to JSON or markdown files that can then be chunked and embedded.

Which vector database should I use for my website chatbot?

For most website chatbots, ChromaDB is the best starting point — it is free, self-hosted, and requires no external accounts. For production deployments needing managed infrastructure and high availability, Pinecone is the most popular choice. If you already use PostgreSQL, pgvector is an excellent option that avoids adding a new database system.

How do I keep my chatbot's knowledge base updated when my website changes?

Set up a scheduled job (cron) to re-scrape and re-embed changed pages on a regular cadence — daily or weekly depending on how frequently your content changes. Use content hashing to detect which pages have changed and only re-embed those, rather than rebuilding the entire index each time. Store the last-scraped timestamp per URL to manage incremental updates efficiently.


How to Train a Chatbot on Your Website Content Using RAG

Last updated: 2026-05-23

Your website already contains the answers your customers are looking for. This guide shows you how to turn that content into a searchable knowledge base that powers an AI chatbot — covering every technical step from scraping to production deployment.

23 May 2026 By SpiderHunts Technologies 17 min read

TL;DR

RAG (Retrieval-Augmented Generation) is the right architecture for website-content chatbots. Scrape your site with BeautifulSoup, chunk the text, create embeddings with OpenAI, store them in ChromaDB or Pinecone, and retrieve relevant chunks at query time to pass to an LLM. This article covers every step with working Python code.

Why RAG Is Better Than Fine-Tuning for Website Content

Fine-tuning an LLM on your website content might seem intuitive — if you train the model on your data, it will know your data. But in practice, fine-tuning has critical drawbacks for this use case: it is expensive (requires GPU compute), it takes hours to days to complete, and — most importantly — you need to redo it every time your website content changes.

RAG solves all three problems. It keeps the base LLM unchanged (so no retraining cost) and instead maintains a searchable database of your content that can be updated in minutes when pages change. At query time, a retriever pulls the most relevant chunks from that database and passes them to the LLM, so answers always draw on the most current version of your knowledge base.

Step 1 — Scrape Your Website Content

The first step is extracting the text content from your website pages. For static or server-rendered HTML, BeautifulSoup is the standard tool. For pages whose content is rendered in the browser (client-side React or Vue apps, or Next.js pages that fetch their content after load), you need Playwright or Selenium to render the page before extraction.

import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

def scrape_sitemap(sitemap_url: str) -> list[dict]:
    """Extract all URLs from a sitemap and scrape their content."""
    # Parse the sitemap (assumes a flat <urlset>, not a sitemap index)
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]

    pages = []
    for url in urls:
        try:
            page_resp = requests.get(url, timeout=10)
            page_resp.raise_for_status()  # Treat 4xx/5xx responses as failures
            soup = BeautifulSoup(page_resp.text, "html.parser")

            # Remove navigation, header, footer, scripts, and styles
            for tag in soup(["nav", "footer", "script", "style", "header"]):
                tag.decompose()

            # Extract clean text
            text = soup.get_text(separator="\n", strip=True)
            title = soup.title.get_text(strip=True) if soup.title else url

            if len(text) > 100:  # Skip near-empty pages
                pages.append({
                    "url": url,
                    "title": title,
                    "content": text
                })
        except requests.RequestException as e:
            print(f"Failed to scrape {url}: {e}")

    return pages

pages = scrape_sitemap("https://yoursite.com/sitemap.xml")
print(f"Scraped {len(pages)} pages")
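For JavaScript-rendered pages, a plain HTTP request only returns the empty application shell. The following is a minimal sketch of rendering pages with Playwright first, assuming `pip install playwright` and `playwright install chromium` have been run. The function name `render_pages` and the 15-second timeout are illustrative choices, not part of the pipeline above.

```python
def render_pages(urls: list[str], timeout_ms: int = 15_000) -> dict[str, str]:
    """Return {url: rendered HTML} using a headless Chromium browser."""
    if not urls:
        return {}  # Nothing to render; avoid launching a browser

    # Imported lazily so the rest of the pipeline works without Playwright installed
    from playwright.sync_api import sync_playwright

    html_by_url: dict[str, str] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()  # One page object reused for every URL
        for url in urls:
            try:
                # "networkidle" waits until network activity has settled
                page.goto(url, wait_until="networkidle", timeout=timeout_ms)
                html_by_url[url] = page.content()
            except Exception as exc:  # Timeouts and navigation errors
                print(f"Failed to render {url}: {exc}")
        browser.close()
    return html_by_url
```

The rendered HTML can then be passed to `BeautifulSoup(html, "html.parser")` in place of `page_resp.text`, so the cleaning logic stays the same.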

Step 2 — Chunk the Text

LLMs have limited context windows, and large documents must be split into smaller chunks for retrieval. The chunking strategy significantly affects retrieval quality.

Fixed-size chunking — Split text every N tokens (e.g., 500), with overlap (e.g., 50 tokens) between adjacent chunks. Simple and fast, but can split sentences mid-thought.
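As a rough illustration of fixed-size chunking with overlap, the sketch below uses whitespace-separated words as a stand-in for tokens (an assumption made to keep the example dependency-free; a real tokenizer would count differently). The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # How far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # This window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(fixed_size_chunks(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk repeats the last word of the previous one, which is what keeps a sentence that straddles a boundary at least partly visible in both chunks.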

Semantic chunking — Split at paragraph or sentence boundaries, keeping semantically related content together. More complex but produces better retrieval results. LangChain's RecursiveCharacterTextSplitter is the most widely used implementation.

# pip install langchain-text-splitters
# (older LangChain releases expose the same class as langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,    # characters per chunk
    chunk_overlap=80,  # overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)

def chunk_pages(pages: list[dict]) -> list[dict]:
    """Split each page into chunks and keep the source URL and title with each one."""
    chunks = []
    for page in pages:
        splits = splitter.split_text(page["content"])
        for i, chunk_text in enumerate(splits):
            chunks.append({
                "id": f"{page['url']}#chunk{i}",
                "text": chunk_text,
                "source_url": page["url"],
                "source_title": page["title"]
            })
    return chunks

chunks = chunk_pages(pages)
print(f"Created {len(chunks)} chunks from {len(pages)} pages")

Step 3 — Create Embeddings

An embedding converts a piece of text into a numerical vector that captures its semantic meaning. Use OpenAI's text-embedding-3-small for most use cases (1536 dimensions, very affordable) or text-embedding-3-large for complex technical domains where higher semantic precision is needed.
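If you want to create the embeddings yourself rather than letting the vector database do it, one possible shape is sketched below. It assumes the `openai` package is installed and `OPENAI_API_KEY` is set in the environment; the helper names `make_openai_embedder` and `embed_in_batches` and the batch size of 100 are illustrative.

```python
from typing import Callable, Sequence

EmbedFn = Callable[[Sequence[str]], list[list[float]]]

def make_openai_embedder(model: str = "text-embedding-3-small") -> EmbedFn:
    """Build an embed function backed by the OpenAI embeddings endpoint."""
    from openai import OpenAI  # Lazy import: only needed when calling the API

    client = OpenAI()  # Reads OPENAI_API_KEY from the environment

    def embed(batch: Sequence[str]) -> list[list[float]]:
        resp = client.embeddings.create(model=model, input=list(batch))
        # Sort by index so vectors line up with the input order
        return [item.embedding for item in sorted(resp.data, key=lambda d: d.index)]

    return embed

def embed_in_batches(texts: Sequence[str], embed_fn: EmbedFn,
                     batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to keep each request small."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors
```

A call such as `embed_in_batches([c["text"] for c in chunks], make_openai_embedder())` would return one vector per chunk. Passing the embed function in as a parameter also makes it easy to swap in a different provider or a fake for testing.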

Step 4 — Store in Vector Database

Choose your vector database based on your deployment requirements:

Vector Database Comparison for Website Chatbots — 2026
Database	Type	Cost	Best For	Scalability
ChromaDB	Self-hosted	Free	Development, small deployments	Up to ~100k vectors
Pinecone	Managed cloud	$0–$70+/month	Production, any scale	Billions of vectors
Weaviate	Self-hosted / Cloud	Free / Pay-as-use	Hybrid search (vector + keyword)	Very high
pgvector	PostgreSQL extension	Hosting only	Existing PostgreSQL users	Up to ~1M vectors efficiently
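Whichever database you choose, the core operation is the same: given a query vector, return the stored vectors closest to it. The sketch below shows that operation as an exact, brute-force cosine search in plain Python. It is only meant to illustrate the idea; real vector databases typically use approximate nearest-neighbor indexes (HNSW, for example) to stay fast at scale. The function names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query: list[float], vectors: dict[str, list[float]],
          k: int = 5) -> list[tuple[str, float]]:
    """Return the k (id, similarity) pairs most similar to the query, best first."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([vec_id for vec_id, _ in top_k([1.0, 0.0], store, k=2)])
# ['a', 'c']
```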

Step 5 — Full RAG Pipeline: Query to Response

import os

import openai
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Read the key from the environment instead of hard-coding it
api_key = os.environ["OPENAI_API_KEY"]

client = openai.OpenAI(api_key=api_key)
# In-memory client; use chromadb.PersistentClient(path="./chroma_db") to keep the index on disk
chroma = chromadb.Client()
ef = OpenAIEmbeddingFunction(api_key=api_key, model_name="text-embedding-3-small")
collection = chroma.get_or_create_collection(
    "website_content",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}  # cosine distance (the default is squared L2)
)

# Ingest chunks (run once / on schedule)
def ingest(chunks: list[dict]):
    # upsert overwrites existing IDs, so scheduled re-runs do not fail on duplicates
    collection.upsert(
        ids=[c["id"] for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[{"url": c["source_url"], "title": c["source_title"]} for c in chunks]
    )

# RAG query function
def ask(question: str, top_k: int = 5) -> str:
    # Retrieve the top_k chunks most similar to the question
    results = collection.query(query_texts=[question], n_results=top_k)
    docs = results["documents"][0]
    sources = [m["url"] for m in results["metadatas"][0]]

    # Label each chunk with its source URL so the model can attribute answers
    context = "\n\n---\n\n".join(
        f"[Source: {src}]\n{doc}" for doc, src in zip(docs, sources)
    )

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful website assistant. Answer questions using ONLY "
                "the provided context. If the answer is not in the context, say "
                "'I couldn't find that information on our website.' "
                "Always be concise and accurate."
            )
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }
    ]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1
    )
    return response.choices[0].message.content

# Example
answer = ask("What services do you offer?")
print(answer)

Handling No-Match Cases

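When a user asks something that is not in your knowledge base, the chatbot must handle it gracefully rather than hallucinating an answer. Implement a confidence threshold: if the top retrieval result has a similarity score below a cutoff such as 0.7 (on a 0–1 scale), respond with a fallback message like "I don't have information on that on our website. Please contact us at support@yourcompany.com." Treat 0.7 as a starting point and tune it against real queries, because typical similarity scores vary between embedding models.

You can read the score from the distances field of ChromaDB's query result, where a lower distance means a closer match. ChromaDB collections default to squared L2 distance, so create the collection with metadata={"hnsw:space": "cosine"} if you want cosine distance; cosine similarity is then 1 - distance, and a 0.7 similarity cutoff corresponds to a distance of 0.3. For a Pinecone index that uses the cosine metric, the score field returns cosine similarity directly (higher = more similar).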
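One way to apply such a threshold is sketched below. It assumes the collection uses cosine distance, so that similarity can be computed as 1 - distance, and it works on a dictionary shaped like ChromaDB's query result. The function name `filter_by_similarity` and the fallback constant are hypothetical.

```python
FALLBACK_MESSAGE = "I couldn't find that information on our website."

def filter_by_similarity(results: dict, min_similarity: float = 0.7) -> list[tuple[str, float]]:
    """Keep (document, similarity) pairs whose similarity clears the threshold.

    Assumes cosine distance, where similarity = 1 - distance.
    """
    docs = results["documents"][0]       # Results for the first (only) query
    distances = results["distances"][0]
    kept = []
    for doc, distance in zip(docs, distances):
        similarity = 1.0 - distance
        if similarity >= min_similarity:
            kept.append((doc, similarity))
    return kept

def answer_or_fallback(results: dict, min_similarity: float = 0.7) -> str:
    """Return the best matching document, or the fallback if nothing is close enough."""
    kept = filter_by_similarity(results, min_similarity)
    if not kept:
        return FALLBACK_MESSAGE
    return max(kept, key=lambda pair: pair[1])[0]
```

In a full pipeline, the check would run before the LLM call: if no chunk clears the threshold, return the fallback message directly and skip the request to the model.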

Keeping the Knowledge Base Updated

Website content changes. Prices update, policies change, new pages are added. Your chatbot needs to reflect these changes promptly. The recommended approach:

Run a daily cron job that re-scrapes all pages in your sitemap
Hash each page's content and compare to the previously stored hash
For changed or new pages: delete old chunks from the vector DB, re-chunk and re-embed
For deleted pages: remove all associated chunks from the vector DB
Log all updates with timestamps for auditability
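The hash comparison in the steps above can be sketched as follows. It assumes pages have the same shape as the scraper output (`url` and `content` keys) and that the previous hashes are kept in a dictionary keyed by URL, for example loaded from a JSON file. The function names `content_hash` and `plan_updates` are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(pages: list[dict], stored_hashes: dict[str, str]):
    """Compare a fresh scrape against stored hashes.

    Returns (changed_or_new_urls, deleted_urls, new_hashes).
    """
    new_hashes = {page["url"]: content_hash(page["content"]) for page in pages}
    # A URL needs re-embedding if it is new or its hash differs from the stored one
    changed = [url for url, digest in new_hashes.items()
               if stored_hashes.get(url) != digest]
    # A URL that was stored before but is missing from this scrape was removed
    deleted = [url for url in stored_hashes if url not in new_hashes]
    return changed, deleted, new_hashes
```

For each URL in `changed` or `deleted`, the old chunks can be removed with a metadata filter such as `collection.delete(where={"url": url})` before the changed pages are re-chunked and re-ingested. Note that a page whose scrape failed would be missing from `pages` and so would appear as deleted; you may want to exclude failed URLs before acting on the `deleted` list.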

Want a Website Chatbot Built for You?

We build production-ready website chatbots using RAG — trained on your content, deployed on your infrastructure, and updated automatically as your site changes. Get a scoped quote in 24 hours.
