How to Train a Chatbot on Your Website Content Using RAG
Your website already contains the answers your customers are looking for. This guide shows you how to turn that content into a searchable knowledge base that powers an AI chatbot — covering every technical step from scraping to production deployment.
RAG (Retrieval-Augmented Generation) is the right architecture for website-content chatbots. Scrape your site with BeautifulSoup, chunk the text, create embeddings with OpenAI, store them in ChromaDB or Pinecone, and retrieve relevant chunks at query time to pass to an LLM. This article covers every step with working Python code.
Why RAG Is Better Than Fine-Tuning for Website Content
Fine-tuning an LLM on your website content might seem intuitive — if you train the model on your data, it will know your data. But in practice, fine-tuning has critical drawbacks for this use case: it is expensive (requires GPU compute), it takes hours to days to complete, and — most importantly — you need to redo it every time your website content changes.
RAG addresses all three problems. It leaves the base LLM unchanged (so there is no retraining cost) and instead maintains a searchable database of your content that can be updated in minutes when pages change. At query time, the application retrieves the most relevant chunks and passes them to the LLM, so answers draw on the current version of your knowledge base.
Step 1 — Scrape Your Website Content
The first step is extracting the text content from your website pages. For static HTML sites, BeautifulSoup is the standard tool. For JavaScript-rendered sites (React, Vue, Next.js), you need Playwright or Selenium to render the page before extraction.
import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

def scrape_sitemap(sitemap_url: str) -> list[dict]:
    """Extract all URLs from a sitemap and scrape their content."""
    # Parse the sitemap (assumes a plain URL sitemap, not a sitemap index)
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]

    pages = []
    for url in urls:
        try:
            page_resp = requests.get(url, timeout=10)
            page_resp.raise_for_status()  # Treat 404s and server errors as failures
            soup = BeautifulSoup(page_resp.text, "html.parser")
            # Remove nav, footer, scripts, and other non-content elements
            for tag in soup(["nav", "footer", "script", "style", "header"]):
                tag.decompose()
            # Extract clean text
            text = soup.get_text(separator="\n", strip=True)
            # soup.title.string can be None, so fall back to the URL
            title = soup.title.string.strip() if soup.title and soup.title.string else url
            if len(text) > 100:  # Skip near-empty pages
                pages.append({
                    "url": url,
                    "title": title,
                    "content": text
                })
        except Exception as e:
            print(f"Failed to scrape {url}: {e}")
    return pages

pages = scrape_sitemap("https://yoursite.com/sitemap.xml")
print(f"Scraped {len(pages)} pages")
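For JavaScript-rendered pages, the static HTML often contains little more than an empty mount point, so the scraper above would skip them as near-empty. The sketch below shows one possible way to detect that case and fall back to Playwright. The helper names (`needs_rendering`, `fetch_rendered_html`) and the 200-character cutoff are illustrative assumptions, not part of any library, and the Playwright function assumes `pip install playwright` followed by `playwright install chromium`.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <noscript>."""

    _SKIPPED = ("script", "style", "noscript")

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def needs_rendering(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic (an assumption): very little visible text suggests a JS-rendered shell."""
    parser = _VisibleText()
    parser.feed(html)
    return len(" ".join(parser.parts)) < min_text_chars


def fetch_rendered_html(url: str, timeout_ms: int = 15000) -> str:
    """Render the page in headless Chromium and return the final HTML."""
    # Imported lazily so the heuristic above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity settles
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

In the scraping loop, you could call `needs_rendering(page_resp.text)` and, when it returns `True`, pass the result of `fetch_rendered_html(url)` to BeautifulSoup instead. Rendering is much slower than a plain HTTP request, so it is worth reserving for pages that need it.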
Step 2 — Chunk the Text
LLMs have limited context windows, and large documents must be split into smaller chunks for retrieval. The chunking strategy significantly affects retrieval quality.
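Fixed-size chunking: Split text every N tokens (e.g., 500), with overlap (e.g., 50 tokens) between adjacent chunks. Simple and fast, but can split sentences mid-thought.
Semantic chunking: Split at paragraph or sentence boundaries, keeping semantically related content together. More complex, but it usually produces better retrieval results. LangChain's RecursiveCharacterTextSplitter is a widely used approximation of this approach: it tries each separator in order (paragraph breaks, line breaks, sentence ends, spaces) until the chunks fit the size limit. Note that its chunk_size is measured in characters by default, not tokens.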
# Requires the langchain-text-splitters package; older LangChain releases
# exposed the same class as langchain.text_splitter.RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,    # characters per chunk
    chunk_overlap=80,  # overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)

def chunk_pages(pages: list[dict]) -> list[dict]:
    """Split each page into chunks that keep a link back to their source."""
    chunks = []
    for page in pages:
        splits = splitter.split_text(page["content"])
        for i, chunk_text in enumerate(splits):
            chunks.append({
                "id": f"{page['url']}#chunk{i}",  # Stable ID: same page and position give the same ID
                "text": chunk_text,
                "source_url": page["url"],
                "source_title": page["title"]
            })
    return chunks

chunks = chunk_pages(pages)
print(f"Created {len(chunks)} chunks from {len(pages)} pages")
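For comparison, the fixed-size strategy described earlier can be sketched with the standard library alone. This is a minimal illustration that assumes whitespace-separated words are an acceptable stand-in for tokens; real token counts depend on the model's tokenizer. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()  # Assumption: words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(12))
print(fixed_size_chunks(sample, chunk_size=5, overlap=2))
# → ['0 1 2 3 4', '3 4 5 6 7', '6 7 8 9 10', '9 10 11']
```

Each chunk repeats the last two words of the previous one, which is what lets a sentence cut at a boundary still appear whole in at least one chunk when the overlap is large enough.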
Step 3 — Create Embeddings
An embedding converts a piece of text into a numerical vector that captures its semantic meaning. Use OpenAI's text-embedding-3-small for most use cases (1536 dimensions, low cost) or text-embedding-3-large (3072 dimensions) for complex technical domains where higher semantic precision is needed. Use the same model for indexing and for queries, because vectors produced by different models are not comparable.
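If you want to create the embeddings yourself rather than let the vector database do it, one possible approach is to send chunks in batches. The sketch below assumes the official `openai` Python package and an `OPENAI_API_KEY` environment variable; `embed_in_batches` and `openai_embed_batch` are illustrative names, and the batch size of 100 is an arbitrary choice.

```python
from typing import Callable, Sequence

Vector = list[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 100,
) -> list[Vector]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedding backend returned the wrong number of vectors")
        vectors.extend(batch_vectors)
    return vectors


def openai_embed_batch(batch: list[str]) -> list[Vector]:
    """Embed one batch with text-embedding-3-small (requires network access)."""
    # Imported lazily so embed_in_batches can be used with any backend.
    from openai import OpenAI

    client = OpenAI()  # Reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    return [item.embedding for item in response.data]
```

With the chunks from Step 2, usage would look like `vectors = embed_in_batches([c["text"] for c in chunks], openai_embed_batch)`. Passing the backend as a function keeps the batching logic testable without API calls.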
Step 4 — Store in Vector Database
Choose your vector database based on your deployment requirements:
| Database | Type | Cost | Best For | Scalability |
|---|---|---|---|---|
| ChromaDB | Self-hosted | Free | Development, small deployments | Up to ~100k vectors |
| Pinecone | Managed cloud | $0–$70+/month | Production, any scale | Billions of vectors |
| Weaviate | Self-hosted / Cloud | Free / Pay-as-you-go | Hybrid search (vector + keyword) | Very high |
| pgvector | PostgreSQL extension | Hosting only | Existing PostgreSQL users | Up to ~1M vectors efficiently |
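Whichever database you pick, the core operation is the same: store vectors with IDs and return the ones closest to a query vector. The brute-force sketch below illustrates that idea in plain Python with hand-made two-dimensional vectors (not real embeddings). `TinyVectorStore` is a hypothetical teaching aid, not a replacement for any of the databases above, which add indexing, persistence, and filtering.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """In-memory store that scans every vector on each query."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        self._rows.append((item_id, vector))

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        """Return the top_k (id, similarity) pairs, most similar first."""
        scored = [(item_id, cosine_similarity(vector, v)) for item_id, v in self._rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("pricing", [1.0, 0.0])
store.add("refunds", [0.0, 1.0])
store.add("plans", [0.9, 0.1])
print([item_id for item_id, _ in store.query([1.0, 0.0], top_k=2)])
# → ['pricing', 'plans']
```

A full scan like this costs time proportional to the number of stored vectors per query, which is why dedicated vector databases use approximate nearest-neighbor indexes at larger scales.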
Step 5 — Full RAG Pipeline: Query to Response
import os
import openai
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

api_key = os.environ["OPENAI_API_KEY"]  # Keep keys out of source code
client = openai.OpenAI(api_key=api_key)
# PersistentClient saves the index to disk; chromadb.Client() is in-memory only
chroma = chromadb.PersistentClient(path="./chroma_db")
ef = OpenAIEmbeddingFunction(api_key=api_key, model_name="text-embedding-3-small")
collection = chroma.get_or_create_collection(
    "website_content",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},  # Chroma defaults to squared L2 distance
)

# Ingest chunks (run once / on schedule)
def ingest(chunks: list[dict]):
    # upsert overwrites chunks whose IDs already exist, so re-runs do not fail
    collection.upsert(
        ids=[c["id"] for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[{"url": c["source_url"], "title": c["source_title"]} for c in chunks]
    )

# RAG query function
def ask(question: str, top_k: int = 5) -> str:
    # Retrieve the chunks closest to the question
    results = collection.query(query_texts=[question], n_results=top_k)
    docs = results["documents"][0]
    sources = [m["url"] for m in results["metadatas"][0]]
    # Label each chunk with its source URL so the model can ground its answer
    context = "\n\n---\n\n".join(
        f"[Source: {src}]\n{doc}" for doc, src in zip(docs, sources)
    )
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful website assistant. Answer questions using ONLY "
                "the provided context. If the answer is not in the context, say "
                "'I couldn't find that information on our website.' "
                "Always be concise and accurate."
            )
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1
    )
    return response.choices[0].message.content

# Example
answer = ask("What services do you offer?")
print(answer)
Handling No-Match Cases
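When a user asks something that is not in your knowledge base, the chatbot must handle it gracefully rather than hallucinating an answer. Implement a confidence threshold: if the top retrieval result has a cosine similarity below a cutoff (for example 0.7 on a 0–1 scale; treat this as a starting point and tune it against real queries, because typical scores vary by embedding model), respond with a fallback message like "I don't have information on that on our website. Please contact us at support@yourcompany.com."
ChromaDB's query result exposes a distances field rather than similarities, so lower values mean closer matches. Chroma uses squared L2 distance by default; if the collection is created with cosine space (metadata {"hnsw:space": "cosine"}), the distance equals 1 minus the cosine similarity, so a 0.7 similarity cutoff corresponds to a distance of at most 0.3. For Pinecone, on an index that uses the cosine metric, the score field returns cosine similarity directly (higher = more similar).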
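The threshold check can be sketched as a small filter over a Chroma-style query result. This assumes the collection uses cosine space, so that similarity is 1 minus distance; the function name `filter_by_similarity` and the `FALLBACK_MESSAGE` constant are illustrative, and the sample result below is hand-written rather than real query output.

```python
FALLBACK_MESSAGE = (
    "I don't have information on that on our website. "
    "Please contact us at support@yourcompany.com."
)


def filter_by_similarity(results: dict, min_similarity: float = 0.7) -> list[tuple[str, float]]:
    """Keep (document, similarity) pairs at or above the cutoff.

    Assumes `results` has the shape returned by a Chroma query for a single
    question, and that distances are cosine distances (1 - cosine similarity).
    """
    docs = results["documents"][0]
    distances = results["distances"][0]
    kept = []
    for doc, distance in zip(docs, distances):
        similarity = 1.0 - distance
        if similarity >= min_similarity:
            kept.append((doc, similarity))
    return kept


# Hand-written example in the shape of a single-question query result
example = {
    "documents": [["Our plans start at ...", "Unrelated blog post ..."]],
    "distances": [[0.18, 0.55]],
}
relevant = filter_by_similarity(example)
if not relevant:
    print(FALLBACK_MESSAGE)
else:
    print(f"{len(relevant)} chunk(s) passed the threshold")
```

Inside a function like `ask`, returning the fallback message before calling the LLM when nothing passes the filter also saves the cost of a completion request.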
Keeping the Knowledge Base Updated
Website content changes. Prices update, policies change, new pages are added. Your chatbot needs to reflect these changes promptly. The recommended approach:
- Run a daily cron job that re-scrapes all pages in your sitemap
- Hash each page's content and compare to the previously stored hash
- For changed or new pages: delete old chunks from the vector DB, re-chunk and re-embed
- For deleted pages: remove all associated chunks from the vector DB
- Log all updates with timestamps for auditability
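The hash-and-compare steps above can be sketched as follows. The sketch assumes pages have the `url` and `content` keys produced by the scraper in Step 1 and that you persist a URL-to-hash mapping between runs (for example in a JSON file); `content_hash` and `plan_updates` are hypothetical helper names.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the page text, used to detect content changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(pages: list[dict], stored_hashes: dict[str, str]) -> dict:
    """Compare freshly scraped pages against the hashes from the last run."""
    current = {page["url"]: content_hash(page["content"]) for page in pages}
    return {
        # Pages seen for the first time: chunk and embed
        "new": sorted(url for url in current if url not in stored_hashes),
        # Pages whose text changed: delete old chunks, then re-chunk and re-embed
        "changed": sorted(
            url for url, digest in current.items()
            if url in stored_hashes and stored_hashes[url] != digest
        ),
        # Pages no longer in the scrape: remove their chunks
        "deleted": sorted(url for url in stored_hashes if url not in current),
        # Mapping to persist for the next run
        "hashes": current,
    }
```

With the ChromaDB collection from Step 5, old chunks for a changed or deleted URL could be removed with `collection.delete(where={"url": url})`, since each chunk stores its source URL in metadata. One design caveat: a page that fails to scrape on a given run is absent from `pages` and would be reported as deleted, so it may be safer to remove a page only after it has been missing for several consecutive runs.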