
Scrapy vs Playwright vs Selenium: Python Web Scraping Tools Compared

By SpiderHunts Technologies  ·  May 30, 2026  ·  13 min read

TL;DR

Picking the wrong Python scraping tool can multiply project costs fivefold. Scrapy wins for high-volume static crawls (1000-4000 pages per minute), Playwright wins for modern JavaScript sites and anti-bot bypass, Selenium remains the safe legacy choice with the broadest browser support, and BeautifulSoup plus Requests handles quick one-off parsing. This guide gives you code samples, performance benchmarks, anti-detection techniques, and a clear decision framework drawn from SpiderHunts engineering experience.

Every serious Python web scraping project starts with the same decision: which tool to use. Get it right and the project ships in a fraction of the time. Get it wrong and you spend weeks fighting your own framework instead of the target website.

The four candidates that dominate Python scraping in 2026 are Scrapy, Playwright, Selenium, and the BeautifulSoup-plus-Requests combination. Each excels in a different regime - and the wrong choice for the regime will cost you. This article is the practical comparison the SpiderHunts engineering team wishes they could send to every prospective client.

We will cover what each tool actually does, where its sweet spot lies, real benchmark numbers, anti-bot capability in 2026, code samples, and a clear decision tree. By the end you should be able to confidently pick the right tool for any scraping project.

Why Tool Choice Matters More Than Most Teams Realise

A poorly matched tool costs you in three ways. First, raw throughput - if you use Selenium for what Scrapy could do, your crawl takes roughly 20 times longer (140 versus 2800 pages per minute in our static-site benchmark). Second, infrastructure - Playwright runs a full Chromium instance per worker and consumes roughly 200 MB of RAM each, versus Scrapy's lean 40 MB. Third, reliability - the wrong tool against an aggressive anti-bot stack will see your crawl die in the first hour and never recover.

Conversely, the right tool gives compounding returns. A correctly chosen and configured Scrapy crawl can run for months on a single small VPS, quietly producing millions of clean records. A well-tuned Playwright pipeline can defeat anti-bot stacks that look impenetrable to less experienced teams. Tool selection is where senior engineers earn their keep.

Scrapy Deep Dive

Scrapy is a full crawling framework, not just a library. Built on the Twisted async networking engine, it has been the gold standard for high-volume Python crawling since 2008. The 2.x release line added asyncio integration (coroutine callbacks and an asyncio-based reactor) and HTTP/2 support.

What Scrapy Gets Right

  • Speed: Async concurrency, configurable from a handful to several thousand simultaneous requests.
  • Built-in scheduling: Politeness controls, automatic deduplication of URLs across the crawl, request prioritisation.
  • Item pipelines: Clean separation of crawl, parse, validate, and persist stages.
  • Middlewares: Hook in proxy rotation, user-agent rotation, headers, retry logic, and custom auth without touching spider code.
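  • Selectors: Native XPath and CSS selectors backed by parsel, which builds on lxml and is typically faster than BeautifulSoup with its default parser.
  • Resume and persist state: An interrupted crawl can resume from where it stopped by pointing the JOBDIR setting at a job directory (for example, -s JOBDIR=crawls/run1 on the command line).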
  • Ecosystem: Scrapy Cloud, scrapy-splash, scrapy-playwright integration, scrapyd for deployment.
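
To make the item-pipeline idea concrete, here is a minimal sketch of one pipeline stage that normalises price strings before they are persisted. The class name PriceCleanPipeline and the assumed price format ("$1,299.00") are illustrative, not taken from a real project, and the fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
import re

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class PriceCleanPipeline:
    """Hypothetical pipeline stage: turn '$1,299.00' into 1299.0."""

    def process_item(self, item, spider):
        raw = item.get("price") or ""
        # Keep digits and the decimal point; drop currency symbols and commas.
        cleaned = re.sub(r"[^\d.]", "", raw)
        try:
            item["price"] = float(cleaned)
        except ValueError:
            # Empty or malformed price: discard the item instead of storing bad data.
            raise DropItem(f"unusable price {raw!r}")
        return item
```

In a Scrapy project a stage like this would be enabled through the ITEM_PIPELINES setting, for example {"myproject.pipelines.PriceCleanPipeline": 300}, where the module path is a placeholder.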

Where Scrapy Falls Short

  • JavaScript: Scrapy does not execute JS. You must pair it with scrapy-playwright or detect underlying JSON endpoints.
  • Learning curve: The spider/item/pipeline/middleware model demands more up-front concepts than the simpler libraries.
  • Anti-bot evasion alone: Without a paired browser or commercial unblocker service, Scrapy cannot pass JS challenges like Cloudflare's Turnstile.
  • Awkward debugging: Twisted stack traces can be intimidating for newer developers.
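
For the JavaScript limitation, the usual pairing is scrapy-playwright. The fragment below sketches the settings that plugin documents, plus a small hypothetical helper (render_meta) that opts individual requests into browser rendering so that only JS pages pay the Chromium cost. Treat it as a starting point under those assumptions rather than a complete configuration.

```python
# settings.py (fragment): route downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


# Spider-side helper (hypothetical name): build the per-request meta dict.
def render_meta(needs_js: bool) -> dict:
    """Return request meta; only JS pages are sent through the browser."""
    return {"playwright": True} if needs_js else {}
```

Inside a spider this would be used as scrapy.Request(url, meta=render_meta(True)) for pages that need rendering, while plain requests keep the fast HTTP path.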

Scrapy Code Sample

A minimal Scrapy spider extracting product names and prices:

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/category/widgets"]
    custom_settings = {
        "DOWNLOAD_DELAY": 2,  # seconds between requests to the same site
        "CONCURRENT_REQUESTS": 8,
        "USER_AGENT": "Mozilla/5.0 (compatible; ResearchBot/1.0)",
    }

    def parse(self, response):
        # One item per product card on the listing page.
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

That is the complete spider. Run with scrapy crawl products -o products.json and you get a clean JSON dataset.
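
For a longer-running crawl, the politeness and resume features mentioned earlier are switched on through settings. The dictionary below is a sketch using real Scrapy setting names; the specific values and the job directory path are assumptions to tune per target.

```python
# Candidate additions to a spider's custom_settings (values are illustrative).
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,                  # respect robots.txt rules
    "AUTOTHROTTLE_ENABLED": True,            # adapt delay to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,         # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 30.0,          # upper bound when the site slows down
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,  # average parallel requests per server
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    "RETRY_TIMES": 3,                        # retries per failed request
    "JOBDIR": "crawls/products-run1",        # persists scheduler state for resuming
}
```

The same JOBDIR value can instead be passed on the command line with scrapy crawl products -s JOBDIR=crawls/products-run1; rerunning the identical command after a graceful stop picks the crawl up where it left off.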

Playwright Deep Dive

Playwright is Microsoft's modern browser automation library, released in 2020, and by 2024 it had become the favourite of professional scrapers. The Python binding (playwright-python) provides full Chromium, Firefox, and WebKit control with both sync and async APIs.

What Playwright Gets Right

  • JavaScript execution: A full Chromium engine renders even complex SPAs the way a real browser does.
  • Auto-waiting: Built-in waits for elements to appear, become visible, or stop animating - eliminates flaky selector code.
  • Anti-detection: playwright-stealth and rebrowser-playwright leak fewer browser fingerprint signals than Selenium.
  • Network interception: Can capture, modify, or block API calls a page makes - often more efficient than scraping rendered HTML.
  • Multi-browser: Chromium, Firefox, and WebKit (Safari engine) with one API.
  • Tracing and screenshots: Built-in debugging tools that make complex failures observable.
  • Modern syntax: Async/await first, clean Pythonic API.
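
As a rough sketch of the network-interception point above: the handler below blocks heavy resources, and fetch_api_payload captures the JSON response a page requests instead of parsing rendered HTML. The "/api/products" URL fragment and the list of blocked resource types are assumptions for illustration; the Playwright import sits inside the function so the routing logic can be exercised without a browser installed.

```python
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}  # illustrative choice


def should_block(resource_type: str) -> bool:
    """Decide whether a request is worth skipping to save bandwidth."""
    return resource_type in BLOCKED_RESOURCE_TYPES


async def handle_route(route):
    # Abort heavy assets, let everything else through unchanged.
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()


async def fetch_api_payload(url: str, api_fragment: str = "/api/products"):
    """Load a page and return the JSON body of the first matching API call."""
    from playwright.async_api import async_playwright  # lazy import, see lead-in

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.route("**/*", handle_route)
        # Wait for the API response that the page itself triggers on load.
        async with page.expect_response(lambda r: api_fragment in r.url) as info:
            await page.goto(url)
        response = await info.value
        payload = await response.json()
        await browser.close()
        return payload
```

A call such as asyncio.run(fetch_api_payload("https://example.com/category/widgets")) would return the decoded JSON, assuming the page really does call an endpoint matching that fragment.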

Where Playwright Falls Short

  • Resource cost: Each browser context uses 150 to 300 MB of RAM. Running hundreds in parallel needs serious infrastructure.
  • Speed ceiling: Even fast machines top out around 200 to 300 pages per minute per worker due to rendering overhead.
  • No native crawler framework: You build the scheduler, dedupe, and pipeline yourself or pair with another tool.
  • Dependency footprint: Each browser install is roughly 300 MB. Docker images bloat quickly.

Playwright Code Sample

import asyncio
import json

from playwright.async_api import async_playwright


async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1366, "height": 768},
        )
        page = await context.new_page()
        await page.goto("https://example.com/category/widgets", wait_until="networkidle")
        await page.wait_for_selector("div.product-card")

        # Extract every card in one round trip by running JS inside the page.
        items = await page.evaluate("""
            () => Array.from(document.querySelectorAll('div.product-card')).map(el => ({
                name: el.querySelector('h3')?.innerText,
                price: el.querySelector('.price')?.innerText,
                url: el.querySelector('a')?.href,
            }))
        """)
        with open("products.json", "w") as f:
            json.dump(items, f, indent=2)
        await browser.close()


asyncio.run(scrape())

Playwright is verbose compared to Scrapy but vastly more capable against JavaScript-rendered targets.

Selenium Deep Dive

Selenium is the elder statesman - originally a browser test automation tool from 2004, repurposed for scraping by an entire generation of developers. Selenium 4 (released 2021, maintained actively through 2026) modernised the API with the W3C WebDriver protocol.

What Selenium Gets Right

  • Maturity: 20-plus years of community knowledge. Almost every problem has a documented solution.
  • Broad browser support: Chrome, Firefox, Edge, Safari, Internet Explorer (still, in 2026), and obscure mobile browser drivers.
  • undetected-chromedriver: A community-maintained package that patches ChromeDriver to strip automation fingerprints - the most widely used stealth option for Selenium-based scraping against anti-bot stacks.
  • Grid: Selenium Grid for distributing across many machines is battle-tested at enterprise scale.
  • Cross-language: If your stack is partly Java, .NET, Ruby, or JavaScript, Selenium gives near-identical APIs in all of them.

Where Selenium Falls Short

  • Speed: The classic WebDriver protocol makes an HTTP round trip per command, whereas Playwright keeps a persistent connection to the browser (CDP in the case of Chromium).
  • Manual waits: Without explicit WebDriverWait calls, scripts are brittle.
  • Anti-detection: Stock Selenium leaks the navigator.webdriver flag and other fingerprints. Workarounds exist but are an ongoing arms race.
  • Async support: Native async is limited compared to Playwright's first-class asyncio API.

Selenium Code Sample

import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument("--user-agent=Mozilla/5.0 ResearchBot/1.0")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/category/widgets")
    # Explicit wait: block for up to 10 s until the first card is in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product-card"))
    )

    items = []
    for card in driver.find_elements(By.CSS_SELECTOR, "div.product-card"):
        items.append({
            "name": card.find_element(By.CSS_SELECTOR, "h3").text,
            "price": card.find_element(By.CSS_SELECTOR, ".price").text,
            "url": card.find_element(By.CSS_SELECTOR, "a").get_attribute("href"),
        })
finally:
    driver.quit()  # always release the browser, even if extraction fails

with open("products.json", "w") as f:
    json.dump(items, f, indent=2)

BeautifulSoup and Requests: The Simple Pair

For straightforward scraping jobs against static sites, the BeautifulSoup-plus-Requests combination remains unmatched for simplicity. Requests makes the HTTP call, BeautifulSoup parses the HTML, and you write a few lines of Python. No framework, no scheduler, no browser.

When to Reach for BeautifulSoup

  • One-off extractions of a handful to a few hundred pages.
  • Prototyping before committing to a full Scrapy or Playwright project.
  • Hitting JSON APIs directly where there is no rendering to do.
  • Learning the basics of HTML parsing before moving to heavier frameworks.
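
When the target exposes a JSON API, the whole job can reduce to a pagination loop. The sketch below separates that loop from the HTTP call so it can be checked offline; the endpoint URL, the page query parameter, and the "items" key in the response are all hypothetical and would need to match the real API.

```python
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int], dict], max_pages: int = 100) -> Iterator[dict]:
    """Yield items page by page until the API returns an empty page."""
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items = payload.get("items", [])
        if not items:
            break  # an empty page is treated as the end of the listing
        yield from items


def fetch_with_requests(page: int) -> dict:
    """Fetch one page of a hypothetical JSON API using Requests."""
    import requests  # imported lazily so iter_items can be tested offline

    r = requests.get(
        "https://example.com/api/widgets",  # placeholder endpoint
        params={"page": page},
        timeout=15,
    )
    r.raise_for_status()
    return r.json()
```

With a real endpoint, list(iter_items(fetch_with_requests)) would collect every record; the max_pages cap is a guard against APIs that never return an empty page.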

When to Skip It

  • Anything JavaScript-rendered - BeautifulSoup cannot help.
  • High-volume crawls - you will rebuild Scrapy badly.
  • Sites with anti-bot protection - Requests offers no fingerprint defence.

BeautifulSoup Code Sample

import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/category/widgets"

r = requests.get(
    URL,
    headers={"User-Agent": "Mozilla/5.0 ResearchBot/1.0"},
    timeout=15,
)
r.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(r.text, "lxml")  # the "lxml" parser requires the lxml package

items = []
for card in soup.select("div.product-card"):
    items.append({
        "name": card.select_one("h3").get_text(strip=True),
        "price": card.select_one(".price").get_text(strip=True),
        "url": urljoin(URL, card.select_one("a")["href"]),  # resolve relative links
    })

with open("products.json", "w") as f:
    json.dump(items, f, indent=2)

Performance Benchmarks (2026)

Numbers below are internal SpiderHunts benchmarks on a 4-vCPU, 8 GB RAM Linux VM crawling a static product catalogue and a JavaScript-heavy SPA. All tools used residential proxies with a two-second average delay.

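| Tool                | Static site (pages/min)   | JS SPA (pages/min)  | RAM per worker          |
|---------------------|---------------------------|---------------------|-------------------------|
| Scrapy              | 2800                      | N/A (cannot render) | 40 MB                   |
| Playwright          | 320                       | 240                 | 220 MB                  |
| Selenium            | 140                       | 95                  | 280 MB                  |
| Requests + BS4      | 1100                      | N/A (cannot render) | 25 MB                   |
| Scrapy + Playwright | 2400 (rendering disabled) | 280                 | 240 MB (when rendering) |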

The pattern is consistent: Scrapy dominates for static, Playwright leads when rendering is needed, Selenium trails Playwright in both regimes, and Requests is the lightest and quickest to set up for tiny jobs but lacks the crawl machinery for serious projects.
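
To translate the static-site figures into wall-clock time, the short calculation below estimates how long a one-million-page crawl would take on a single machine at each benchmark rate. The one-million figure is an arbitrary example, and the estimate assumes throughput stays constant for the whole run.

```python
# Static-site throughput from the benchmark table (pages per minute).
PAGES_PER_MIN = {
    "Scrapy": 2800,
    "Requests + BS4": 1100,
    "Playwright": 320,
    "Selenium": 140,
}


def crawl_hours(pages: int, pages_per_min: float) -> float:
    """Hours needed to fetch `pages` at a constant rate."""
    return pages / pages_per_min / 60


for tool, rate in PAGES_PER_MIN.items():
    print(f"{tool}: {crawl_hours(1_000_000, rate):.1f} h")
```

At these rates the same crawl takes about 6 hours with Scrapy and about 119 hours with Selenium, a 20-fold gap.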

Anti-Bot Bypass: Tool by Tool

This is where 2026 looks dramatically different from 2022. Anti-bot vendors (Cloudflare, DataDome, Akamai Bot Manager, PerimeterX) have raised the bar significantly. Here is how each tool stacks up.

Scrapy

Cannot defeat JavaScript challenges alone. Production setups pair Scrapy with scrapy-playwright, scrapy-impersonate (for TLS fingerprint matching), or commercial unblockers like ScrapingBee. For HTTP/2 fingerprint matching, curl_cffi has become the standard underlying client in 2026.

Playwright

The strongest standalone option. Combined with playwright-stealth or rebrowser-playwright, plus residential proxies and realistic browser contexts (viewport, locale, time zone, hardware concurrency), it gets past most commercial bot detection as deployed in its default configuration. For Cloudflare Turnstile specifically, the combination of patched Chromium builds, interactive mouse movement, and residential IPs is what works.
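
A realistic browser context mostly means keeping the advertised signals consistent with each other and with the proxy's location. The sketch below bundles them into one profile and converts it to keyword arguments that Playwright's new_context accepts (user_agent, locale, timezone_id, viewport). The BrowserProfile class and the UK_DESKTOP values are hypothetical; hardware concurrency is not a context option and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserProfile:
    """One coherent set of browser signals (hypothetical helper)."""
    user_agent: str
    locale: str
    timezone_id: str
    width: int = 1366
    height: int = 768


def context_options(profile: BrowserProfile) -> dict:
    """Build keyword arguments for Playwright's browser.new_context()."""
    return {
        "user_agent": profile.user_agent,
        "locale": profile.locale,
        "timezone_id": profile.timezone_id,
        "viewport": {"width": profile.width, "height": profile.height},
    }


# Example profile meant to be paired with a UK residential proxy.
UK_DESKTOP = BrowserProfile(
    user_agent=(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    locale="en-GB",
    timezone_id="Europe/London",
)
```

Inside an async Playwright session this would be applied as context = await browser.new_context(**context_options(UK_DESKTOP)).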

Selenium

undetected-chromedriver remains effective but lags Playwright's stealth ecosystem by six to twelve months. Selenium-stealth and SeleniumBase UC mode are the other common picks. For very high-protection targets, Selenium is rarely the right tool in 2026.

Requests / httpx

Plain Requests is detectable instantly by any modern stack. curl_cffi and tls-client offer JA3/JA4 TLS fingerprint impersonation that makes Python HTTP calls look like real Chrome at the TLS layer - a vital tool when you need raw HTTP speed but must look like a browser at the network level.

Decision Framework

Use this checklist to pick the right tool in under two minutes.

Pick Scrapy When

  • Crawl volume is more than 10,000 pages.
  • Target site is mostly static HTML or has clean JSON APIs.
  • You need persistent scheduling, dedupe, and pipelines out of the box.
  • Multiple spiders will live in the same project.
  • Infrastructure cost matters.

Pick Playwright When

  • Target site renders content with JavaScript (React, Vue, Angular, Svelte).
  • There is anti-bot protection that requires browser-level evasion.
  • You need to interact with the page (click, type, scroll, hover).
  • Network interception of XHR or fetch calls would simplify the job.
  • Volume is moderate (up to a few hundred thousand pages).

Pick Selenium When

  • Team already has deep Selenium expertise.
  • You must support an unusual browser (legacy IE, certain mobile drivers).
  • Cross-language consistency with existing Java or .NET test infrastructure matters.
  • Project will use Selenium Grid for distributed execution.

Pick BeautifulSoup plus Requests When

  • Job is one-off and small (under a few hundred pages).
  • Target is static HTML.
  • No anti-bot considerations.
  • You are prototyping before scaling up.
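
The four checklists above can be condensed into a small function. This is one possible encoding, not a definitive rule set: the 10,000-page threshold comes from the Scrapy checklist, while the cut-offs of 300 pages for "a few hundred" and 300,000 for "a few hundred thousand" are assumptions.

```python
def pick_tool(pages: int, needs_js: bool, anti_bot: bool = False,
              legacy_browser: bool = False) -> str:
    """Suggest a default tool from crawl size and target characteristics."""
    if legacy_browser:
        # Unusual or legacy browsers are Selenium territory.
        return "Selenium"
    if needs_js or anti_bot:
        # Browser rendering is required; very large crawls add Scrapy orchestration.
        return "Playwright" if pages <= 300_000 else "Scrapy + scrapy-playwright"
    if pages <= 300:
        # Small static job: no framework needed.
        return "Requests + BeautifulSoup"
    # Static HTML at volume: Scrapy's scheduler and pipelines pay off.
    return "Scrapy"


print(pick_tool(50_000, needs_js=False))
print(pick_tool(5_000, needs_js=True))
```

Team expertise and existing infrastructure, which the checklists also mention, are deliberately left out of this sketch and can override its suggestion.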

Combine Tools When

The strongest real-world pipelines mix tools. Scrapy plus scrapy-playwright is the most common production combination: Scrapy handles the crawl orchestration and pipeline machinery, while Playwright renders the small percentage of pages that require JS. For maximum throughput against modern protected sites, a hybrid using curl_cffi for HTTP fingerprint matching combined with Playwright for challenge-page passes delivers the best price-performance.
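
The hybrid approach can be sketched as "try cheap HTTP first, escalate to a browser only when blocked". In the code below the status codes and body markers used to detect a block are illustrative heuristics rather than a reliable detector, and browser_get stands for any Playwright-based fetcher you supply. The curl_cffi call uses that library's requests-style API with its impersonate option.

```python
from typing import Callable, Tuple

# Illustrative markers only; real challenge pages vary by vendor and over time.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "captcha")


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: does this response look like a block or challenge page?"""
    if status in (403, 429, 503):
        return True
    head = body[:5000].lower()
    return any(marker in head for marker in CHALLENGE_MARKERS)


def fetch(url: str,
          http_get: Callable[[str], Tuple[int, str]],
          browser_get: Callable[[str], str]) -> str:
    """Use the fast HTTP client; fall back to the browser when blocked."""
    status, body = http_get(url)
    if looks_blocked(status, body):
        return browser_get(url)
    return body


def curl_cffi_get(url: str) -> Tuple[int, str]:
    """HTTP fetch with a Chrome-like TLS fingerprint via curl_cffi."""
    from curl_cffi import requests as creq  # lazy import: optional dependency

    r = creq.get(url, impersonate="chrome", timeout=15)
    return r.status_code, r.text
```

Because the two fetchers are passed in as callables, the escalation logic can be tested without network access, and the browser path is only paid for on the fraction of URLs that need it.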

Real-World Recommendations from SpiderHunts Engineers

After shipping more than a hundred scraping projects since 2015, the SpiderHunts engineering team's defaults look like this:

  • For lead generation projects on business directories: Scrapy with curl_cffi for TLS impersonation. Fast, reliable, cheap to run.
  • For e-commerce competitor monitoring: Scrapy plus scrapy-playwright. Most product list pages are server-rendered (Scrapy fast path), variant pickers and reviews are JS (Playwright fallback).
  • For real estate portals: Playwright direct - these sites are increasingly SPAs and use sophisticated bot detection.
  • For professional networks within ToS: Playwright with full stealth stack and per-account rate limiting. Treat carefully.
  • For news and PR monitoring: Scrapy with sitemap-driven URL discovery. RSS feeds where available for efficiency.
  • For quick client one-offs: Requests plus BeautifulSoup. Two-hour delivery, no framework overhead.

The single biggest engineering lesson: build the pipeline assuming the target site will change selectors every few months. Resilient extractors, structured logging, and quick rollback are worth more than raw scraping speed once a system is in production.
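
One way to build for changing selectors is an ordered fallback chain that logs whenever the primary extractor stops matching. The helper below is a minimal sketch of that idea; the name first_match and the (label, extractor) tuple format are assumptions, and each extractor is any callable that takes the parsed document and returns a string or None.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

log = logging.getLogger("extractor")


def first_match(doc, extractors: Sequence[Tuple[str, Callable]]) -> Optional[str]:
    """Try extractors in order and return the first non-empty result."""
    for position, (label, extract) in enumerate(extractors):
        value = extract(doc)
        if value:
            if position > 0:
                # A fallback fired: the primary selector probably needs updating.
                log.warning("primary selector failed; used fallback %r", label)
            return value.strip()
    log.error("no selector matched: %s", [label for label, _ in extractors])
    return None
```

With a Scrapy response, a chain might look like [("h3", lambda r: r.css("h3::text").get()), ("data-name", lambda r: r.css("[data-name]::attr(data-name)").get())], where the second selector is a hypothetical backup. The warning log gives early notice of a site change while the crawl keeps producing data.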

Final Verdict

There is no single best Python scraping tool - the answer depends on what you are scraping, at what volume, and how aggressive the target's defences are. That said, two tools belong in every serious scraper's toolbox in 2026: Scrapy for high-volume work and Playwright for everything JavaScript-heavy. Selenium remains useful in legacy contexts but is rarely the right default for a new project. BeautifulSoup plus Requests stays the friend of every quick job.

If you need a custom scraping pipeline built and delivered with the right tool choices baked in, that is exactly what the SpiderHunts Technologies web scraping team does. We pair the tool selection above with battle-tested anti-bot tooling, proxy infrastructure, and CRM delivery pipelines. For the business-side framing of a lead-generation scraping project, see our companion guide on web scraping for lead generation. And for downstream analytics on the data you collect, our data science service turns raw records into business intelligence.
