Web Scraping with Python 2026: Playwright, BeautifulSoup & Scrapy

Web scraping converts the world's unstructured public data into structured datasets — price monitoring, research, competitive intelligence, training data for AI models. Python is the dominant language for scraping due to its rich ecosystem: httpx and requests for HTTP, BeautifulSoup and lxml for parsing, Playwright for JavaScript-rendered pages, and Scrapy for large-scale crawls. This guide walks through every layer of the stack, from a five-line scraper to a production-grade distributed crawler, including the legal and ethical framework you must understand before writing a single request.

1. When to Scrape vs Use an API

Scraping is often unnecessary. Before writing any scraper, check:

  • Does the site have a public API? (Check /api, /api-docs, developer documentation, or RapidAPI)
  • Is there a data export feature? Many sites provide CSV/JSON exports for their data.
  • Is there a dataset already published? Kaggle, Hugging Face Datasets, data.gov, and Common Crawl often have what you need.
  • Does the site's robots.txt explicitly prohibit crawling? (https://example.com/robots.txt)

If none of these alternatives fits and scraping is still necessary, proceed, but proceed responsibly.
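Python's standard library can check a robots.txt body against a URL before you crawl it. A minimal sketch (the function name and sample user agent are illustrative; in practice you would first fetch `/robots.txt` from the target host):

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt rules allow this user agent to fetch the URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

`RobotFileParser` handles `User-agent` groups and `Disallow`/`Allow` rules for you; anything not explicitly disallowed is permitted.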

2. Legal and Ethical Framework

Web scraping legality varies by jurisdiction and depends on what data is accessed and how. Key principles:

  • Publicly available data: Generally permissible in most jurisdictions. The 2022 hiQ Labs v. LinkedIn ruling (9th Circuit) held that scraping publicly accessible data does not violate the CFAA in the US, though hiQ ultimately lost on LinkedIn's breach-of-contract claims, so terms of service still matter.
  • Respect robots.txt: While robots.txt is not legally binding in most countries, violating it is considered bad practice and can affect ToS litigation. Always read and respect it.
  • Terms of Service: Many sites prohibit scraping in their ToS. Violations can result in access termination and potentially civil litigation. Read the ToS before scraping.
  • Personal data (GDPR/CCPA): If scraping includes personal data (names, emails, phone numbers), GDPR (EU) and CCPA (California) impose strict requirements — you may need a legitimate interest under GDPR Article 6(1)(f) and transparency obligations.
  • Don't overload servers: Sending 1000 requests/second to a small website is effectively a DDoS attack. Rate limit aggressively.

3. BeautifulSoup: Static HTML Scraping

BeautifulSoup parses HTML into a navigable tree. Pair it with httpx (async-capable HTTP client) for efficient scraping of static pages:

import httpx
from bs4 import BeautifulSoup
import time
import random

def scrape_product_prices(urls: list[str]) -> list[dict]:
    results = []
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }

    with httpx.Client(headers=headers, follow_redirects=True, timeout=30) as client:
        for url in urls:
            resp = client.get(url)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "lxml")   # lxml is faster than html.parser

            # Select with CSS selectors
            name = soup.select_one("h1.product-title")
            price = soup.select_one("span[data-price]")

            results.append({
                "url": url,
                "name":  name.get_text(strip=True)  if name  else None,
                "price": price["data-price"]         if price else None,
            })

            # Polite delay: 1–3 seconds between requests
            time.sleep(random.uniform(1.0, 3.0))

    return results

4. Playwright: JavaScript-Rendered Pages

A large share of modern websites render content client-side with JavaScript. httpx + BeautifulSoup only sees the raw HTML served before any JS runs, which is useless for SPAs and dynamic content. Playwright controls a real browser (Chromium, Firefox, WebKit):

import asyncio
from playwright.async_api import async_playwright

async def scrape_dynamic_page(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,   # set False to watch browser during development
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()

        await page.goto(url, wait_until="networkidle")  # wait for JS to settle

        # Wait for specific element to be visible (more reliable than networkidle)
        await page.wait_for_selector(".product-grid", timeout=10000)

        # Extract content after JS rendering
        products = await page.evaluate("""
            () => Array.from(document.querySelectorAll('.product-card')).map(card => ({
                name: card.querySelector('.name')?.textContent?.trim(),
                price: card.querySelector('.price')?.textContent?.trim(),
                url: card.querySelector('a')?.href,
            }))
        """)

        await browser.close()
        return {"url": url, "products": products}

# Run
result = asyncio.run(scrape_dynamic_page("https://example.com/shop"))

Playwright vs Selenium: Playwright is the modern choice — faster, better async support, more reliable auto-waiting, and support for multiple browser engines. Selenium is still widely used in legacy test suites, but Playwright is preferred for new scraping projects.

5. Scrapy: Large-Scale Crawling

For crawling thousands or millions of pages, Scrapy's asynchronous architecture (built on Twisted) outperforms sequential requests/httpx:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalogue/page1.html"]

    custom_settings = {
        "DOWNLOAD_DELAY": 2,          # seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # randomize by ±50%
        "AUTOTHROTTLE_ENABLED": True,  # adaptive throttling based on server response
        "ROBOTSTXT_OBEY": True,        # respect robots.txt
        "CONCURRENT_REQUESTS": 8,
        "FEEDS": {"products.jsonl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        # Extract items from listing page
        for product_url in response.css("article.product a::attr(href)").getall():
            yield response.follow(product_url, self.parse_product)

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        # class attribute is e.g. "star-rating Three"; guard against a missing element,
        # since "".split()[-1] would raise IndexError
        rating_class = response.css("p.star-rating::attr(class)").get(default="")
        yield {
            "name":  response.css("h1::text").get(default="").strip(),
            "price": response.css("p.price_color::text").get(default="").strip(),
            "rating": rating_class.split()[-1] if rating_class else None,
            "url": response.url,
        }

# Run: scrapy runspider products_spider.py
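Cleaning logic belongs in Scrapy item pipelines, registered via the ITEM_PIPELINES setting. A hypothetical pipeline (class and module names are ours) that normalizes scraped price strings such as "£24.99" into floats:

```python
class PriceCleanerPipeline:
    """Normalize raw price strings ("£24.99", "$1,024.00") to floats.

    Register in settings.py, e.g.:
    ITEM_PIPELINES = {"myproject.pipelines.PriceCleanerPipeline": 300}
    """

    def process_item(self, item, spider):
        raw = item.get("price") or ""
        # keep only digits and the decimal point, dropping currency symbols/commas
        digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
        item["price"] = float(digits) if digits else None
        return item
```

Pipelines run on every yielded item, so spiders stay focused on extraction while cleaning and validation live in one place.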

6. Polite Scraping: Rate Limiting & Delays

  • Keep a minimum of 1–2 seconds between requests to the same domain (Scrapy's DOWNLOAD_DELAY setting).
  • Use AUTOTHROTTLE_ENABLED = True in Scrapy — automatically increases delays when the server is slow, reducing load during high-traffic periods.
  • Crawl during the site's low-traffic hours (nights/weekends) where possible.
  • Set a meaningful User-Agent that identifies your bot and includes a contact email: MyScraper/1.0 (+https://mysite.com/bot; contact@mysite.com)
  • Respect Crawl-delay values in robots.txt if specified.
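Outside Scrapy, the per-domain delay rule above can be enforced with a small helper. A sketch using monotonic timestamps (class and parameter names are illustrative):

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Block until at least min_delay seconds have passed since the
    last request to the same domain."""

    def __init__(self, min_delay: float = 2.0):
        self.min_delay = min_delay
        self.last_request = defaultdict(float)  # domain -> monotonic timestamp

    def wait(self, domain: str) -> None:
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[domain] = time.monotonic()
```

Call `limiter.wait(domain)` before each request; different domains never block each other because delays are tracked per key.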

7. Anti-Bot Detection and Evasion

Modern anti-bot systems (Cloudflare, Akamai Bot Manager, DataDome) are sophisticated:

Detection signal → evasion approach:

  • Missing browser fingerprint properties → use Playwright with playwright-stealth or undetected-playwright to patch fingerprint properties.
  • Predictable request timing → random delays plus RANDOMIZE_DOWNLOAD_DELAY.
  • Single IP making many requests → rotate residential proxies (Oxylabs, Bright Data, ScraperAPI).
  • No mouse movement / human behaviour → simulate mouse movements and scroll events in Playwright.
  • Headless browser detection → use undetected-chromium or Rebrowser; hide headless properties.
  • TLS fingerprint mismatch → use curl-cffi, which mimics Chrome's TLS handshake at the C library level.

8. IP Rotation and Proxy Management

import httpx
import itertools

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def get_with_rotation(url: str) -> httpx.Response:
    proxy = next(proxy_cycle)
    with httpx.Client(proxy=proxy, timeout=30) as client:
        resp = client.get(url)
        resp.raise_for_status()
        return resp

# For production: ScraperAPI or Bright Data handles rotation automatically
# httpx.get(f"https://api.scraperapi.com/?api_key={KEY}&url={url}")
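Proxy requests fail more often than direct ones, so rotation should be paired with retries. A generic backoff helper, written against any fetch callable so it is not tied to a particular client (function and parameter names are ours):

```python
import random
import time

def fetch_with_retries(fetch, url: str, max_attempts: int = 4,
                       base_delay: float = 1.0):
    """Call fetch(url), retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... scaled by random jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Each retry can also pull the next proxy from the rotation, so one bad exit node does not fail the whole request.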

9. Extracting Structured Data (JSON-LD & Microdata)

Many e-commerce and news sites embed structured data in their HTML that's far easier to parse than scraping CSS selectors:

import json
from bs4 import BeautifulSoup
import httpx

def extract_json_ld(url: str) -> list[dict]:
    html = httpx.get(url).text
    soup = BeautifulSoup(html, "lxml")
    schemas = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            # tag.string can be None for empty tags; "" then fails cleanly below
            schemas.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            pass
    return schemas

# Many product pages contain Product schema:
# {"@type": "Product", "name": "...", "offers": {"price": "24.99", ...}}

10. Storing and Processing Scraped Data

  • Small data (<1M rows): SQLite with Python's built-in sqlite3 or DuckDB (excellent for analytical queries on scraped data).
  • Medium data: PostgreSQL with psycopg3 or SQLAlchemy.
  • Large/unstructured: Parquet files + DuckDB for columnar analytics without an always-on database server.
  • Deduplication: Hash page content (SHA-256) or canonical URL to avoid storing the same page twice.
  • Change detection: Store a hash of each item; re-scrape and compare hashes to detect changes without storing full page history.
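The deduplication idea above can be sketched with SQLite, keying rows on a SHA-256 of the page body so identical content is stored once (table and column names are illustrative):

```python
import hashlib
import sqlite3

def store_page(conn: sqlite3.Connection, url: str, html: str) -> bool:
    """Insert a page unless identical content was already stored.

    Returns True if inserted, False if it was a duplicate."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT, content_hash TEXT PRIMARY KEY, html TEXT)"
    )
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    try:
        conn.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, digest, html))
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # same content hash already present
        return False
```

Making the hash the primary key pushes dedup into the database, so concurrent writers cannot race past an application-level check.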

11. Frequently Asked Questions

Is web scraping legal?

In most countries, scraping publicly available data is legal. The key constraints are: don't violate CFAA (US) by bypassing access controls, respect GDPR/CCPA when personal data is involved, don't violate contract terms in a ToS you've accepted, and don't cause server harm. Always consult a lawyer for commercial scraping projects — the legal landscape is evolving rapidly.

What is the fastest Python scraping setup?

For static pages: httpx with asyncio (async concurrent requests) + lxml parser (fastest HTML parser in Python). For dynamic pages: Playwright with async API. For large crawls: Scrapy with concurrent requests and autothrottle. Combining Scrapy + Playwright via scrapy-playwright handles mixed static/dynamic crawls efficiently.

12. Glossary

BeautifulSoup
A Python library for parsing HTML and XML, navigating the parse tree with CSS selectors and DOM navigation.
Playwright
A browser automation library that controls headless Chrome/Firefox/WebKit for scraping JavaScript-rendered pages.
Scrapy
An asynchronous Python web crawling framework for large-scale scraping with built-in pipelines, middleware, and storage integration.
robots.txt
A file at the root of a website that specifies which parts crawlers are allowed or disallowed to access.
Anti-bot
Systems like Cloudflare Bot Management or DataDome that detect and block automated browser traffic.
JSON-LD
Linked Data embedded in <script type="application/ld+json"> tags, often containing structured product, article, or event data.

13. Next Steps

Start with a simple BeautifulSoup scraper on a static site you're allowed to scrape, like books.toscrape.com (built specifically for scraping practice). Extract all books and their prices to a CSV. Once that works, add Playwright for a JS-rendered site.