1. When to Scrape vs Use an API
Scraping is often unnecessary. Before writing any scraper, check:
- Does the site have a public API? (Check `/api`, `/api-docs`, developer documentation, or RapidAPI.)
- Is there a data export feature? Many sites provide CSV/JSON exports for their data.
- Is there a dataset already published? Kaggle, Hugging Face Datasets, data.gov, and Common Crawl often have what you need.
- Does the site's `robots.txt` explicitly prohibit crawling? (Check https://example.com/robots.txt)
If the answer to all of the above is "scraping is still necessary," proceed — but proceed responsibly.
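The robots.txt check in the list above can be automated with Python's standard library before any scraper runs. A minimal sketch (the function name and example rules are illustrative):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check whether a crawler may fetch a path, given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

robots = """
User-agent: *
Disallow: /admin/
Allow: /
"""
print(is_allowed(robots, "MyScraper", "/products"))     # True
print(is_allowed(robots, "MyScraper", "/admin/users"))  # False
```

In practice you would fetch the live file with `RobotFileParser.set_url(...)` plus `read()`, but parsing a string keeps the example self-contained.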
2. Legal and Ethical Framework
Web scraping legality varies by jurisdiction and depends on what data is accessed and how. Key principles:
- Publicly available data: Generally permissible in most jurisdictions. The 2022 hiQ Labs v. LinkedIn ruling (9th Circuit) held that scraping publicly accessible data is not a CFAA violation in the US.
- Respect `robots.txt`: While robots.txt is not legally binding in most countries, violating it is bad practice and can weigh against you in ToS litigation. Always read and respect it.
- Terms of Service: Many sites prohibit scraping in their ToS. Violations can result in access termination and potentially civil litigation. Read the ToS before scraping.
- Personal data (GDPR/CCPA): If scraping includes personal data (names, emails, phone numbers), GDPR (EU) and CCPA (California) impose strict requirements — you may need a legitimate interest under GDPR Article 6(1)(f) and transparency obligations.
- Don't overload servers: Sending 1000 requests/second to a small website is effectively a DDoS attack. Rate limit aggressively.
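The "rate limit aggressively" principle can be enforced with a small per-domain limiter. A minimal sketch — the class name is illustrative, and the clock/sleep functions are injectable only to make the behaviour testable:

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float = 2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock   # injectable for testing
        self.sleep = sleep
        self.last_request: dict[str, float] = {}

    def wait(self, url: str) -> None:
        """Block until it is polite to hit this URL's domain again."""
        domain = urlparse(url).netloc
        now = self.clock()
        last = self.last_request.get(domain)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request[domain] = self.clock()
```

Call `limiter.wait(url)` before every request; requests to different domains are not delayed against each other.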
3. BeautifulSoup: Static HTML Scraping
BeautifulSoup parses HTML into a navigable tree. Pair it with httpx (async-capable HTTP client) for efficient scraping of static pages:
```python
import httpx
from bs4 import BeautifulSoup
import time
import random

def scrape_product_prices(urls: list[str]) -> list[dict]:
    results = []
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    with httpx.Client(headers=headers, follow_redirects=True, timeout=30) as client:
        for url in urls:
            resp = client.get(url)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "lxml")  # lxml is faster than html.parser
            # Select with CSS selectors
            name = soup.select_one("h1.product-title")
            price = soup.select_one("span[data-price]")
            results.append({
                "url": url,
                "name": name.get_text(strip=True) if name else None,
                "price": price["data-price"] if price else None,
            })
            # Polite delay: 1–3 seconds between requests
            time.sleep(random.uniform(1.0, 3.0))
    return results
```
4. Playwright: JavaScript-Rendered Pages
Many modern websites render content client-side with JavaScript. httpx + BeautifulSoup only sees the raw HTML served before any JS runs — useless for SPAs and dynamic content. Playwright controls a real browser (Chromium, Firefox, WebKit):
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_dynamic_page(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,  # set False to watch the browser during development
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
            viewport={"width": 1920, "height": 1080},
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")  # wait for JS to settle
        # Wait for a specific element to be visible (more reliable than networkidle)
        await page.wait_for_selector(".product-grid", timeout=10000)
        # Extract content after JS rendering
        products = await page.evaluate("""
            () => Array.from(document.querySelectorAll('.product-card')).map(card => ({
                name: card.querySelector('.name')?.textContent?.trim(),
                price: card.querySelector('.price')?.textContent?.trim(),
                url: card.querySelector('a')?.href,
            }))
        """)
        await browser.close()
        return {"url": url, "products": products}

# Run
result = asyncio.run(scrape_dynamic_page("https://example.com/shop"))
```
Playwright vs Selenium: Playwright is the modern choice — faster, better async support, more reliable auto-waiting, and support for multiple browser engines. Selenium remains common in legacy test suites, but Playwright is preferred for new scraping projects.
5. Scrapy: Large-Scale Crawling
For crawling thousands or millions of pages, Scrapy's asynchronous architecture (built on Twisted) outperforms sequential requests/httpx:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalogue/page1.html"]

    custom_settings = {
        "DOWNLOAD_DELAY": 2,               # seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # randomize by ±50%
        "AUTOTHROTTLE_ENABLED": True,      # adaptive throttling based on server response
        "ROBOTSTXT_OBEY": True,            # respect robots.txt
        "CONCURRENT_REQUESTS": 8,
        "FEEDS": {"products.jsonl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        # Extract items from the listing page
        for product_url in response.css("article.product a::attr(href)").getall():
            yield response.follow(product_url, self.parse_product)
        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        # class attribute is e.g. "star-rating Three" — the last token is the rating
        rating_class = response.css("p.star-rating::attr(class)").get(default="")
        yield {
            "name": response.css("h1::text").get(default="").strip(),
            "price": response.css("p.price_color::text").get(default="").strip(),
            "rating": rating_class.split()[-1] if rating_class else None,
            "url": response.url,
        }

# Run: scrapy runspider products_spider.py
```
6. Polite Scraping: Rate Limiting & Delays
- Wait at least 1–2 seconds between requests to the same domain (Scrapy's `DOWNLOAD_DELAY` setting).
- Use `AUTOTHROTTLE_ENABLED = True` in Scrapy — it automatically increases delays when the server responds slowly, reducing load during high-traffic periods.
- Crawl during the site's low-traffic hours (nights/weekends) if possible.
- Set a meaningful `User-Agent` that identifies your bot and includes a contact email: `MyScraper/1.0 (+https://mysite.com/bot; contact@mysite.com)`
- Respect `Crawl-delay` values in `robots.txt` if specified.
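A site's `Crawl-delay` can be read with the standard library's robotparser rather than parsed by hand. A minimal sketch (function name and fallback value are illustrative):

```python
from urllib.robotparser import RobotFileParser

def polite_delay(robots_txt: str, user_agent: str, default: float = 2.0) -> float:
    """Return the Crawl-delay for a user agent, falling back to a default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

robots = """
User-agent: *
Crawl-delay: 5
"""
print(polite_delay(robots, "MyScraper"))  # 5.0
```

The returned value can feed directly into the sleep between requests.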
7. Anti-Bot Detection and Evasion
Modern anti-bot systems (Cloudflare, Akamai Bot Manager, DataDome) are sophisticated:
| Detection Signal | Evasion Approach |
|---|---|
| Missing browser fingerprint properties | Use Playwright with playwright-stealth or undetected-playwright to patch fingerprint properties |
| Predictable request timing | Random delays + RANDOMIZE_DOWNLOAD_DELAY |
| Single IP making many requests | Rotate residential proxies (Oxylabs, Bright Data, ScraperAPI) |
| No mouse movement / human behaviour | Playwright: simulate mouse movements and scroll events |
| Headless browser detection | undetected-chromium or Rebrowser; hide headless properties |
| TLS fingerprint mismatch | Use curl-cffi — mimics Chrome's TLS handshake at the C library level |
8. IP Rotation and Proxy Management
```python
import httpx
import itertools

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def get_with_rotation(url: str) -> httpx.Response:
    proxy = next(proxy_cycle)
    with httpx.Client(proxy=proxy, timeout=30) as client:
        resp = client.get(url)
        resp.raise_for_status()
        return resp

# For production: ScraperAPI or Bright Data handles rotation automatically
# httpx.get(f"https://api.scraperapi.com/?api_key={KEY}&url={url}")
```
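Blind cycling keeps routing requests through dead proxies. A small pool that retires a proxy after repeated failures is more robust; a sketch, with class name and thresholds chosen for illustration:

```python
class ProxyPool:
    """Rotate proxies round-robin and retire ones that keep failing."""

    def __init__(self, proxies: list[str], max_failures: int = 3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures
        self._index = 0

    def next_proxy(self) -> str:
        if not self.proxies:
            raise RuntimeError("all proxies retired")
        proxy = self.proxies[self._index % len(self.proxies)]
        self._index += 1
        return proxy

    def report_failure(self, proxy: str) -> None:
        # Retire the proxy once it exceeds the failure threshold
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and proxy in self.proxies:
            self.proxies.remove(proxy)

    def report_success(self, proxy: str) -> None:
        self.failures[proxy] = 0
```

Wrap each request in try/except: call `report_failure` on connection errors or bot-block status codes, `report_success` otherwise.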
9. Extracting Structured Data (JSON-LD & Microdata)
Many e-commerce and news sites embed structured data in their HTML that's far easier to parse than scraping CSS selectors:
```python
import json
from bs4 import BeautifulSoup
import httpx

def extract_json_ld(url: str) -> list[dict]:
    html = httpx.get(url).text
    soup = BeautifulSoup(html, "lxml")
    schemas = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            schemas.append(json.loads(tag.string or ""))  # tag.string can be None
        except json.JSONDecodeError:
            pass
    return schemas

# Many product pages contain Product schema:
# {"@type": "Product", "name": "...", "offers": {"price": "24.99", ...}}
```
10. Storing and Processing Scraped Data
- Small data (<1M rows): SQLite with Python's built-in `sqlite3`, or DuckDB (excellent for analytical queries on scraped data).
- Medium data: PostgreSQL with `psycopg3` or SQLAlchemy.
- Large/unstructured: Parquet files + DuckDB for columnar analytics without an always-on database server.
- Deduplication: Hash page content (SHA-256) or canonical URL to avoid storing the same page twice.
- Change detection: Store a hash of each item; re-scrape and compare hashes to detect changes without storing full page history.
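The deduplication and change-detection bullets combine into one small SQLite pattern. A sketch, with table and function names chosen for illustration:

```python
import hashlib
import sqlite3

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def upsert_page(conn: sqlite3.Connection, url: str, html: str) -> str:
    """Store a page hash; return 'new', 'changed', or 'unchanged'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, hash TEXT)"
    )
    digest = content_hash(html)
    row = conn.execute("SELECT hash FROM pages WHERE url = ?", (url,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO pages (url, hash) VALUES (?, ?)", (url, digest))
        return "new"
    if row[0] != digest:
        conn.execute("UPDATE pages SET hash = ? WHERE url = ?", (digest, url))
        return "changed"
    return "unchanged"

conn = sqlite3.connect(":memory:")
print(upsert_page(conn, "https://example.com/a", "<html>v1</html>"))  # new
print(upsert_page(conn, "https://example.com/a", "<html>v1</html>"))  # unchanged
print(upsert_page(conn, "https://example.com/a", "<html>v2</html>"))  # changed
```

Skipping unchanged pages before parsing keeps storage flat without retaining full page history.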
11. Frequently Asked Questions
Is web scraping legal?
In most countries, scraping publicly available data is legal. The key constraints are: don't violate CFAA (US) by bypassing access controls, respect GDPR/CCPA when personal data is involved, don't violate contract terms in a ToS you've accepted, and don't cause server harm. Always consult a lawyer for commercial scraping projects — the legal landscape is evolving rapidly.
What is the fastest Python scraping setup?
For static pages: httpx with asyncio (async concurrent requests) + lxml parser (fastest HTML parser in Python). For dynamic pages: Playwright with async API. For large crawls: Scrapy with concurrent requests and autothrottle. Combining Scrapy + Playwright via scrapy-playwright handles mixed static/dynamic crawls efficiently.
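The async-concurrent pattern above can be sketched with a semaphore capping in-flight requests. The fetch callable is injected here to keep the sketch library-agnostic and self-contained; with httpx it would be an `AsyncClient`'s `get`:

```python
import asyncio

async def fetch_all(urls: list[str], fetch, max_concurrency: int = 5) -> list:
    """Fetch URLs concurrently, with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str):
        async with semaphore:
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))

# Demo with a stub fetcher; in real use: fetch = client.get
async def demo():
    async def fake_fetch(url: str) -> str:
        await asyncio.sleep(0)
        return f"fetched {url}"
    return await fetch_all(["https://a.com", "https://b.com"], fake_fetch)

print(asyncio.run(demo()))  # ['fetched https://a.com', 'fetched https://b.com']
```

Keep `max_concurrency` low per domain — concurrency multiplies load on the target server.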
12. Glossary
- BeautifulSoup
- A Python library for parsing HTML and XML, navigating the parse tree with CSS selectors and DOM navigation.
- Playwright
- A browser automation library that controls headless Chrome/Firefox/WebKit for scraping JavaScript-rendered pages.
- Scrapy
- An asynchronous Python web crawling framework for large-scale scraping with built-in pipelines, middleware, and storage integration.
- robots.txt
- A file at the root of a website that specifies which parts crawlers are allowed or disallowed to access.
- Anti-bot
- Systems like Cloudflare Bot Management or DataDome that detect and block automated browser traffic.
- JSON-LD
- Linked Data embedded in `<script type="application/ld+json">` tags, often containing structured product, article, or event data.
13. References & Further Reading
- Scrapy Documentation
- Playwright for Python Documentation
- BeautifulSoup Documentation
- httpx — Async Python HTTP Client
- scrapy-playwright — Playwright integration for Scrapy
Start with a simple BeautifulSoup scraper on a static site you're allowed to scrape, like books.toscrape.com (built specifically for scraping practice). Extract all books and their prices to a CSV. Once that works, add Playwright for a JS-rendered site.