Scraping modern websites often fails when the initial HTML is only a placeholder and the real content is rendered by JavaScript after the page loads. Combining Scrapy with Selenium makes it possible to extract data from pages that require a real browser session for rendering, navigation, or interaction.
Scrapy excels at scheduling requests, following links, and parsing responses at high speed, but it does not execute JavaScript. Selenium drives a real browser (Chrome, Firefox, Edge), waits for dynamic elements to appear, then exposes the final DOM as HTML so Scrapy selectors (CSS/XPath) can extract items normally.
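The hand-off is simple: Selenium produces the final DOM as an HTML string, and the extraction step pulls items out of it. As a rough standard-library sketch of that second half (Scrapy would use its own CSS/XPath selectors; `html.parser` here is only a stand-in for illustration, and the sample HTML is invented):

```python
from html.parser import HTMLParser

# Sample of what a browser might hand back after JavaScript rendering.
RENDERED_HTML = """
<ul id="items">
  <li>Scroll Item 1</li>
  <li>Scroll Item 2</li>
</ul>
"""

class ItemTextParser(HTMLParser):
    """Collect the text of <li> elements from rendered HTML."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li and data.strip():
            self.titles.append(data.strip())

parser = ItemTextParser()
parser.feed(RENDERED_HTML)
print(parser.titles)  # ['Scroll Item 1', 'Scroll Item 2']
```

Once the rendered HTML is wrapped in an HtmlResponse, as the middleware below does, the spider's normal `response.css(...)` extraction works unchanged.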
A browser-based renderer is slower and far more resource-intensive than plain HTTP fetching, and it can block Scrapy's concurrency when run in the same process. A compatible browser and driver must be available for the chosen Selenium WebDriver; deprecated headless browsers such as PhantomJS are not recommended. Rate limiting and respecting robots.txt and site terms remain important, because a browser can generate additional background requests beyond the primary page load.
Steps to use Selenium with Scrapy for web scraping:
- Start a Selenium standalone Chromium container.
$ docker run --detach --name selenium -p 4444:4444 --shm-size=2g seleniarm/standalone-chromium
20fc7b23e8a38687604e949ed12f52473aaa559608394b439ce8959e175801ea
- Confirm the Selenium Grid is ready.
$ curl -s http://localhost:4444/wd/hub/status
{
  "value": {
    "ready": true,
    "message": "Selenium Grid ready.",
    ##### snipped #####
  }
}
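The same readiness check can be scripted instead of eyeballed. A small sketch using only the standard library; `grid_ready` and `check_grid` are hypothetical helper names, and `check_grid` assumes the container from the previous step is running:

```python
import json
import urllib.request

def grid_ready(status):
    # The /status endpoint nests readiness under value.ready.
    return bool(status.get("value", {}).get("ready"))

def check_grid(url="http://localhost:4444/wd/hub/status"):
    # Polls the live grid; only works while the Selenium container is up.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return grid_ready(json.load(resp))

# Parsing the sample payload shown in the curl output above:
sample = {"value": {"ready": True, "message": "Selenium Grid ready."}}
print(grid_ready(sample))  # True
```

Looping on `check_grid` with a short sleep is a reasonable way to gate a crawl in automation, rather than assuming the grid is up.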
- Install Scrapy and Selenium in the project environment.
$ python -m pip install scrapy selenium
Collecting scrapy
  Using cached scrapy-2.13.4-py3-none-any.whl.metadata (4.4 kB)
Collecting selenium
  Downloading selenium-4.39.0-py3-none-any.whl.metadata (7.5 kB)
##### snipped #####
Successfully installed attrs-25.4.0 automat-25.4.16 certifi-2025.11.12 cffi-2.0.0 charset_normalizer-3.4.4 constantly-23.10.4 cryptography-46.0.3 cssselect-1.3.0 defusedxml-0.7.1 filelock-3.20.1 h11-0.16.0 hyperlink-21.0.0 idna-3.11 incremental-24.11.0 itemadapter-0.13.0 itemloaders-1.3.2 jmespath-1.0.1 lxml-6.0.2 outcome-1.3.0.post0 packaging-25.0 parsel-1.10.0 protego-0.5.0 pyasn1-0.6.1 pyasn1-modules-0.4.2 pycparser-2.23 pydispatcher-2.0.7 pyopenssl-25.3.0 pysocks-1.7.1 queuelib-1.8.0 requests-2.32.5 requests-file-3.0.1 scrapy-2.13.4 selenium-4.39.0 service-identity-24.2.0 sniffio-1.3.1 sortedcontainers-2.4.0 tldextract-5.3.1 trio-0.32.0 trio-websocket-0.12.2 twisted-25.5.0 typing_extensions-4.15.0 urllib3-2.6.2 w3lib-2.3.1 websocket-client-1.9.0 wsproto-1.3.2 zope-interface-8.1.1
- Create a new Scrapy project.
$ scrapy startproject scrapy_selenium_demo
New Scrapy project 'scrapy_selenium_demo', using template directory '/root/sg-work/selenium-venv/lib/python3.12/site-packages/scrapy/templates/project', created in:
    /root/sg-work/scrapy_selenium_demo
##### snipped #####
- Generate a spider skeleton for a JavaScript-rendered target.
$ cd scrapy_selenium_demo
$ scrapy genspider scroll_js app.internal.example
Created spider 'scroll_js' using template 'basic' in module:
  scrapy_selenium_demo.spiders.scroll_js
- Create a Selenium downloader middleware module at scrapy_selenium_demo/scrapy_selenium_demo/selenium_middleware.py.
- scrapy_selenium_demo/scrapy_selenium_demo/selenium_middleware.py
from __future__ import annotations

from typing import Optional

from scrapy import signals
from scrapy.http import HtmlResponse, Request
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware:
    def __init__(self, timeout: int, headless: bool, command_executor: str):
        self.timeout = timeout
        self.headless = headless
        self.command_executor = command_executor
        self.driver: Optional[webdriver.Remote] = None

    @classmethod
    def from_crawler(cls, crawler):
        timeout = crawler.settings.getint("SELENIUM_TIMEOUT", 20)
        headless = crawler.settings.getbool("SELENIUM_HEADLESS", True)
        command_executor = crawler.settings.get(
            "SELENIUM_COMMAND_EXECUTOR", "http://localhost:4444/wd/hub"
        )
        middleware = cls(timeout=timeout, headless=headless, command_executor=command_executor)
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        options = Options()
        if self.headless:
            options.add_argument("--headless")
        options.add_argument("--window-size=1920,1080")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        self.driver = webdriver.Remote(
            command_executor=self.command_executor,
            options=options,
        )

    def spider_closed(self, spider, reason):
        if self.driver:
            self.driver.quit()
            self.driver = None

    def process_request(self, request: Request, spider):
        if not request.meta.get("selenium"):
            return None
        if not self.driver:
            raise RuntimeError("Selenium WebDriver is not initialized.")
        self.driver.get(request.url)
        wait_css = request.meta.get("selenium_wait_css")
        if wait_css:
            try:
                WebDriverWait(self.driver, self.timeout).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
                )
            except TimeoutException:
                spider.logger.warning("Timed out waiting for selector: %s", wait_css)
        html = self.driver.page_source
        return HtmlResponse(
            url=self.driver.current_url,
            body=html.encode("utf-8"),
            encoding="utf-8",
            request=request,
        )
The request.meta key selenium routes a request through the browser, and selenium_wait_css tells the middleware which CSS selector to wait for before handing the rendered HTML back to Scrapy.
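That meta contract can be captured in a couple of lines. These helpers are hypothetical, not part of the middleware; they just sketch what the spider writes and what `process_request` reads:

```python
def selenium_meta(wait_css=None):
    # What a spider attaches to a request that needs browser rendering.
    meta = {"selenium": True}
    if wait_css:
        meta["selenium_wait_css"] = wait_css
    return meta

def uses_selenium(meta):
    # Mirrors the middleware's first check in process_request: requests
    # without the opt-in key fall through to Scrapy's normal downloader.
    return bool(meta.get("selenium"))

print(uses_selenium(selenium_meta("#items li")))  # True
print(uses_selenium({}))                          # False
```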
- Configure the Selenium middleware in scrapy_selenium_demo/scrapy_selenium_demo/settings.py.
- scrapy_selenium_demo/scrapy_selenium_demo/settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium_demo.selenium_middleware.SeleniumMiddleware": 800,
}

SELENIUM_COMMAND_EXECUTOR = "http://localhost:4444/wd/hub"
SELENIUM_HEADLESS = True
SELENIUM_TIMEOUT = 20

CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1.0
ROBOTSTXT_OBEY = True
A single shared WebDriver instance is used in this pattern, so concurrency should remain low unless a driver pool is implemented.
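If higher concurrency is needed, a bounded pool of drivers is one way to lift that restriction. A minimal sketch, not part of the tutorial's middleware: `DriverPool` and its `factory` parameter are hypothetical, where `factory` is any zero-argument callable returning a driver-like object with a `quit()` method (e.g. a lambda building a configured `webdriver.Remote`).

```python
import queue

class DriverPool:
    """Hand out a fixed number of WebDriver instances to concurrent requests."""

    def __init__(self, factory, size):
        self._pool = queue.Queue()
        self._drivers = []
        for _ in range(size):
            driver = factory()
            self._drivers.append(driver)
            self._pool.put(driver)

    def acquire(self, timeout=None):
        # Blocks until a driver is free, so browser use stays bounded at `size`.
        return self._pool.get(timeout=timeout)

    def release(self, driver):
        self._pool.put(driver)

    def close_all(self):
        for driver in self._drivers:
            driver.quit()
```

In this pattern, spider_opened would build the pool, process_request would acquire a driver before rendering and release it afterwards (in a finally block), and spider_closed would call close_all; CONCURRENT_REQUESTS could then be raised toward the pool size.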
- Replace scrapy_selenium_demo/scrapy_selenium_demo/spiders/scroll_js.py with a spider that flags Selenium-rendered requests.
- scrapy_selenium_demo/scrapy_selenium_demo/spiders/scroll_js.py
import scrapy


class ScrollJsSpider(scrapy.Spider):
    name = "scroll_js"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/scroll/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                meta={"selenium": True, "selenium_wait_css": "#items li"},
            )

    def parse(self, response):
        for entry in response.css("#items li"):
            yield {"title": entry.css("::text").get()}
Adjust selenium_wait_css to a stable element that appears only after JavaScript rendering completes.
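For pages that load more items as the user scrolls, waiting for one selector may not be enough; the middleware could also scroll before capturing page_source. A rough sketch of such an extension, under the assumption that the middleware calls it from process_request with its `webdriver.Remote` instance (`scroll_to_bottom` is a hypothetical helper, not part of the code above):

```python
import time

def scroll_to_bottom(driver, rounds=3, pause=0.5):
    # Scroll the window to the bottom a few times, pausing between rounds
    # so lazily loaded items have a chance to render before extraction.
    for _ in range(rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
```

A more robust variant would compare `document.body.scrollHeight` between rounds and stop once it no longer grows, instead of using a fixed round count.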
- Run the spider with feed export to write items to items.json.
$ scrapy crawl scroll_js -O items.json
2026-01-01 20:06:09 [scrapy.utils.log] INFO: Scrapy 2.13.4 started (bot: scrapy_selenium_demo)
2026-01-01 20:06:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/scroll/> (referer: None)
2026-01-01 20:06:10 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: items.json
2026-01-01 20:06:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_status_count/200': 2,
 'item_scraped_count': 3,
 'finish_reason': 'finished'}
##### snipped #####
Browser-driven scraping can trigger additional background requests and heavier load; keep delays conservative and follow site terms and access policies.
- Inspect the exported file for rendered content.
$ python -m json.tool items.json
[
    {
        "title": "Scroll Item 1"
    },
    {
        "title": "Scroll Item 2"
    },
    {
        "title": "Scroll Item 3"
    }
]
- Remove the Selenium container when you're done.
$ docker rm -f selenium
selenium
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
