Scraping modern websites often fails when the initial HTML is only a placeholder and the real content is rendered by JavaScript after the page loads. Combining Scrapy with Selenium makes it possible to extract data from pages that require a real browser session for rendering, navigation, or interaction.

Scrapy excels at scheduling requests, following links, and parsing responses at high speed, but it does not execute JavaScript. Selenium drives a real browser (Chrome, Firefox, Edge), waits for dynamic elements to appear, then exposes the final DOM as HTML so Scrapy selectors (CSS/XPath) can extract items normally.

A browser-based renderer is slower and significantly more resource intensive than plain HTTP fetching, and it can block Scrapy concurrency when used in the same process. A compatible browser and driver must be available for the chosen Selenium WebDriver, and deprecated headless browsers such as PhantomJS should be avoided. Rate limiting and respecting robots.txt and site terms remain important because a browser can generate additional background requests beyond the primary page load.
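Scrapy enforces robots.txt on its own when ROBOTSTXT_OBEY is enabled (see the settings in step 7), but if a browser session is ever driven outside Scrapy, the same check can be done with the standard library. A minimal sketch; the rules and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# Check paths against the rules above before fetching them with a browser.
print(parser.can_fetch("*", "https://example.com/scroll/"))    # True
print(parser.can_fetch("*", "https://example.com/private/x"))  # False
```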

Steps to use Selenium with Scrapy for web scraping:

  1. Start a Selenium standalone Chromium container.
    $ docker run --detach --name selenium --network=container:sg-scrapy-verify --shm-size=2g seleniarm/standalone-chromium
    20fc7b23e8a38687604e949ed12f52473aaa559608394b439ce8959e175801ea
  2. Confirm the Selenium Grid is ready.
    $ curl -s http://localhost:4444/wd/hub/status
    {
      "value": {
        "ready": true,
        "message": "Selenium Grid ready.",
    ##### snipped #####
      }
    }
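    The readiness check can also be scripted, for example as a startup gate in CI. A small sketch that inspects the JSON payload returned by the /wd/hub/status endpoint; the field names match the curl output above:

```python
import json

def grid_is_ready(status_body: str) -> bool:
    """Return True when a Selenium Grid /status payload reports readiness."""
    payload = json.loads(status_body)
    return bool(payload.get("value", {}).get("ready", False))

# Example payload shaped like the curl output above.
body = '{"value": {"ready": true, "message": "Selenium Grid ready."}}'
print(grid_is_ready(body))  # True
```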
  3. Install Scrapy and Selenium in the project environment.
    $ python -m pip install scrapy selenium
    Collecting scrapy
      Using cached scrapy-2.13.4-py3-none-any.whl.metadata (4.4 kB)
    Collecting selenium
      Downloading selenium-4.39.0-py3-none-any.whl.metadata (7.5 kB)
    ##### snipped #####
    Successfully installed attrs-25.4.0 automat-25.4.16 certifi-2025.11.12 cffi-2.0.0 charset_normalizer-3.4.4 constantly-23.10.4 cryptography-46.0.3 cssselect-1.3.0 defusedxml-0.7.1 filelock-3.20.1 h11-0.16.0 hyperlink-21.0.0 idna-3.11 incremental-24.11.0 itemadapter-0.13.0 itemloaders-1.3.2 jmespath-1.0.1 lxml-6.0.2 outcome-1.3.0.post0 packaging-25.0 parsel-1.10.0 protego-0.5.0 pyasn1-0.6.1 pyasn1-modules-0.4.2 pycparser-2.23 pydispatcher-2.0.7 pyopenssl-25.3.0 pysocks-1.7.1 queuelib-1.8.0 requests-2.32.5 requests-file-3.0.1 scrapy-2.13.4 selenium-4.39.0 service-identity-24.2.0 sniffio-1.3.1 sortedcontainers-2.4.0 tldextract-5.3.1 trio-0.32.0 trio-websocket-0.12.2 twisted-25.5.0 typing_extensions-4.15.0 urllib3-2.6.2 w3lib-2.3.1 websocket-client-1.9.0 wsproto-1.3.2 zope-interface-8.1.1
  4. Create a new Scrapy project.
    $ scrapy startproject scrapy_selenium_demo
    New Scrapy project 'scrapy_selenium_demo', using template directory '/root/sg-work/selenium-venv/lib/python3.12/site-packages/scrapy/templates/project', created in:
        /root/sg-work/scrapy_selenium_demo
    ##### snipped #####
  5. Generate a spider skeleton for a JavaScript-rendered target.
    $ cd scrapy_selenium_demo
    $ scrapy genspider scroll_js app.internal.example
    Created spider 'scroll_js' using template 'basic' in module:
      scrapy_selenium_demo.spiders.scroll_js
  6. Create a Selenium downloader middleware module at scrapy_selenium_demo/scrapy_selenium_demo/selenium_middleware.py.
    scrapy_selenium_demo/scrapy_selenium_demo/selenium_middleware.py
    from __future__ import annotations
     
    from typing import Optional
     
    from scrapy import signals
    from scrapy.http import HtmlResponse, Request
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
     
     
    class SeleniumMiddleware:
        def __init__(self, timeout: int, headless: bool, command_executor: str):
            self.timeout = timeout
            self.headless = headless
            self.command_executor = command_executor
            self.driver: Optional[webdriver.Remote] = None
     
        @classmethod
        def from_crawler(cls, crawler):
            timeout = crawler.settings.getint("SELENIUM_TIMEOUT", 20)
            headless = crawler.settings.getbool("SELENIUM_HEADLESS", True)
            command_executor = crawler.settings.get(
                "SELENIUM_COMMAND_EXECUTOR", "http://localhost:4444/wd/hub"
            )
     
            middleware = cls(timeout=timeout, headless=headless, command_executor=command_executor)
            crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
            return middleware
     
        def spider_opened(self, spider):
            options = Options()
            if self.headless:
                options.add_argument("--headless")
            options.add_argument("--window-size=1920,1080")
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-dev-shm-usage")
            options.add_argument("--disable-gpu")
     
            self.driver = webdriver.Remote(
                command_executor=self.command_executor,
                options=options,
            )
     
        def spider_closed(self, spider, reason):
            if self.driver:
                self.driver.quit()
                self.driver = None
     
        def process_request(self, request: Request, spider):
            if not request.meta.get("selenium"):
                return None
     
            if not self.driver:
                raise RuntimeError("Selenium WebDriver is not initialized.")
     
            self.driver.get(request.url)
     
            wait_css = request.meta.get("selenium_wait_css")
            if wait_css:
                try:
                    WebDriverWait(self.driver, self.timeout).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
                    )
                except TimeoutException:
                    spider.logger.warning("Timed out waiting for selector: %s", wait_css)
     
            html = self.driver.page_source
            return HtmlResponse(
                url=self.driver.current_url,
                body=html.encode("utf-8"),
                encoding="utf-8",
                request=request,
            )

    The request.meta key selenium marks a request for browser rendering, and selenium_wait_css tells the middleware to wait until a CSS selector appears before handing the rendered HTML back to Scrapy.

  7. Configure the Selenium middleware in scrapy_selenium_demo/scrapy_selenium_demo/settings.py.
    scrapy_selenium_demo/scrapy_selenium_demo/settings.py
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_selenium_demo.selenium_middleware.SeleniumMiddleware": 800,
    }
     
    SELENIUM_COMMAND_EXECUTOR = "http://localhost:4444/wd/hub"
    SELENIUM_HEADLESS = True
    SELENIUM_TIMEOUT = 20
     
    CONCURRENT_REQUESTS = 1
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    DOWNLOAD_DELAY = 1.0
    ROBOTSTXT_OBEY = True

    A single shared WebDriver instance is used in this pattern, so concurrency should remain low unless a driver pool is implemented.
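    If more throughput is needed, the single shared driver can be replaced with a small pool. A generic sketch of such a pool; the factory and pool size are illustrative, and with Selenium the factory would build a webdriver.Remote:

```python
import queue
from contextlib import contextmanager

class DriverPool:
    """Hand out pre-built drivers one at a time; callers return them when done."""

    def __init__(self, factory, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    @contextmanager
    def lease(self, timeout: float = 30.0):
        # Block until a driver is free, then return it to the pool afterwards.
        driver = self._pool.get(timeout=timeout)
        try:
            yield driver
        finally:
            self._pool.put(driver)

    def close(self, quit_fn=lambda d: d.quit()):
        # Drain the pool and shut each driver down.
        while not self._pool.empty():
            quit_fn(self._pool.get_nowait())
```

Each lease() call blocks until a driver is free, which naturally caps browser concurrency at the pool size; CONCURRENT_REQUESTS can then be raised to match.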

  8. Replace scrapy_selenium_demo/scrapy_selenium_demo/spiders/scroll_js.py with a spider that flags Selenium-rendered requests.
    scrapy_selenium_demo/scrapy_selenium_demo/spiders/scroll_js.py
    import scrapy
     
     
    class ScrollJsSpider(scrapy.Spider):
        name = "scroll_js"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/scroll/"]
     
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    meta={"selenium": True, "selenium_wait_css": "#items li"},
                )
     
        def parse(self, response):
            for entry in response.css("#items li"):
                yield {"title": entry.css("::text").get()}

    Adjust selenium_wait_css to a stable element that appears only after JavaScript rendering completes.
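    For infinite-scroll pages, the middleware can also scroll the page before capturing page_source. A sketch of the scrolling loop, which process_request could call right after driver.get(); the driver object only needs execute_script, so the helper is shown standalone, and the pause and round limit are illustrative:

```python
import time

def scroll_to_bottom(driver, pause: float = 0.5, max_rounds: int = 20) -> int:
    """Scroll until the page height stops growing; return the rounds performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # height stabilized: no more content is loading
        last_height = new_height
    return max_rounds
```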

  9. Run the spider with feed export to write items to items.json.
    $ scrapy crawl scroll_js -O items.json
    2026-01-01 20:06:09 [scrapy.utils.log] INFO: Scrapy 2.13.4 started (bot: scrapy_selenium_demo)
    2026-01-01 20:06:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/scroll/> (referer: None)
    2026-01-01 20:06:10 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: items.json
    2026-01-01 20:06:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/response_status_count/200': 2,
     'item_scraped_count': 3,
     'finish_reason': 'finished'}
    ##### snipped #####

    Browser-driven scraping can trigger additional background requests and heavier load; keep delays conservative and follow site terms and access policies.
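    One way to keep the load conservative without hand-tuning DOWNLOAD_DELAY is Scrapy's AutoThrottle extension, which adapts delays to observed response latency. Illustrative values for settings.py:

```python
# AutoThrottle adjusts download delays based on server response times.
# The values below are illustrative starting points, not recommendations.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0         # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # average parallel requests per remote
```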

  10. Inspect the exported file for rendered content.
    $ python -m json.tool items.json
    [
        {
            "title": "Scroll Item 1"
        },
        {
            "title": "Scroll Item 2"
        },
        {
            "title": "Scroll Item 3"
        }
    ]
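    The exported feed can also be checked programmatically rather than by eye. A small sketch that parses the feed text and pulls out the titles; the sample string mirrors the items.json content above:

```python
import json

def feed_titles(feed_text: str) -> list[str]:
    """Extract the title field from every item in a JSON feed export."""
    return [item["title"] for item in json.loads(feed_text)]

sample = '[{"title": "Scroll Item 1"}, {"title": "Scroll Item 2"}, {"title": "Scroll Item 3"}]'
print(feed_titles(sample))  # ['Scroll Item 1', 'Scroll Item 2', 'Scroll Item 3']
```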
  11. Remove the Selenium container when you're done.
    $ docker rm -f selenium
    selenium