How to use Selenium with Scrapy

Many pages return only a shell of HTML in the first response, then inject the real cards, quotes, or product rows after the browser runs JavaScript. A normal Scrapy request only sees that first response, so selectors can stay empty even when the page looks complete in a browser.

Selenium can fit into Scrapy as a custom downloader middleware for only the requests that need a real browser. The middleware opens the page in Chrome, waits for a selector that proves the rendered DOM is ready, and returns the result as an HtmlResponse so the spider can keep using standard Scrapy CSS or XPath selectors and feed exports.

Scrapy 2.13 and later use async def start() for custom start requests, and current Selenium releases handle browser driver setup with Selenium Manager, so a separate chromedriver install is usually unnecessary. Browser rendering is still much slower and heavier than plain HTTP crawling, so keep concurrency low when one browser session is shared, and prefer replaying a page's underlying API request when that data is available.

Steps to use Selenium with Scrapy:

  1. Install Selenium in the same Python environment as the Scrapy project.
    $ python3 -m pip install selenium

    Selenium Manager, which ships with current Selenium releases, resolves or downloads the required browser and driver assets automatically. The first browser launch can take longer while that resolution runs.

  2. Create a new Scrapy project for the rendered spider.
    $ scrapy startproject seleniumdemo
    New Scrapy project 'seleniumdemo', created in:
        /home/user/seleniumdemo

  3. Change to the project directory.
    $ cd seleniumdemo

  4. Replace seleniumdemo/middlewares.py with a downloader middleware that opens only Selenium-flagged requests in Chrome and returns the rendered DOM back to Scrapy.
    seleniumdemo/middlewares.py
    import logging
     
    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
     
     
    logger = logging.getLogger(__name__)
     
     
    class SeleniumDownloaderMiddleware:
        def __init__(self, wait_seconds, driver_arguments):
            self.wait_seconds = wait_seconds
            self.driver_arguments = driver_arguments
            self.driver = None
     
        @classmethod
        def from_crawler(cls, crawler):
            middleware = cls(
                wait_seconds=crawler.settings.getint("SELENIUM_WAIT_SECONDS", 10),
                driver_arguments=crawler.settings.getlist("SELENIUM_DRIVER_ARGUMENTS"),
            )
            crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
            return middleware
     
        def spider_opened(self, spider):
            options = Options()
            for argument in self.driver_arguments:
                options.add_argument(argument)
            self.driver = webdriver.Chrome(options=options)
     
        def spider_closed(self, spider, reason):
            if self.driver is not None:
                self.driver.quit()
                self.driver = None
     
        def process_request(self, request, spider):
            if not request.meta.get("selenium"):
                return None
     
            if self.driver is None:
                raise RuntimeError("Selenium WebDriver is not initialized.")
     
            self.driver.get(request.url)
     
            wait_css = request.meta.get("selenium_wait_css")
            if wait_css:
                try:
                    WebDriverWait(self.driver, self.wait_seconds).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
                    )
                except TimeoutException:
                    logger.warning("Timed out waiting for selector: %s", wait_css)
     
            return HtmlResponse(
                url=self.driver.current_url,
                body=self.driver.page_source,
                encoding="utf-8",
                request=request,
            )

    The custom request.meta["selenium"] flag keeps normal requests on Scrapy's regular downloader path, while selenium_wait_css delays parsing until the target selector exists in the rendered DOM.
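
    The flag check can be sketched in isolation. This is a minimal sketch only: the dicts below stand in for Scrapy Request objects so the routing decision runs without a browser or Scrapy installed, and "http" and "selenium" are just labels for the two downloader paths.

```python
# Minimal sketch of the routing contract: process_request() returning None
# means "fall through to Scrapy's regular downloader", so only flagged
# requests ever reach the browser. Plain dicts stand in for scrapy.Request.
def route(request):
    if request.get("meta", {}).get("selenium"):
        return "selenium"
    return "http"

flagged = {"url": "https://quotes.toscrape.com/js/", "meta": {"selenium": True}}
plain = {"url": "https://quotes.toscrape.com/", "meta": {}}
print(route(flagged))  # selenium
print(route(plain))    # http
```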

  5. Set the middleware, concurrency, and browser arguments in seleniumdemo/settings.py.
    seleniumdemo/settings.py
    BOT_NAME = "seleniumdemo"
     
    SPIDER_MODULES = ["seleniumdemo.spiders"]
    NEWSPIDER_MODULE = "seleniumdemo.spiders"
     
    ROBOTSTXT_OBEY = True
     
    CONCURRENT_REQUESTS = 1
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    DOWNLOAD_DELAY = 1
     
    DOWNLOADER_MIDDLEWARES = {
        "seleniumdemo.middlewares.SeleniumDownloaderMiddleware": 800,
    }
     
    SELENIUM_WAIT_SECONDS = 10
    SELENIUM_DRIVER_ARGUMENTS = [
        "--headless",
        "--window-size=1280,900",
    ]
     
    FEED_EXPORT_ENCODING = "utf-8"

    This example uses one shared browser session, so higher concurrency can serialize badly or mix page state unless separate browser instances or a driver pool are added.
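
    One way past the single-session bottleneck is a small pool of drivers checked out per request. This is a hedged, stdlib-only sketch: the strings stand in for real webdriver.Chrome instances, and in the middleware the pool would be filled in spider_opened() and each driver quit() in spider_closed().

```python
import queue

class DriverPool:
    """Fixed-size pool so concurrent requests never share page state."""

    def __init__(self, drivers):
        self._pool = queue.Queue()
        for driver in drivers:
            self._pool.put(driver)

    def acquire(self, timeout=30):
        # Blocks until a driver is free, bounding browser concurrency
        # to the pool size instead of one global session.
        return self._pool.get(timeout=timeout)

    def release(self, driver):
        self._pool.put(driver)

# Stub "drivers" instead of real Chrome sessions for the demonstration.
pool = DriverPool(["driver-a", "driver-b"])
first = pool.acquire()
second = pool.acquire()
pool.release(first)
third = pool.acquire()
print(first, second, third)  # driver-a driver-b driver-a
```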

  6. Create seleniumdemo/spiders/rendered.py with a spider that sends only its start request through Selenium.
    seleniumdemo/spiders/rendered.py
    import scrapy
     
     
    class RenderedSpider(scrapy.Spider):
        name = "rendered"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["https://quotes.toscrape.com/js/"]
     
        async def start(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    dont_filter=True,
                    meta={
                        "selenium": True,
                        "selenium_wait_css": ".quote",
                    },
                )
     
        def parse(self, response):
            for quote in response.css(".quote .text::text").getall()[:3]:
                yield {"quote": quote}

    Scrapy 2.13 and later support async def start() for custom start requests. If the same spider must also run on releases older than 2.13, add a matching start_requests() method as a compatibility path.

  7. Run the spider and export the rendered quotes to JSON.
    $ scrapy crawl rendered -O items.json
    Scraped from <200 https://quotes.toscrape.com/js/>
    {'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    {'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'}
    Stored json feed (3 items) in: items.json

    If the exported feed stays empty, first confirm that the wait selector matches an element in the post-render DOM; if rendering is still unreliable or too slow, replay the page's underlying API request instead of rendering the full browser page.
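
    For quotes.toscrape.com/js/ specifically, the raw, unrendered response already embeds the quote data as a var data = [...] JSON literal inside an inline script tag, so the Selenium path can often be skipped entirely. The regex and the "text" key below assume that embedded shape, and the sample body is a trimmed stand-in for the real page.

```python
import json
import re

def extract_embedded_quotes(html):
    # Pull the JSON array out of the inline <script> and return quote texts.
    match = re.search(r"var data = (\[.*?\]);", html, re.DOTALL)
    if not match:
        return []
    return [item["text"] for item in json.loads(match.group(1))]

# Trimmed stand-in for the raw page body a plain Scrapy request receives.
sample_body = """
<script>
var data = [
    {"author": {"name": "Albert Einstein"}, "text": "A sample quote."}
];
</script>
"""
print(extract_embedded_quotes(sample_body))  # ['A sample quote.']
```

    In a spider this would run in parse() on a normal, non-Selenium response, keeping the crawl as fast as plain HTTP.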

  8. Open the export file to confirm the browser-rendered quotes were written.
    $ cat items.json
    [
    {"quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"},
    {"quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"},
    {"quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"}
    ]