How to use Selenium with Scrapy

Many pages return only a shell of HTML in the first response, then inject the real cards, quotes, or product rows after the browser runs JavaScript. A normal Scrapy request only sees that first response, so selectors can stay empty even when the page looks complete in a browser.

Selenium can fit into Scrapy as a custom downloader middleware for only the requests that need a real browser. The middleware opens the page in Chrome, waits for a selector that proves the rendered DOM is ready, and returns the result as an HtmlResponse so the spider can keep using standard Scrapy CSS or XPath selectors and feed exports.

Scrapy 2.13 and later use async def start() for custom start requests, and current Selenium releases handle browser driver setup with Selenium Manager, so a separate chromedriver install is usually unnecessary. Browser rendering is still much slower and heavier than plain HTTP crawling, so keep concurrency low when one browser session is shared, and prefer replaying a page's underlying API request when that data is available.

Steps to use Selenium with Scrapy:

  1. Install Selenium in the same Python environment as the Scrapy project.
    $ python3 -m pip install selenium

    Selenium Manager, which ships with current Selenium releases, resolves or downloads the required browser and driver assets automatically. The first browser launch can take longer while that resolution runs.

  2. Create a new Scrapy project for the rendered spider.
    $ scrapy startproject seleniumdemo
    New Scrapy project 'seleniumdemo', created in:
        /home/user/seleniumdemo

  3. Change to the project directory.
    $ cd seleniumdemo

  4. Replace seleniumdemo/middlewares.py with a downloader middleware that opens only Selenium-flagged requests in Chrome and returns the rendered DOM back to Scrapy.
    seleniumdemo/middlewares.py
    import logging
     
    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
     
     
    logger = logging.getLogger(__name__)
     
     
    class SeleniumDownloaderMiddleware:
        def __init__(self, wait_seconds, driver_arguments):
            self.wait_seconds = wait_seconds
            self.driver_arguments = driver_arguments
            self.driver = None
     
        @classmethod
        def from_crawler(cls, crawler):
            middleware = cls(
                wait_seconds=crawler.settings.getint("SELENIUM_WAIT_SECONDS", 10),
                driver_arguments=crawler.settings.getlist("SELENIUM_DRIVER_ARGUMENTS"),
            )
            crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
            return middleware
     
        def spider_opened(self, spider):
            options = Options()
            for argument in self.driver_arguments:
                options.add_argument(argument)
            self.driver = webdriver.Chrome(options=options)
     
        def spider_closed(self, spider, reason):
            if self.driver is not None:
                self.driver.quit()
                self.driver = None
     
        def process_request(self, request, spider):
            if not request.meta.get("selenium"):
                return None
     
            if self.driver is None:
                raise RuntimeError("Selenium WebDriver is not initialized.")
     
            self.driver.get(request.url)
     
            wait_css = request.meta.get("selenium_wait_css")
            if wait_css:
                try:
                    WebDriverWait(self.driver, self.wait_seconds).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
                    )
                except TimeoutException:
                    logger.warning("Timed out waiting for selector: %s", wait_css)
     
            return HtmlResponse(
                url=self.driver.current_url,
                body=self.driver.page_source,
                encoding="utf-8",
                request=request,
            )

    The custom request.meta["selenium"] flag keeps normal requests on Scrapy's regular downloader path, while selenium_wait_css delays parsing until the target selector exists in the rendered DOM.
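
    The flag check can be sketched in isolation. This is a minimal sketch only: the dicts below stand in for Scrapy Request objects so the routing decision runs without a browser or Scrapy installed, and "http" and "selenium" are just labels for the two downloader paths.

```python
# Minimal sketch of the routing contract: process_request() returning None
# means "fall through to Scrapy's regular downloader", so only flagged
# requests ever reach the browser. Plain dicts stand in for scrapy.Request.
def route(request):
    if request.get("meta", {}).get("selenium"):
        return "selenium"
    return "http"

flagged = {"url": "https://quotes.toscrape.com/js/", "meta": {"selenium": True}}
plain = {"url": "https://quotes.toscrape.com/", "meta": {}}
print(route(flagged))  # selenium
print(route(plain))    # http
```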

  5. Set the middleware, concurrency, and browser arguments in seleniumdemo/settings.py.
    seleniumdemo/settings.py
    BOT_NAME = "seleniumdemo"
     
    SPIDER_MODULES = ["seleniumdemo.spiders"]
    NEWSPIDER_MODULE = "seleniumdemo.spiders"
     
    ROBOTSTXT_OBEY = True
     
    CONCURRENT_REQUESTS = 1
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    DOWNLOAD_DELAY = 1
     
    DOWNLOADER_MIDDLEWARES = {
        "seleniumdemo.middlewares.SeleniumDownloaderMiddleware": 800,
    }
     
    SELENIUM_WAIT_SECONDS = 10
    SELENIUM_DRIVER_ARGUMENTS = [
        "--headless",
        "--window-size=1280,900",
    ]
     
    FEED_EXPORT_ENCODING = "utf-8"

    This example uses one shared browser session, so higher concurrency can serialize badly or mix page state unless separate browser instances or a driver pool are added.
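
    One way past the single-session bottleneck is a small pool of drivers checked out per request. This is a hedged, stdlib-only sketch: the strings stand in for real webdriver.Chrome instances, and in the middleware the pool would be filled in spider_opened() and each driver quit() in spider_closed().

```python
import queue

class DriverPool:
    """Fixed-size pool so concurrent requests never share page state."""

    def __init__(self, drivers):
        self._pool = queue.Queue()
        for driver in drivers:
            self._pool.put(driver)

    def acquire(self, timeout=30):
        # Blocks until a driver is free, bounding browser concurrency
        # to the pool size instead of one global session.
        return self._pool.get(timeout=timeout)

    def release(self, driver):
        self._pool.put(driver)

# Stub "drivers" instead of real Chrome sessions for the demonstration.
pool = DriverPool(["driver-a", "driver-b"])
first = pool.acquire()
second = pool.acquire()
pool.release(first)
third = pool.acquire()
print(first, second, third)  # driver-a driver-b driver-a
```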

  6. Create seleniumdemo/spiders/rendered.py with a spider that sends only its start request through Selenium.
    seleniumdemo/spiders/rendered.py
    import scrapy
     
     
    class RenderedSpider(scrapy.Spider):
        name = "rendered"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["https://quotes.toscrape.com/js/"]
     
        async def start(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    dont_filter=True,
                    meta={
                        "selenium": True,
                        "selenium_wait_css": ".quote",
                    },
                )
     
        def parse(self, response):
            for quote in response.css(".quote .text::text").getall()[:3]:
                yield {"quote": quote}

    Scrapy 2.13 and later support async def start() for custom start requests. If the same spider must also run on releases older than 2.13, add a matching start_requests() method as a compatibility path.

  7. Run the spider and export the rendered quotes to JSON.
    $ scrapy crawl rendered -O items.json
    Scraped from <200 https://quotes.toscrape.com/js/>
    {'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    {'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'}
    Stored json feed (3 items) in: items.json

    If the exported feed stays empty, first confirm that the wait selector matches an element in the post-render DOM; if rendering is still unreliable or too slow, replay the page's underlying API request instead of rendering the full browser page.
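
    For quotes.toscrape.com/js/ specifically, the raw, unrendered response already embeds the quote data as a var data = [...] JSON literal inside an inline script tag, so the Selenium path can often be skipped entirely. The regex and the "text" key below assume that embedded shape, and the sample body is a trimmed stand-in for the real page.

```python
import json
import re

def extract_embedded_quotes(html):
    # Pull the JSON array out of the inline <script> and return quote texts.
    match = re.search(r"var data = (\[.*?\]);", html, re.DOTALL)
    if not match:
        return []
    return [item["text"] for item in json.loads(match.group(1))]

# Trimmed stand-in for the raw page body a plain Scrapy request receives.
sample_body = """
<script>
var data = [
    {"author": {"name": "Albert Einstein"}, "text": "A sample quote."}
];
</script>
"""
print(extract_embedded_quotes(sample_body))  # ['A sample quote.']
```

    In a spider this would run in parse() on a normal, non-Selenium response, keeping the crawl as fast as plain HTTP.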

  8. Open the export file to confirm the browser-rendered quotes were written.
    $ cat items.json
    [
    {"quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"},
    {"quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"},
    {"quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"}
    ]