Many modern websites render key content with JavaScript after the initial HTML loads, which can leave a Scrapy spider with empty selectors and missing fields. Scraping the post-rendered DOM enables extraction of the same text and attributes that appear in a normal browser.

Scrapy routes requests and responses through downloader middleware before the spider parses the HTML. The scrapy-splash plugin uses this hook to send selected requests through Splash, a lightweight browser rendering service, and returns the rendered HTML to Scrapy, so standard CSS or XPath selectors work without rewriting parsing logic.

JavaScript rendering is slower and more resource-intensive than fetching raw HTML, so keep rate limits and concurrency conservative and respect each target site's robots.txt and terms. Splash is best for pages that render correctly with its engine; sites that rely on modern Chromium features or aggressive bot protection may require a different renderer.

Steps to scrape a JavaScript-rendered page with Scrapy using Splash:

  1. Start the Splash renderer in a local Docker container.
    $ docker run --platform linux/amd64 --detach --name splash --publish 8050:8050 scrapinghub/splash
    f5e2c7b4507492024e71c134d472201ed64ddf3920e5266d7ed83d15180834fa

  2. Verify the Splash API responds on http://localhost:8050.
    $ curl -s http://localhost:8050/_ping
    {"maxrss": 249331712, "status": "ok"}

  3. Preview the rendered HTML returned by Splash for the target page.
    $ curl -s 'http://localhost:8050/render.html?url=http://app.internal.example:8000/scroll/&wait=2'
    <!DOCTYPE html><html lang="en"><head>
      <meta charset="utf-8">
      <title>Scroll Feed</title>
    </head>
    <body>
      <h1>Scroll Feed</h1>
      <ul id="items"><li>Scroll Item 1</li><li>Scroll Item 2</li><li>Scroll Item 3</li></ul>
    ##### snipped #####

    Use wait to give client-side scripts time to populate the DOM.
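
    The same preview can be scripted from Python. A minimal sketch using the requests package (an assumption; any HTTP client works), passing the same parameters to the render.html endpoint:

    import requests

    # Ask Splash to render the page and return the post-JavaScript HTML.
    html = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "http://app.internal.example:8000/scroll/", "wait": 2},
        timeout=30,
    ).text
    print("<li>" in html)  # crude check that the list items rendered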

  4. Install scrapy-splash in the Scrapy project's Python environment.
    $ python -m pip install scrapy-splash
    Collecting scrapy-splash
      Downloading scrapy_splash-0.11.1-py2.py3-none-any.whl.metadata (35 kB)
    ##### snipped #####
    Successfully installed attrs-25.4.0 automat-25.4.16 certifi-2025.11.12 cffi-2.0.0 charset_normalizer-3.4.4 constantly-23.10.4 cryptography-46.0.3 cssselect-1.3.0 defusedxml-0.7.1 filelock-3.20.1 hyperlink-21.0.0 idna-3.11 incremental-24.11.0 itemadapter-0.13.0 itemloaders-1.3.2 jmespath-1.0.1 lxml-6.0.2 packaging-25.0 parsel-1.10.0 protego-0.5.0 pyasn1-0.6.1 pyasn1-modules-0.4.2 pycparser-2.23 pydispatcher-2.0.7 pyopenssl-25.3.0 queuelib-1.8.0 requests-2.32.5 requests-file-3.0.1 scrapy-2.13.4 scrapy-splash-0.11.1 service-identity-24.2.0 six-1.17.0 tldextract-5.3.1 twisted-25.5.0 typing-extensions-4.15.0 urllib3-2.6.2 w3lib-2.3.1 zope-interface-8.1.1
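
    A quick way to confirm the plugin is installed in the active environment:

    $ python -m pip show scrapy-splash
    Name: scrapy-splash
    Version: 0.11.1
    ##### snipped #####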

  5. Enable the Splash middlewares, duplicate filter, and cache storage in settings.py.
    SPLASH_URL = "http://localhost:8050"
     
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    }
     
    SPIDER_MIDDLEWARES = {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    }
     
    DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
    HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
     
    ROBOTSTXT_OBEY = True

    Set SPLASH_URL to the reachable Splash address, which is not localhost when Scrapy runs in a separate container.
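
    For example, if Scrapy and Splash run as containers on a shared user-defined Docker network and the Splash container keeps the name splash used above (an assumed setup; adjust host and port to match yours), the container name resolves directly:

    SPLASH_URL = "http://splash:8050"  # container name resolves on the shared network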

  6. Create a spider that uses SplashRequest for pages that require rendering.
    import scrapy
    from scrapy_splash import SplashRequest


    class ScrollJsSpider(scrapy.Spider):
        name = "scroll_js"
        start_urls = ["http://app.internal.example:8000/scroll/"]

        def start_requests(self):
            for url in self.start_urls:
                # Route the request through Splash so client-side scripts
                # run before the response reaches parse().
                yield SplashRequest(
                    url=url,
                    callback=self.parse,
                    endpoint="render.html",
                    args={"wait": 2},  # seconds for scripts to populate the DOM
                )

        def parse(self, response):
            # The response body is the rendered DOM, so ordinary selectors work.
            for entry in response.css("#items li"):
                yield {"title": entry.css("::text").get()}

    Keep regular scrapy.Request for non-JavaScript pages to reduce rendering load.
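
    If the feed loads more items as the page scrolls, render.html with a fixed wait only captures the initial batch. A sketch of one common workaround, Splash's execute endpoint with a small Lua script that scrolls before returning the HTML (the scroll count and pauses are assumptions to tune per site):

    scroll_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        splash:wait(args.wait)
        -- scroll a few times so lazily loaded items render
        for _ = 1, 3 do
            splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
            splash:wait(1.0)
        end
        return {html = splash:html()}
    end
    """

    # In start_requests, replace the render.html request with:
    yield SplashRequest(
        url=url,
        callback=self.parse,
        endpoint="execute",
        args={"lua_source": scroll_script, "wait": 2},
    )

    With the default magic responses, scrapy-splash maps the returned html key onto response.body, so the parse method above works unchanged.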

  7. Run the spider with JSON feed export enabled.
    $ scrapy crawl scroll_js -O items.json
    2026-01-01 12:31:46 [scrapy.utils.log] INFO: Scrapy 2.13.4 started (bot: splash_demo)
    2026-01-01 12:31:46 [scrapy.core.engine] INFO: Spider opened
    2026-01-01 12:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/scroll/ via http://localhost:8050/render.html> (referer: None)
    2026-01-01 12:31:49 [scrapy.core.engine] INFO: Closing spider (finished)
    2026-01-01 12:31:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'item_scraped_count': 3}
    ##### snipped #####

    Rendering every request can overwhelm both the renderer and the target site, so reduce concurrency and add delays when crawling beyond a single page.
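
    A conservative starting point in settings.py (the values are assumptions to tune against the target site and the renderer's capacity):

    # Keep rendering load low: few parallel requests, short throttle.
    CONCURRENT_REQUESTS = 4
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
    DOWNLOAD_DELAY = 1.0
    AUTOTHROTTLE_ENABLED = True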

  8. Inspect the exported file to confirm items are present.
    $ python -m json.tool items.json
    [
        {
            "title": "Scroll Item 1"
        },
        {
            "title": "Scroll Item 2"
        },
        {
            "title": "Scroll Item 3"
        }
    ]

  9. Remove the Splash container when rendering is no longer required.
    $ docker rm -f splash
    splash