Many modern websites render key content with JavaScript after the initial HTML loads, which can leave a Scrapy spider with empty selectors and missing fields. Scraping the post-rendered DOM enables extraction of the same text and attributes that appear in a normal browser.
Scrapy processes responses through downloader middleware before the spider parses the HTML. The scrapy-splash middleware routes selected requests through Splash, a lightweight browser rendering service, and returns the rendered HTML to Scrapy, so standard CSS or XPath selectors work without rewriting parsing logic.
JavaScript rendering is slower and more resource-intensive than fetching raw HTML, so rate limiting and concurrency settings should be conservative, and the target site's robots.txt and terms of service should be respected. Splash is best for pages that render correctly with its engine; sites that rely on modern Chromium features or aggressive bot protection may require a different renderer.
Steps to scrape a JavaScript-rendered page with Scrapy using Splash:
- Start the Splash renderer in a local Docker container.
$ docker run --platform linux/amd64 --detach --name splash --publish 8050:8050 scrapinghub/splash
f5e2c7b4507492024e71c134d472201ed64ddf3920e5266d7ed83d15180834fa
- Verify the Splash API responds on http://localhost:8050.
$ curl -s http://localhost:8050/_ping
{"maxrss": 249331712, "status": "ok"}
- Preview the rendered HTML returned by Splash for the target page.
$ curl -s 'http://localhost:8050/render.html?url=http://app.internal.example:8000/scroll/&wait=2&cache=0'
<!DOCTYPE html><html lang="en"><head>
  <meta charset="utf-8">
  <title>Scroll Feed</title>
</head>
<body>
  <h1>Scroll Feed</h1>
  <ul id="items"><li>Scroll Item 1</li><li>Scroll Item 2</li><li>Scroll Item 3</li></ul>
##### snipped #####
Use the wait parameter to give client-side scripts time to populate the DOM.
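When a fixed wait is not enough, for example when items only load after the page scrolls, Splash's execute endpoint accepts a Lua script that controls the browser. A minimal sketch, assuming the same Splash instance and the demo URL above (requests is already installed as a scrapy-splash dependency):

import requests

# Lua script for Splash's execute endpoint: load the page, scroll to the
# bottom to trigger lazy loading, then return the rendered HTML.
LUA_SCROLL = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    splash:wait(args.wait)
    return splash:html()
end
"""

response = requests.post(
    "http://localhost:8050/execute",
    json={
        "lua_source": LUA_SCROLL,
        "url": "http://app.internal.example:8000/scroll/",
        "wait": 2,
    },
)
print(response.text[:300])  # preview of the rendered HTML

The same script can be passed from a spider with SplashRequest(endpoint="execute", args={"lua_source": LUA_SCROLL, "wait": 2}).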
- Install scrapy-splash in the Scrapy project's Python environment.
$ python -m pip install scrapy-splash
Collecting scrapy-splash
  Downloading scrapy_splash-0.11.1-py2.py3-none-any.whl.metadata (35 kB)
##### snipped #####
Successfully installed attrs-25.4.0 automat-25.4.16 certifi-2025.11.12 cffi-2.0.0 charset_normalizer-3.4.4 constantly-23.10.4 cryptography-46.0.3 cssselect-1.3.0 defusedxml-0.7.1 filelock-3.20.1 hyperlink-21.0.0 idna-3.11 incremental-24.11.0 itemadapter-0.13.0 itemloaders-1.3.2 jmespath-1.0.1 lxml-6.0.2 packaging-25.0 parsel-1.10.0 protego-0.5.0 pyasn1-0.6.1 pyasn1-modules-0.4.2 pycparser-2.23 pydispatcher-2.0.7 pyopenssl-25.3.0 queuelib-1.8.0 requests-2.32.5 requests-file-3.0.1 scrapy-2.13.4 scrapy-splash-0.11.1 service-identity-24.2.0 six-1.17.0 tldextract-5.3.1 twisted-25.5.0 typing-extensions-4.15.0 urllib3-2.6.2 w3lib-2.3.1 zope-interface-8.1.1
- Enable the Splash middlewares, cache storage, and duplicate filter in settings.py.
SPLASH_URL = "http://localhost:8050" DOWNLOADER_MIDDLEWARES = { "scrapy_splash.SplashCookiesMiddleware": 723, "scrapy_splash.SplashMiddleware": 725, "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810, } SPIDER_MIDDLEWARES = { "scrapy_splash.SplashDeduplicateArgsMiddleware": 100, } DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter" HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage" ROBOTSTXT_OBEY = True
Set SPLASH_URL to the reachable Splash address, which is not localhost when Scrapy runs in a separate container.
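To keep the setting portable across environments, one option is reading it from an environment variable; a short sketch, assuming a SPLASH_URL variable is set wherever Scrapy runs in a container:

import os

# Fall back to localhost for local runs; override via the environment
# when Scrapy and Splash run in separate containers.
SPLASH_URL = os.environ.get("SPLASH_URL", "http://localhost:8050")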
- Create a spider that uses SplashRequest for pages that require rendering.
import scrapy
from scrapy_splash import SplashRequest


class ScrollJsSpider(scrapy.Spider):
    name = "scroll_js"
    start_urls = ["http://app.internal.example:8000/scroll/"]

    def start_requests(self):
        for url in self.start_urls:
            # Route the request through Splash so the page is rendered
            # before the response reaches the parse callback.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint="render.html",
                args={"wait": 2},
            )

    def parse(self, response):
        # The response body is the post-rendered DOM, so the
        # JavaScript-populated list items are selectable.
        for entry in response.css("#items li"):
            yield {"title": entry.css("::text").get()}
Keep regular scrapy.Request for non-JavaScript pages to reduce rendering load.
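A sketch of that split, assuming a hypothetical layout where the rendered listing links to plain-HTML detail pages (the anchor selector and detail markup are illustrative, not from the demo page):

import scrapy
from scrapy_splash import SplashRequest


class MixedSpider(scrapy.Spider):
    name = "mixed"  # hypothetical spider for illustration

    def start_requests(self):
        # Only the JavaScript-rendered listing goes through Splash.
        yield SplashRequest(
            "http://app.internal.example:8000/scroll/",
            callback=self.parse_listing,
            args={"wait": 2},
        )

    def parse_listing(self, response):
        for href in response.css("#items a::attr(href)").getall():
            # Detail pages are assumed to be plain HTML, so a regular
            # Request skips the renderer entirely.
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"title": response.css("h1::text").get()}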
- Run the spider with JSON feed export enabled.
$ scrapy crawl scroll_js -O items.json
2026-01-01 12:31:46 [scrapy.utils.log] INFO: Scrapy 2.13.4 started (bot: splash_demo)
2026-01-01 12:31:46 [scrapy.core.engine] INFO: Spider opened
2026-01-01 12:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/scroll/ via http://localhost:8050/render.html> (referer: None)
2026-01-01 12:31:49 [scrapy.core.engine] INFO: Closing spider (finished)
2026-01-01 12:31:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_status_count/200': 2,
 'item_scraped_count': 3,
 'finish_reason': 'finished'}
##### snipped #####
Rendering every request can overwhelm both the renderer and the target site, so reduce concurrency and add delays when crawling beyond a single page.
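A conservative starting point in settings.py, as a sketch (exact values depend on the target site and the renderer's capacity):

# Few parallel requests, a delay between them, and AutoThrottle to back
# off automatically when responses slow down.
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0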
- Inspect the exported file to confirm items are present.
$ python -m json.tool items.json
[
    {
        "title": "Scroll Item 1"
    },
    {
        "title": "Scroll Item 2"
    },
    {
        "title": "Scroll Item 3"
    }
]
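For larger crawls a scripted check is more reliable than eyeballing the file; a small sketch that fails if any exported item is missing its title:

import json

with open("items.json") as fh:
    items = json.load(fh)

# Catch pages where the renderer returned an empty DOM.
assert items and all(item.get("title") for item in items), "missing titles"
print(f"{len(items)} items exported")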
- Remove the Splash container when rendering is no longer required.
$ docker rm -f splash
splash
