Infinite scrolling pages usually expose only the first batch of records in the initial HTML response. Replaying the follow-up requests is the reliable way to collect full product lists, timelines, or search results instead of silently missing everything loaded after the first scroll.
Most sites implement infinite scroll by calling a JSON or HTML fragment endpoint through background Fetch or XHR requests. Scrapy works best when it targets that endpoint directly, reproduces the same request shape, parses the returned payload with response.json() or selectors, and keeps requesting the next batch until the API stops returning a cursor, page number, or offset.
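The crawl loop described above can be sketched in plain Python against a faked endpoint. The payload keys (`items`, `next_cursor`) and the two-batch fake API are assumptions for illustration; the real keys come from inspecting the endpoint in the browser:

```python
# Stand-in for the network layer: a fake paginated endpoint keyed by
# cursor, mimicking a hypothetical scrolling feed API.
FAKE_API = {
    0: {"items": [{"id": 101}], "next_cursor": 1},
    1: {"items": [{"id": 102}], "next_cursor": None},
}


def fetch(cursor):
    """Pretend to request one batch; a spider would issue HTTP here."""
    return FAKE_API[cursor]


def crawl(cursor=0):
    collected = []
    while True:
        payload = fetch(cursor)
        collected.extend(payload.get("items", []))
        cursor = payload.get("next_cursor")
        if cursor is None:  # no cursor means the last batch was reached
            return collected


print(crawl())  # → [{'id': 101}, {'id': 102}]
```

Scrapy expresses the same loop asynchronously by yielding a follow-up Request from the callback instead of looping in place.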
Current Scrapy releases use start() for initial requests, and Request.from_curl() is a practical way to turn a copied browser request into a working spider. If the endpoint depends on short-lived browser tokens, heavy fingerprinting, or rendered DOM state that cannot be reproduced as HTTP requests, move to a browser-rendering workflow instead of forcing a pure Scrapy crawl.
Steps to scrape an infinite scrolling page with Scrapy:
- Open the target page in a web browser.
- Open the browser developer tools and select the Network tab.
- Select the Fetch/XHR filter.
- Scroll until the page loads another batch of records.
- Select the request that returns the next batch of items.
- Inspect the request URL, query string, request body, and headers.
Look for a page number, offset, cursor, after token, or a JSON body field that changes on each batch.
- Inspect the response payload and note the keys that hold the items and the next pagination value.
Common keys include items, results, entries, next, and next_cursor.
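A payload from such an endpoint often looks like the hypothetical body below, where `items` holds the records and `next_cursor` points at the next batch; the key names are assumptions to match against what you actually see in the Network tab:

```python
import json

# Hypothetical response body from a scrolling feed endpoint.
body = """
{
  "items": [
    {"id": 101, "title": "First batch item"},
    {"id": 102, "title": "Second batch item"}
  ],
  "next_cursor": "2"
}
"""

payload = json.loads(body)
for entry in payload.get("items", []):
    print(entry["id"], entry["title"])

# A null or missing next_cursor signals the last batch.
print(payload.get("next_cursor"))  # → 2
```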
- Copy the request via Copy → Copy as cURL (bash).
Related: How to set request headers in Scrapy
Related: How to use cookies in Scrapy
- Create a new Scrapy project.
$ scrapy startproject scrollfeed
New Scrapy project 'scrollfeed', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /home/user/scrollfeed

You can start your first spider with:
    cd /home/user/scrollfeed
    scrapy genspider example example.com
- Change into the new project directory.
$ cd scrollfeed
- Generate a spider for the API host used by the scrolling request.
$ scrapy genspider feed api.example.net
Created spider 'feed' using template 'basic' in module:
  scrollfeed.spiders.feed
- Edit scrollfeed/spiders/feed.py to replay the copied request and keep following the returned cursor.
- scrollfeed/spiders/feed.py
import scrapy


class FeedSpider(scrapy.Spider):
    name = "feed"
    allowed_domains = ["api.example.net"]
    curl_command = (
        "curl 'https://api.example.net/feed?cursor=0'"
        " -H 'Accept: application/json'"
        " -H 'Referer: https://www.example.net/feed'"
    )
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 0.25,
        "AUTOTHROTTLE_MAX_DELAY": 10.0,
        "DOWNLOAD_DELAY": 0.25,
    }

    async def start(self):
        yield scrapy.Request.from_curl(self.curl_command, callback=self.parse_feed)

    def parse_feed(self, response):
        payload = response.json()
        for entry in payload.get("items", []):
            yield {
                "id": entry.get("id"),
                "title": entry.get("title"),
            }
        next_cursor = payload.get("next_cursor")
        if next_cursor is None:
            return
        yield response.request.replace(
            url=f"https://api.example.net/feed?cursor={next_cursor}",
            callback=self.parse_feed,
        )
Replace the copied cURL command, the allowed_domains entry, and the items/next_cursor keys to match the real endpoint. If the next token lives in a POST body instead of the URL, keep using response.request.replace() and pass an updated body= value. For spiders that must also run on Scrapy releases older than 2.13, add a matching start_requests() method for compatibility.
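When the cursor travels in a JSON POST body, the replacement value can be spliced into the previous body before passing it to response.request.replace(body=...). A minimal sketch of that splice, assuming hypothetical cursor and page_size fields:

```python
import json


def next_request_body(previous_body: bytes, next_cursor) -> bytes:
    """Build the POST body for the next batch, keeping every other
    field from the previous request intact (field names are assumed)."""
    data = json.loads(previous_body)
    data["cursor"] = next_cursor
    return json.dumps(data).encode()


body = next_request_body(b'{"cursor": 0, "page_size": 20}', 2)
print(body)  # → b'{"cursor": 2, "page_size": 20}'
```

Inside parse_feed() the call then becomes response.request.replace(body=next_request_body(response.request.body, next_cursor), callback=self.parse_feed).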
- Run the spider and export the collected items to JSON.
$ scrapy crawl feed -O items.json
2026-04-16 05:32:46 [scrapy.core.engine] INFO: Spider opened
2026-04-16 05:32:47 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: items.json
2026-04-16 05:32:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished', 'item_scraped_count': 4}
2026-04-16 05:32:47 [scrapy.core.engine] INFO: Spider closed (finished)
Scrolling endpoints are often rate limited and can still be covered by robots.txt rules, so keep delays reasonable and confirm the target site's crawling policy before scaling the spider up.
- Review the exported file to confirm later batches were written to the feed output.
$ cat items.json
[
  {"id": 101, "title": "First batch item"},
  {"id": 102, "title": "Second batch item"},
  {"id": 103, "title": "Third batch item"},
  {"id": 104, "title": "Fourth batch item"}
]
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
