Infinite scrolling pages hide most content behind on-demand loading, so a single HTML response rarely contains the full dataset. Capturing every batch matters for building reliable archives, monitoring listings, or analyzing timelines without missing late-loaded entries.

Most infinite scroll implementations render an initial view and fetch additional batches via background Fetch or XHR requests. Those requests typically call a paginated endpoint that accepts an offset, page number, or cursor token and returns both the next batch of items and a token or value for the following request.
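
For example, a cursor-paginated endpoint might behave like this (the path, parameter names, and field names here are illustrative, not taken from a real API): the page requests something like /api/scroll?limit=50&cursor=abc123, and the response carries both the batch and the token for the following call:

    {"items": [{"id": 101, "title": "First entry"}], "next_cursor": "def456"}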

Scrapy is most effective when it targets the batch endpoint directly, extracts items from each response, and schedules the next request from the returned pagination value. Pages that depend on full JavaScript execution, complex fingerprinting, or interactive scrolling may instead need a browser-rendering workflow rather than pure HTTP crawling.

Steps to scrape an infinite scrolling page with Scrapy:

  1. Open the target page that uses infinite scrolling.
  2. Open the browser developer tools.
  3. Select the Network tab.
  4. Enable the Fetch/XHR filter.
  5. Scroll the page until a new batch request appears.
  6. Select the request that returns the next batch of items.
  7. Copy the request via Copy > Copy as cURL (bash).

    The copied cURL command preserves the exact endpoint, query parameters, and any required headers.

  8. Identify the pagination parameter name in the request URL.

    Common parameter names include page, offset, start, cursor, after, and next.
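
    For example, one hypothetical endpoint might advance with ?offset=100&limit=50 while another uses ?cursor=abc123&limit=50; the parameter whose value changes between successive batch requests is the one the spider needs to drive.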

  9. Locate the next cursor or offset value in the response payload.

    Cursor-style APIs usually return a next_cursor/next token, while offset-style APIs often return a total or the next offset value.
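
    Hypothetical examples of each shape:

      {"items": [{"id": 7, "title": "Entry"}], "next_cursor": "abc123"}
      {"items": [{"id": 7, "title": "Entry"}], "total": 1200}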

  10. Create a new Scrapy project.
    $ scrapy startproject scrollfeed
    New Scrapy project 'scrollfeed', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
        /root/sg-work/scrollfeed
  11. Change into the new project directory.
    $ cd scrollfeed
  12. Generate a spider for the target domain.
    $ scrapy genspider feed api.example.net
    Created spider 'feed' using template 'basic' in module:
      scrollfeed.spiders.feed
  13. Edit scrollfeed/spiders/feed.py to paginate the scrolling endpoint.
    scrollfeed/spiders/feed.py
    import json
    from urllib.parse import urlencode
     
    import scrapy
     
     
    class FeedSpider(scrapy.Spider):
        name = "feed"
        allowed_domains = ["api.example.net"]
        api_url = "http://api.example.net:8000/api/scroll"
        page_size = 50
     
        custom_settings = {
            "AUTOTHROTTLE_ENABLED": True,
            "AUTOTHROTTLE_START_DELAY": 0.25,
            "AUTOTHROTTLE_MAX_DELAY": 10.0,
            "DOWNLOAD_DELAY": 0.25,
            "ROBOTSTXT_OBEY": True,
        }
     
        def start_requests(self):
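            # First request sends only the limit; with no cursor yet, the endpoint serves the initial batch.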
            params = {"limit": self.page_size}
            url = f"{self.api_url}?{urlencode(params)}"
            yield scrapy.Request(url=url, callback=self.parse)
     
        def parse(self, response):
            payload = json.loads(response.text)
     
            for entry in payload.get("items", []):
                yield {
                    "id": entry.get("id"),
                    "title": entry.get("title"),
                }
     
            next_cursor = payload.get("next_cursor")
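            # Stop when the payload carries no continuation token; otherwise schedule the next batch.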
            if not next_cursor:
                return
     
            params = {"limit": self.page_size, "cursor": next_cursor}
            next_url = f"{self.api_url}?{urlencode(params)}"
            yield scrapy.Request(url=next_url, callback=self.parse)

    Update api_url, the cursor parameter name, and the items/next_cursor keys to match the copied request and response.
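
    If the endpoint paginates by offset rather than a cursor, the same structure still works. A minimal sketch, assuming hypothetical offset/limit parameters and a total field in the response, that would replace start_requests and parse in the spider above (page_request is a helper introduced for the sketch):

        def start_requests(self):
            yield self.page_request(offset=0)

        def page_request(self, offset):
            # Build one batch request and pass the offset along to the callback.
            params = {"limit": self.page_size, "offset": offset}
            url = f"{self.api_url}?{urlencode(params)}"
            return scrapy.Request(url=url, callback=self.parse, cb_kwargs={"offset": offset})

        def parse(self, response, offset):
            payload = json.loads(response.text)

            for entry in payload.get("items", []):
                yield {"id": entry.get("id"), "title": entry.get("title")}

            # Advance by one page and stop once the reported total is reached.
            next_offset = offset + self.page_size
            if next_offset < payload.get("total", 0):
                yield self.page_request(offset=next_offset)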

  14. Run the spider with JSON output.
    $ scrapy crawl feed -O items.json
    2026-01-01 09:50:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'item_scraped_count': 6,
     'finish_reason': 'finished'}
    2026-01-01 09:50:39 [scrapy.core.engine] INFO: Spider closed (finished)

    High request rates can trigger rate limiting or IP bans, especially on cursor endpoints that are easy to replay.
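
    If the endpoint starts answering slowly or with errors, the spider's custom_settings can be tightened; one illustrative, more conservative configuration:

        custom_settings = {
            "AUTOTHROTTLE_ENABLED": True,
            "AUTOTHROTTLE_START_DELAY": 1.0,
            "AUTOTHROTTLE_MAX_DELAY": 30.0,
            "DOWNLOAD_DELAY": 1.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
            "ROBOTSTXT_OBEY": True,
        }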

  15. Count the scraped items in the output file.
    $ python3 -c 'import json; print(len(json.load(open("items.json", encoding="utf-8"))))'
    6