Infinite scrolling pages hide most content behind on-demand loading, so a single HTML response rarely contains the full dataset. Capturing every batch matters for building reliable archives, monitoring listings, or analyzing timelines without missing late-loaded entries.

Most infinite scroll implementations render an initial view and fetch additional batches via background Fetch or XHR requests. Those requests typically call a paginated endpoint that accepts an offset, page number, or cursor token and returns both the next batch of items and a token or value for the following request.
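
For example, a cursor-paginated endpoint might behave like this (the path, parameter names, and field names here are illustrative, not taken from a real API): the page requests something like /api/scroll?limit=50&cursor=abc123, and the response carries both the batch and the token for the following call:

    {"items": [{"id": 101, "title": "First entry"}], "next_cursor": "def456"}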

Scrapy is most effective when it targets the batch endpoint directly, extracts items from each response, and schedules the next request from the returned pagination value. Pages that depend on full JavaScript execution, complex fingerprinting, or interactive scrolling may instead need a browser-rendering workflow rather than pure HTTP crawling.

Steps to scrape an infinite scrolling page with Scrapy:

  1. Open the target page that uses infinite scrolling.
  2. Open the browser developer tools.
  3. Select the Network tab.
  4. Enable the Fetch/XHR filter.
  5. Scroll the page until a new batch request appears.
  6. Select the request that returns the next batch of items.
  7. Copy the request via Copy > Copy as cURL (bash).

    The copied cURL command preserves the exact endpoint, query parameters, and any required headers.

  8. Identify the pagination parameter name in the request URL.

    Common parameter names include page, offset, start, cursor, after, and next.
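
    For example, one hypothetical endpoint might advance with ?offset=100&limit=50 while another uses ?cursor=abc123&limit=50; the parameter whose value changes between successive batch requests is the one the spider needs to drive.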

  9. Locate the next cursor or offset value in the response payload.

    Cursor-style APIs usually return a next_cursor/next token, while offset-style APIs often return a total or the next offset value.
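
    Hypothetical examples of each shape:

      {"items": [{"id": 7, "title": "Entry"}], "next_cursor": "abc123"}
      {"items": [{"id": 7, "title": "Entry"}], "total": 1200}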

  10. Create a new Scrapy project.
    $ scrapy startproject scrollfeed
    New Scrapy project 'scrollfeed', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
        /root/sg-work/scrollfeed
  11. Change into the new project directory.
    $ cd scrollfeed
  12. Generate a spider for the target domain.
    $ scrapy genspider feed api.example.net
    Created spider 'feed' using template 'basic' in module:
      scrollfeed.spiders.feed
  13. Edit scrollfeed/spiders/feed.py to paginate the scrolling endpoint.
    scrollfeed/spiders/feed.py
    import json
    from urllib.parse import urlencode
     
    import scrapy
     
     
    class FeedSpider(scrapy.Spider):
        name = "feed"
        allowed_domains = ["api.example.net"]
        api_url = "http://api.example.net:8000/api/scroll"
        page_size = 50
     
        custom_settings = {
            "AUTOTHROTTLE_ENABLED": True,
            "AUTOTHROTTLE_START_DELAY": 0.25,
            "AUTOTHROTTLE_MAX_DELAY": 10.0,
            "DOWNLOAD_DELAY": 0.25,
            "ROBOTSTXT_OBEY": True,
        }
     
        def start_requests(self):
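            # First request sends only the limit; with no cursor yet, the endpoint serves the initial batch.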
            params = {"limit": self.page_size}
            url = f"{self.api_url}?{urlencode(params)}"
            yield scrapy.Request(url=url, callback=self.parse)
     
        def parse(self, response):
            payload = json.loads(response.text)
     
            for entry in payload.get("items", []):
                yield {
                    "id": entry.get("id"),
                    "title": entry.get("title"),
                }
     
            next_cursor = payload.get("next_cursor")
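            # Stop when the payload carries no continuation token; otherwise schedule the next batch.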
            if not next_cursor:
                return
     
            params = {"limit": self.page_size, "cursor": next_cursor}
            next_url = f"{self.api_url}?{urlencode(params)}"
            yield scrapy.Request(url=next_url, callback=self.parse)

    Update api_url, the cursor parameter name, and the items/next_cursor keys to match the copied request and response.
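
    If the endpoint paginates by offset rather than a cursor, the same structure still works. A minimal sketch, assuming hypothetical offset/limit parameters and a total field in the response, that would replace start_requests and parse in the spider above (page_request is a helper introduced for the sketch):

        def start_requests(self):
            yield self.page_request(offset=0)

        def page_request(self, offset):
            # Build one batch request and pass the offset along to the callback.
            params = {"limit": self.page_size, "offset": offset}
            url = f"{self.api_url}?{urlencode(params)}"
            return scrapy.Request(url=url, callback=self.parse, cb_kwargs={"offset": offset})

        def parse(self, response, offset):
            payload = json.loads(response.text)

            for entry in payload.get("items", []):
                yield {"id": entry.get("id"), "title": entry.get("title")}

            # Advance by one page and stop once the reported total is reached.
            next_offset = offset + self.page_size
            if next_offset < payload.get("total", 0):
                yield self.page_request(offset=next_offset)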

  14. Run the spider with JSON output.
    $ scrapy crawl feed -O items.json
    2026-01-01 09:50:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'item_scraped_count': 6,
     'finish_reason': 'finished'}
    2026-01-01 09:50:39 [scrapy.core.engine] INFO: Spider closed (finished)

    High request rates can trigger rate limiting or IP bans, especially on cursor endpoints that are easy to replay.
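
    If the endpoint starts answering slowly or with errors, the spider's custom_settings can be tightened; one illustrative, more conservative configuration:

        custom_settings = {
            "AUTOTHROTTLE_ENABLED": True,
            "AUTOTHROTTLE_START_DELAY": 1.0,
            "AUTOTHROTTLE_MAX_DELAY": 30.0,
            "DOWNLOAD_DELAY": 1.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
            "ROBOTSTXT_OBEY": True,
        }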

  15. Count the scraped items in the output file.
    $ python3 -c 'import json; print(len(json.load(open("items.json", encoding="utf-8"))))'
    6