Pagination hides most records behind page numbers or a Next link, so a spider that only parses the first response produces incomplete datasets and quietly breaks downstream reporting. Correct pagination handling is the difference between “sampled a page” and “collected the catalogue”.

Scrapy crawls pages by parsing each Response in a callback (usually parse()), yielding scraped items and scheduling follow-up Request objects. Pagination becomes reliable when the spider consistently extracts the next-page URL from the current response and queues it with response.follow until no next page exists.
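
As a minimal sketch of that loop (field extraction omitted; the step-by-step walkthrough below builds the complete spider in step 7), the callback yields whatever it scrapes from the current page and then follows the next-page link:

    import scrapy


    class PaginatedSpider(scrapy.Spider):
        # Illustrative name; the selectors and URL match the demo site used
        # in the walkthrough below.
        name = "paginated"
        start_urls = ["http://app.internal.example:8000/products/"]

        def parse(self, response):
            # ... yield items scraped from the current page here ...

            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                # response.follow resolves a relative href against the current
                # page URL and schedules it with the same callback.
                yield response.follow(next_page, callback=self.parse)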

Pagination can loop (repeating URLs, “next” links that point back to earlier pages, calendar-style navigation), and it can be generated by JavaScript (infinite scroll, “Load more” buttons), which Scrapy does not execute because it is not a browser. Crawl politely to reduce blocks and CAPTCHAs, and constrain scope with domain restrictions, URL patterns, or depth limits when the site structure is messy; one way to set such limits is sketched below.
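
A sketch of such guard rails in settings.py, where DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT are standard Scrapy settings and the values are placeholders rather than recommendations:

    # settings.py: guard rails for a loop-prone or messy site (illustrative values).
    DEPTH_LIMIT = 20              # stop following links deeper than 20 hops from the start URLs
    CLOSESPIDER_PAGECOUNT = 500   # close the spider after 500 responses have been crawled

Scrapy’s default duplicate filter already drops requests for URLs seen earlier in the same run, which breaks simple pagination loops; only requests made with dont_filter=True bypass it.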

Steps to scrape paginated pages with Scrapy:

  1. Create a new Scrapy project.
    $ scrapy startproject pagination_demo
    New Scrapy project 'pagination_demo', using template directory '##### snipped #####', created in:
        /root/sg-work/pagination_demo
    
    You can start your first spider with:
        cd pagination_demo
        scrapy genspider example example.com
  2. Change to the new project directory.
    $ cd pagination_demo
  3. Generate a basic spider for the target domain.
    $ scrapy genspider listing app.internal.example
    Created spider 'listing' using template 'basic' in module:
      pagination_demo.spiders.listing
  4. Open a Scrapy shell session for the first listing page URL.
    $ scrapy shell 'http://app.internal.example:8000/products/'
    ##### snipped #####
    [s] Available Scrapy objects:
    [s]   response   <200 http://app.internal.example:8000/products/>
    ##### snipped #####
  5. Identify the pagination selector by extracting the next-page URL in the shell.
    >>> next_page = response.css('a.next::attr(href)').get()
    >>> next_page
    '/products?page=2'
    >>> response.urljoin(next_page)
    'http://app.internal.example:8000/products?page=2'
  6. Define a structured item in items.py for the fields to export.
    import scrapy
     
     
    class ListingItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()
  7. Update the spider to crawl pagination recursively.
    import scrapy
     
    from ..items import ListingItem
     
     
    class ListingSpider(scrapy.Spider):
        name = "listing"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/products/"]
     
        def parse(self, response):
            for card in response.css("article.product"):
                href = card.css("a.detail::attr(href)").get()
     
                item = ListingItem()
                item["title"] = card.css("h2::text").get(default="").strip()
                item["price"] = card.css("span.price::text").get(default="").strip()
                item["url"] = response.urljoin(href) if href else ""
     
                yield item
     
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    Replace article.product, a.detail, h2::text, span.price, a.next, and the other selectors with the site’s actual markup as confirmed in scrapy shell.

  8. Set crawl throttling options in settings.py for the target site.
    ROBOTSTXT_OBEY = True
    DOWNLOAD_DELAY = 1.0
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 0.5
    AUTOTHROTTLE_MAX_DELAY = 10.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 4

    Over-aggressive crawling can trigger rate limits, IP blocks, or CAPTCHAs, causing partial datasets and unstable runs.

  9. Run the spider with feed export enabled.
    $ scrapy crawl listing -O listing.json
    ##### snipped #####
    Stored json feed (6 items) in: listing.json

    Option -O overwrites the output file on each run; the lowercase -o option appends to an existing file instead, which can leave a JSON feed invalid across runs.

  10. Verify pagination reached multiple pages by checking the crawl statistics.
    ##### snipped #####
    INFO: Dumping Scrapy stats:
    {'downloader/request_count': 4,
     'downloader/response_count': 4,
     'item_scraped_count': 6,
     'response_received_count': 4}
    INFO: Spider closed (finished)
  11. Confirm the exported JSON contains the expected number of records (a slightly fuller check is sketched after these steps).
    $ python -c "import json; print(len(json.load(open('listing.json', 'r', encoding='utf-8'))))"
    6
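
As an optional extra check beyond step 11, a few lines of Python can confirm that the exported records are unique as well as counted; this is a sketch that assumes the listing.json feed written in step 9 and the url field defined in step 6:

    import json

    # Load the feed written by `scrapy crawl listing -O listing.json`.
    with open("listing.json", encoding="utf-8") as fh:
        records = json.load(fh)

    urls = [record["url"] for record in records]

    print(f"records: {len(records)}")
    print(f"unique urls: {len(set(urls))}")

    # Duplicate detail URLs usually mean the next-page selector led back to a
    # page that was already scraped.
    assert len(urls) == len(set(urls)), "duplicate detail URLs in the export"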