Pagination hides most records behind page numbers or a Next link, so a spider that only parses the first response produces incomplete datasets and quietly breaks downstream reporting. Correct pagination handling is the difference between “sampled a page” and “collected the catalogue”.

Scrapy crawls pages by parsing each Response in a callback (usually parse()), yielding scraped items and scheduling follow-up Request objects. Pagination becomes reliable when the spider consistently extracts the next-page URL from the current response and queues it with response.follow until no next page exists.
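
As a minimal sketch of that loop (field extraction omitted; the step-by-step walkthrough below builds the complete spider in step 7), the callback yields whatever it scrapes from the current page and then follows the next-page link:

    import scrapy


    class PaginatedSpider(scrapy.Spider):
        # Illustrative name; the selectors and URL match the demo site used
        # in the walkthrough below.
        name = "paginated"
        start_urls = ["http://app.internal.example:8000/products/"]

        def parse(self, response):
            # ... yield items scraped from the current page here ...

            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                # response.follow resolves a relative href against the current
                # page URL and schedules it with the same callback.
                yield response.follow(next_page, callback=self.parse)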

Pagination can loop (repeating URLs, “next” links that point back to earlier pages, calendar-style navigation), and it can be generated by JavaScript (infinite scroll, “Load more” buttons), which Scrapy does not execute because it is not a browser. Crawl politely to reduce blocks and CAPTCHAs, and constrain scope with domain restrictions, URL patterns, or depth limits when the site structure is messy; one way to set such limits is sketched below.
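
A sketch of such guard rails in settings.py, where DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT are standard Scrapy settings and the values are placeholders rather than recommendations:

    # settings.py: guard rails for a loop-prone or messy site (illustrative values).
    DEPTH_LIMIT = 20              # stop following links deeper than 20 hops from the start URLs
    CLOSESPIDER_PAGECOUNT = 500   # close the spider after 500 responses have been crawled

Scrapy’s default duplicate filter already drops requests for URLs seen earlier in the same run, which breaks simple pagination loops; only requests made with dont_filter=True bypass it.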

Steps to scrape paginated pages with Scrapy:

  1. Create a new Scrapy project.
    $ scrapy startproject pagination_demo
    New Scrapy project 'pagination_demo', using template directory '##### snipped #####', created in:
        /root/sg-work/pagination_demo
    
    You can start your first spider with:
        cd pagination_demo
        scrapy genspider example example.com
  2. Change to the new project directory.
    $ cd pagination_demo
  3. Generate a basic spider for the target domain.
    $ scrapy genspider listing app.internal.example
    Created spider 'listing' using template 'basic' in module:
      pagination_demo.spiders.listing
  4. Open a Scrapy shell session for the first listing page URL.
    $ scrapy shell 'http://app.internal.example:8000/products/'
    ##### snipped #####
    [s] Available Scrapy objects:
    [s]   response   <200 http://app.internal.example:8000/products/>
    ##### snipped #####
  5. Identify the pagination selector by extracting the next-page URL in the shell.
    >>> next_page = response.css('a.next::attr(href)').get()
    >>> next_page
    '/products?page=2'
    >>> response.urljoin(next_page)
    'http://app.internal.example:8000/products?page=2'
  6. Define a structured item in items.py for the fields to export.
    import scrapy
     
     
    class ListingItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()
  7. Update the spider to crawl pagination recursively.
    import scrapy
     
    from ..items import ListingItem
     
     
    class ListingSpider(scrapy.Spider):
        name = "listing"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/products/"]
     
        def parse(self, response):
            for card in response.css("article.product"):
                href = card.css("a.detail::attr(href)").get()
     
                item = ListingItem()
                item["title"] = card.css("h2::text").get(default="").strip()
                item["price"] = card.css("span.price::text").get(default="").strip()
                item["url"] = response.urljoin(href) if href else ""
     
                yield item
     
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    Replace article.product, a.detail, h2::text, span.price, a.next, and the other selectors with the site’s actual markup as confirmed in scrapy shell.

  8. Set crawl throttling options in settings.py for the target site.
    ROBOTSTXT_OBEY = True
    DOWNLOAD_DELAY = 1.0
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 0.5
    AUTOTHROTTLE_MAX_DELAY = 10.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 4

    Over-aggressive crawling can trigger rate limits, IP blocks, or CAPTCHAs, causing partial datasets and unstable runs.

  9. Run the spider with feed export enabled.
    $ scrapy crawl listing -O listing.json
    ##### snipped #####
    Stored json feed (6 items) in: listing.json

    Option -O overwrites the output file on each run; the lowercase -o option appends to an existing file instead, which can leave a JSON feed invalid across runs.

  10. Verify pagination reached multiple pages by checking the crawl statistics.
    ##### snipped #####
    INFO: Dumping Scrapy stats:
    {'downloader/request_count': 4,
     'downloader/response_count': 4,
     'item_scraped_count': 6,
     'response_received_count': 4}
    INFO: Spider closed (finished)
  11. Confirm the exported JSON contains the expected number of records (a slightly fuller check is sketched after these steps).
    $ python -c "import json; print(len(json.load(open('listing.json', 'r', encoding='utf-8'))))"
    6
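
As an optional extra check beyond step 11, a few lines of Python can confirm that the exported records are unique as well as counted; this is a sketch that assumes the listing.json feed written in step 9 and the url field defined in step 6:

    import json

    # Load the feed written by `scrapy crawl listing -O listing.json`.
    with open("listing.json", encoding="utf-8") as fh:
        records = json.load(fh)

    urls = [record["url"] for record in records]

    print(f"records: {len(records)}")
    print(f"unique urls: {len(set(urls))}")

    # Duplicate detail URLs usually mean the next-page selector led back to a
    # page that was already scraped.
    assert len(urls) == len(set(urls)), "duplicate detail URLs in the export"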