Pagination hides most records behind page numbers or a Next link, so a spider that only parses the first response produces incomplete datasets and quietly breaks downstream reporting. Correct pagination handling is the difference between “sampled a page” and “collected the catalogue”.
Scrapy crawls pages by parsing each Response in a callback (usually parse()), yielding scraped items and scheduling follow-up Request objects. Pagination becomes reliable when the spider consistently extracts the next-page URL from the current response and queues it with response.follow until no next page exists.
Pagination can loop (repeating URLs, “next” links that point back, calendar-like navigation) and can also be generated by JavaScript (infinite scroll, “Load more”), which Scrapy does not render in the browser sense. Crawl politely to reduce blocks and CAPTCHAs, and constrain scope with domain restrictions, URL patterns, or depth limits when the site structure is messy.
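The cycle problem can be illustrated without any framework. The sketch below uses an invented page map (`PAGES`, `follow_pagination` are illustrative names, not Scrapy APIs) in which the last page's "next" link points back to page 1; a visited-set guard keeps the traversal finite. Scrapy's built-in duplicate-request filter applies the same idea automatically, dropping requests for URLs it has already scheduled.

```python
# Hypothetical page graph: each URL maps to its "next" link.
# The final entry points back to page 1, simulating a pagination loop.
PAGES = {
    "/products?page=1": "/products?page=2",
    "/products?page=2": "/products?page=3",
    "/products?page=3": "/products?page=1",  # broken "next" link loops back
}

def follow_pagination(start):
    """Follow next links, refusing to revisit a URL (cycle guard)."""
    seen, order = set(), []
    url = start
    while url is not None and url not in seen:
        seen.add(url)
        order.append(url)
        url = PAGES.get(url)
    return order

# Visits each page exactly once, then stops at the repeated URL.
print(follow_pagination("/products?page=1"))
```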
Related: How to use CrawlSpider in Scrapy
Related: How to set a crawl depth limit in Scrapy
$ scrapy startproject pagination_demo
New Scrapy project 'pagination_demo', using template directory '##### snipped #####', created in:
/root/sg-work/pagination_demo
You can start your first spider with:
cd pagination_demo
scrapy genspider example example.com
$ cd pagination_demo
$ scrapy genspider listing app.internal.example
Created spider 'listing' using template 'basic' in module:
  pagination_demo.spiders.listing
$ scrapy shell 'http://app.internal.example:8000/products/'
##### snipped #####
[s] Available Scrapy objects:
[s]   response   <200 http://app.internal.example:8000/products/>
##### snipped #####
>>> next_page = response.css('a.next::attr(href)').get()
>>> next_page
'/products?page=2'
>>> response.urljoin(next_page)
'http://app.internal.example:8000/products?page=2'
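Under the hood, `response.urljoin()` resolves the href against the response's own URL using the standard relative-reference rules, the same ones the stdlib's `urllib.parse.urljoin` implements. A quick sketch of those rules (the base URL here matches the example above):

```python
from urllib.parse import urljoin

base = "http://app.internal.example:8000/products/"

# A root-relative href replaces the entire path:
print(urljoin(base, "/products?page=2"))
# -> http://app.internal.example:8000/products?page=2

# A bare query string keeps the base path and swaps the query:
print(urljoin(base, "?page=2"))
# -> http://app.internal.example:8000/products/?page=2
```

This is why extracting `href` values verbatim and joining them against the current response is safer than string concatenation: trailing slashes and root-relative paths are handled for you.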
import scrapy


class ListingItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
import scrapy

from ..items import ListingItem


class ListingSpider(scrapy.Spider):
    name = "listing"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/products/"]

    def parse(self, response):
        for card in response.css("article.product"):
            href = card.css("a.detail::attr(href)").get()
            item = ListingItem()
            item["title"] = card.css("h2::text").get(default="").strip()
            item["price"] = card.css("span.price::text").get(default="").strip()
            item["url"] = response.urljoin(href) if href else ""
            yield item

        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Replace article.product, a.detail, h2::text, span.price, a.next, and the other selectors with the site’s actual markup, verified in scrapy shell first.
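Some sites expose no next link at all, only a page query parameter. A common fallback (not used in the spider above) is to increment that parameter and stop scheduling once a page yields zero items. The URL arithmetic is pure stdlib; `next_page_url` is a hypothetical helper name chosen for this sketch:

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def next_page_url(url, param="page"):
    """Return url with its page query parameter incremented by one.

    A missing parameter is treated as page 1, so the first call
    yields ?page=2.
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get(param, ["1"])[0])
    query[param] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(next_page_url("http://app.internal.example:8000/products?page=2"))
# -> http://app.internal.example:8000/products?page=3
```

In a spider, pair this with a termination check (for example, count the product cards on the page and stop yielding the next request when the count is zero); otherwise the crawl increments the page number forever.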
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4
Over-aggressive crawling can trigger rate limits, IP blocks, or CAPTCHAs, causing partial datasets and unstable runs.
$ scrapy crawl listing -O listing.json
##### snipped #####
Stored json feed (6 items) in: listing.json
The -O option overwrites the output file on each run; lowercase -o appends to an existing file instead, which can silently duplicate items across runs.
##### snipped #####
INFO: Dumping Scrapy stats:
{'downloader/request_count': 4,
'downloader/response_count': 4,
'item_scraped_count': 6,
'response_received_count': 4}
INFO: Spider closed (finished)
$ python -c "import json; print(len(json.load(open('listing.json', 'r', encoding='utf-8'))))"
6