Pagination is where many otherwise-correct spiders quietly fail. A spider that only parses the first category page usually looks healthy, but the exported data stays incomplete because the remaining records sit behind numbered pages or a Next link.
Scrapy handles HTML pagination by extracting the next page URL from the current Response and yielding another Request from the same callback. The crawl continues until the selector stops returning a new link, while the same callback keeps scraping items from each page that arrives.
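The control flow of that loop can be sketched in plain Python, without Scrapy, as a walk over a stand-in page graph; the page contents and next-link values below are hypothetical, mimicking what the CSS selector and `response.follow()` would produce:

```python
# Stand-in crawl data: each page maps to (items, next-page URL or None).
# In a real spider, "items" comes from response.css() on the current page
# and the next URL from the pagination selector; both are hypothetical here.
PAGES = {
    "/products/page/1/": (["item-1", "item-2"], "/products/page/2/"),
    "/products/page/2/": (["item-3", "item-4"], None),  # selector returns no link
}

def crawl(start):
    """Follow next-page links until the selector stops returning one."""
    collected, url = [], start
    while url is not None:
        items, next_url = PAGES[url]  # parse() scraping the current response
        collected.extend(items)       # the yielded items
        url = next_url                # yield response.follow(next_page, ...)
    return collected

print(crawl("/products/page/1/"))  # items from both pages, in crawl order
```

The same callback handles every page, so the loop ends naturally on the last page, where the selector returns nothing.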
Current Scrapy releases keep fixed entry points simple: the default start() implementation reads start_urls, and newly generated projects already seed polite crawl defaults such as ROBOTSTXT_OBEY = True and a one-second DOWNLOAD_DELAY. If the listing loads the next batch through background XHR requests or browser-rendered content instead of an HTML link, switch to an API-focused or browser-rendered workflow instead of forcing plain pagination logic onto the page.
Related: How to use CrawlSpider in Scrapy
Related: How to set a crawl depth limit in Scrapy
Steps to scrape paginated pages with Scrapy:
- Create a new Scrapy project for the crawl.
$ scrapy startproject pagination_demo
New Scrapy project 'pagination_demo', using template directory '##### snipped #####', created in:
    /home/user/pagination_demo

You can start your first spider with:
    cd pagination_demo
    scrapy genspider example example.com
- Change to the new project directory.
$ cd pagination_demo
- Generate a basic spider for the target domain.
$ scrapy genspider listing catalog.example.com
Created spider 'listing' using template 'basic' in module:
  pagination_demo.spiders.listing
- Open the first listing page in the Scrapy shell.
$ scrapy shell 'https://catalog.example.com/products/page/1/'
##### snipped #####
[s] Available Scrapy objects:
[s]   response   <200 https://catalog.example.com/products/page/1/>
##### snipped #####
- Extract the next-page link in the shell and resolve it to the full URL.
>>> response.css("a.next::attr(href)").get()
'/products/page/2/'
>>> response.urljoin(response.css("a.next::attr(href)").get())
'https://catalog.example.com/products/page/2/'
response.follow() can use the relative href directly, so the absolute URL check is mainly for selector validation.
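response.urljoin() follows standard URL resolution, which can be checked with the stdlib alone; the catalog URLs below are this guide's example addresses:

```python
from urllib.parse import urljoin

base = "https://catalog.example.com/products/page/1/"

# Root-relative href, as returned by the selector above: resolves
# against the site root regardless of the current page path.
print(urljoin(base, "/products/page/2/"))

# A page-relative href instead resolves against the current directory,
# so a selector returning the wrong form still joins "successfully" --
# checking the absolute result in the shell catches this early.
print(urljoin(base, "2/"))
```

The first call prints the expected page-2 URL; the second shows how a page-relative href lands under /products/page/1/ instead, which is why validating the joined URL in the shell is worthwhile.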
- Review the generated crawl defaults in pagination_demo/settings.py before raising request rates.
- pagination_demo/settings.py
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
FEED_EXPORT_ENCODING = "utf-8"
New Scrapy projects seed these values by default in current releases. Enable AutoThrottle only when the target site needs adaptive backoff.
Related: How to enable AutoThrottle in Scrapy
Related: How to set a download delay in Scrapy
Increasing concurrency or removing delays too early can trigger rate limits, partial exports, or an IP block before the crawl finishes.
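If the target does turn out to need adaptive backoff, AutoThrottle is enabled in the same settings.py; the values below are illustrative starting points, not the project defaults:

```python
# pagination_demo/settings.py -- illustrative AutoThrottle sketch
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay before latency feedback applies
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound on backoff under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
```

With AutoThrottle on, the fixed DOWNLOAD_DELAY acts as a floor while the extension adjusts the effective delay from observed response latency.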
- Replace the generated spider with pagination-aware parsing logic.
- pagination_demo/spiders/listing.py
import scrapy


class ListingSpider(scrapy.Spider):
    name = "listing"
    allowed_domains = ["catalog.example.com"]
    start_urls = ["https://catalog.example.com/products/page/1/"]

    def parse(self, response):
        for card in response.css("article.product"):
            href = card.css("a.detail::attr(href)").get()
            yield {
                "title": card.css("h2::text").get(default="").strip(),
                "price": card.css("span.price::text").get(default="").strip(),
                "url": response.urljoin(href) if href else "",
            }

        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Replace article.product, h2::text, span.price::text, a.detail, and a.next with selectors from the target site's actual HTML.
Related: How to use CSS selectors in Scrapy
Related: How to follow links from a response in Scrapy
- Run the spider and export the collected items to JSON.
$ scrapy crawl listing -O listing.json
##### snipped #####
2026-04-16 06:16:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://catalog.example.com/products/page/1/> (referer: None)
2026-04-16 06:16:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://catalog.example.com/products/page/2/> (referer: https://catalog.example.com/products/page/1/)
2026-04-16 06:16:34 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: listing.json
2026-04-16 06:16:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_count': 3,
 'item_scraped_count': 4,
 'request_depth_max': 1}
2026-04-16 06:16:34 [scrapy.core.engine] INFO: Spider closed (finished)
Option -O overwrites the output file on each run.
- Open the exported file to confirm that items from later pages were written to the feed.
$ cat listing.json
[
  {"title": "Starter Plan", "price": "$29", "url": "https://catalog.example.com/products/starter-plan.html"},
  {"title": "Team Plan", "price": "$79", "url": "https://catalog.example.com/products/team-plan.html"},
  {"title": "Growth Plan", "price": "$129", "url": "https://catalog.example.com/products/growth-plan.html"},
  {"title": "Enterprise Plan", "price": "$249", "url": "https://catalog.example.com/products/enterprise-plan.html"}
]
Notes
- Keep the pagination selector narrow enough to match only the forward link. Calendar widgets, faceted navigation, and repeated sort controls can widen the crawl far beyond one results list.
- Keep start_urls when the first page is fixed, and move to async def start() only when the opening request depends on arguments, cookies, headers, or another runtime value.
- Use How to scrape an infinite scrolling page with Scrapy or How to scrape a JavaScript-rendered page with Scrapy using Playwright when the next batch is loaded through background requests or rendered client-side instead of an HTML pagination link.
- Add URL guards or a depth limit when the next link can loop back into archive calendars, search filters, or duplicate result pages.
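A URL guard of the kind mentioned above can be sketched as a small predicate checked before following a link; the regex below assumes this guide's hypothetical /products/page/N/ URL scheme:

```python
import re

# Hypothetical pattern matching only this guide's forward-pagination URLs.
NEXT_PAGE = re.compile(r"^/products/page/\d+/$")

def should_follow(href, seen):
    """Follow a link only if it matches the pagination pattern
    and has not been scheduled before (loop protection)."""
    if href is None or not NEXT_PAGE.match(href):
        return False
    if href in seen:  # e.g. a "page 1" link repeated on every page
        return False
    seen.add(href)
    return True

scheduled = set()
print(should_follow("/products/page/2/", scheduled))   # True: new pagination URL
print(should_follow("/products/page/2/", scheduled))   # False: already scheduled
print(should_follow("/products?sort=asc", scheduled))  # False: not pagination
```

Scrapy's scheduler already filters duplicate requests by default, so the seen-set mainly illustrates the loop concern; for a hard cap on how deep the next-link chain can run, set DEPTH_LIMIT in settings.py instead.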
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
