Web crawls regularly hit non-success HTTP responses like 403, 404, and 500, and handling them deliberately keeps results predictable instead of silently dropping pages or cluttering logs with uncaught exceptions.
Scrapy passes responses through the HttpError spider middleware (HttpErrorMiddleware) before calling spider callbacks. By default, the middleware filters out responses with status codes outside the 200-299 range by raising an HttpError exception, so the callback never receives the response body or the status code for normal processing.
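Under that default behavior, the only place a spider can observe a filtered response is in a request errback. A minimal sketch, assuming the same demo host used later in this guide and a hypothetical spider name:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class ErrbackDemoSpider(scrapy.Spider):
    name = "errback_demo"  # hypothetical name for illustration

    def start_requests(self):
        yield scrapy.Request(
            "http://app.internal.example:8000/errors/404",
            callback=self.parse,
            errback=self.on_error,
        )

    def parse(self, response):
        # Only responses in the 200-299 range reach this callback by default.
        yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        # The HttpError raised by the middleware carries the filtered response.
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.warning("Ignored %s for %s", response.status, response.url)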
A spider can opt in to receiving specific non-2xx responses by setting handle_httpstatus_list (or allow everything through with the handle_httpstatus_all request meta key). Keeping the allowlist narrow avoids processing error pages as real content and helps separate permanent failures (missing pages) from transient failures that retry logic can recover from.
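The same allowlist can also be applied project-wide through the HTTPERROR_ALLOWED_CODES setting. A settings.py sketch, assuming the same status codes used in the spider below:

# settings.py
# Let 404 and 500 responses through to callbacks for every spider in the
# project; HTTPERROR_ALLOW_ALL = True would disable the filter entirely.
HTTPERROR_ALLOWED_CODES = [404, 500]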
Related: How to set request headers in Scrapy
Related: How to configure retries in Scrapy
Steps to handle HTTP error responses in Scrapy:
- Open the spider file that needs to process error responses.
$ vi simplifiedguide/spiders/http_errors_demo.py
- Replace the spider class with a version that opts in to selected HTTP status codes.
import scrapy


class HttpErrorsDemoSpider(scrapy.Spider):
    name = "http_errors_demo"
    start_urls = [
        "http://app.internal.example:8000/",
        "http://app.internal.example:8000/errors/404",
    ]
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        if response.status in (404, 500):
            self.logger.info("Missing page (%s): %s", response.status, response.url)
            return

        name = response.css("h1::text").get(default="").strip()
        yield {
            "name": name,
            "url": response.url,
        }
Opt in per request by setting meta={"handle_httpstatus_list": [404, 500]} on a Request when only specific URLs should be handled.
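A minimal sketch of the per-request form, assuming a hypothetical parse_detail callback defined elsewhere in the spider:

    def start_requests(self):
        # Only this request opts in; every other request keeps the
        # default HttpError filtering.
        yield scrapy.Request(
            "http://app.internal.example:8000/errors/404",
            callback=self.parse_detail,  # hypothetical callback
            meta={"handle_httpstatus_list": [404, 500]},
        )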
Using handle_httpstatus_all can route every non-200 response into callbacks and quickly turn error pages into noisy “data”.
- Run the spider.
$ scrapy crawl http_errors_demo -s LOG_LEVEL=DEBUG -s HTTPCACHE_ENABLED=False
2026-01-01 08:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/> (referer: None)
2026-01-01 08:49:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://app.internal.example:8000/>
{'name': 'Example Portal', 'url': 'http://app.internal.example:8000/'}
2026-01-01 08:49:17 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://app.internal.example:8000/errors/404> (referer: None)
2026-01-01 08:49:18 [http_errors_demo] INFO: Missing page (404): http://app.internal.example:8000/errors/404
##### snipped #####
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
