Web crawls regularly hit non-success HTTP responses like 403, 404, and 500, and handling them deliberately keeps results predictable instead of silently dropping pages or cluttering logs with uncaught exceptions.
Scrapy passes responses through HttpErrorMiddleware before calling spider callbacks. By default, the middleware filters out responses with status codes outside the 200–299 range by raising HttpError, so the callback never receives the response body or status code for normal processing.
A spider can opt in to receiving specific non-200 responses by setting handle_httpstatus_list (or allowing everything with handle_httpstatus_all). Keeping the allowlist narrow avoids processing error pages as real content, and helps separate permanent failures (missing pages) from transient failures that retry logic can recover from.
Related: How to set request headers in Scrapy
Related: How to configure retries in Scrapy
$ vi simplifiedguide/spiders/http_errors_demo.py
import scrapy


class HttpErrorsDemoSpider(scrapy.Spider):
    name = "http_errors_demo"
    start_urls = [
        "http://app.internal.example:8000/",
        "http://app.internal.example:8000/errors/404",
    ]
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        if response.status in (404, 500):
            self.logger.info("Missing page (%s): %s", response.status, response.url)
            return
        name = response.css("h1::text").get(default="").strip()
        yield {
            "name": name,
            "url": response.url,
        }
Opt in per request by setting meta={"handle_httpstatus_list": [404, 500]} on a Request when only specific URLs should be handled.
Using handle_httpstatus_all routes every response, whatever its status, into callbacks and can quickly turn error pages into noisy “data”.
$ scrapy crawl http_errors_demo -s LOG_LEVEL=DEBUG -s HTTPCACHE_ENABLED=False
2026-01-01 08:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/> (referer: None)
2026-01-01 08:49:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://app.internal.example:8000/>
{'name': 'Example Portal', 'url': 'http://app.internal.example:8000/'}
2026-01-01 08:49:17 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://app.internal.example:8000/errors/404> (referer: None)
2026-01-01 08:49:18 [http_errors_demo] INFO: Missing page (404): http://app.internal.example:8000/errors/404
##### snipped #####