Web crawls regularly hit non-success HTTP responses like 403, 404, and 500, and handling them deliberately keeps results predictable instead of silently dropping pages or cluttering logs with uncaught exceptions.

Scrapy passes responses through HttpErrorMiddleware before calling spider callbacks. By default, the middleware filters out any response with a status code outside the 2xx range by raising an HttpError exception, so the callback never receives the response body or status code for normal processing.
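
The same filter can also be relaxed project-wide through the middleware's settings. A minimal settings.py sketch (the listed codes are only examples, not recommendations):

    # settings.py
    # Let these non-2xx statuses reach spider callbacks for every request.
    HTTPERROR_ALLOWED_CODES = [404, 410]
    # Or disable the filtering entirely (rarely a good idea):
    # HTTPERROR_ALLOW_ALL = True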

A spider can opt in to receiving specific non-2xx responses by setting the handle_httpstatus_list attribute (or allowing everything through with the handle_httpstatus_all request meta key). Keeping the allowlist narrow avoids processing error pages as real content and helps separate permanent failures (missing pages) from transient failures that retry logic can recover from.
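
One way to draw that line is to leave transient statuses to the built-in RetryMiddleware and opt the spider in only to permanent ones. A sketch of the relevant settings, with illustrative values close to Scrapy's defaults:

    # settings.py
    # Transient failures: RetryMiddleware re-downloads these before giving up.
    RETRY_ENABLED = True
    RETRY_HTTP_CODES = [500, 502, 503, 504, 429]
    RETRY_TIMES = 2
    # Permanent failures such as 404 are opted into via handle_httpstatus_list
    # and dealt with in the spider callback instead.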

Steps to handle HTTP error responses in Scrapy:

  1. Open the spider file that needs to process error responses.
    $ vi simplifiedguide/spiders/http_errors_demo.py
  2. Replace the spider class with a version that opts in to selected HTTP status codes.
    import scrapy
     
     
    class HttpErrorsDemoSpider(scrapy.Spider):
        name = "http_errors_demo"
        start_urls = [
            "http://app.internal.example:8000/",
            "http://app.internal.example:8000/errors/404",
        ]
        # Allow these non-2xx statuses to reach parse() instead of being filtered.
        handle_httpstatus_list = [404, 500]
     
        def parse(self, response):
            # Statuses from handle_httpstatus_list arrive here; log and stop
            # rather than scraping the error page as if it were content.
            if response.status in (404, 500):
                self.logger.info("Error response (%s): %s", response.status, response.url)
                return
     
            name = response.css("h1::text").get(default="").strip()
     
            yield {
                "name": name,
                "url": response.url,
            }

    Opt in per request by setting meta={"handle_httpstatus_list": [404, 500]} on a Request when only specific URLs should be handled (a sketch of this per-request form follows below).

    Setting handle_httpstatus_all to True in Request.meta (or enabling the HTTPERROR_ALLOW_ALL setting project-wide) routes every non-2xx response into callbacks and quickly turns error pages into noisy “data”.
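
    A sketch of the per-request form, reusing the demo URL above (the spider name http_errors_partial is hypothetical):

    import scrapy


    class HttpErrorsPartialSpider(scrapy.Spider):
        name = "http_errors_partial"

        def start_requests(self):
            # Only this request may deliver a 404 to the callback; any other
            # non-2xx response is still filtered by HttpErrorMiddleware.
            yield scrapy.Request(
                "http://app.internal.example:8000/errors/404",
                meta={"handle_httpstatus_list": [404]},
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Got status %s for %s", response.status, response.url)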

  3. Run the spider.
    $ scrapy crawl http_errors_demo -s LOG_LEVEL=DEBUG -s HTTPCACHE_ENABLED=False
    2026-01-01 08:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/> (referer: None)
    2026-01-01 08:49:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://app.internal.example:8000/>
    {'name': 'Example Portal', 'url': 'http://app.internal.example:8000/'}
    2026-01-01 08:49:17 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://app.internal.example:8000/errors/404> (referer: None)
    2026-01-01 08:49:18 [http_errors_demo] INFO: Error response (404): http://app.internal.example:8000/errors/404
    ##### snipped #####