HTTP error responses such as 404 and 500 are normal crawl outcomes, and handling them deliberately lets a spider log the failure, branch to fallback logic, or skip an error page before it turns into a bad item.

Current Scrapy releases pass responses through HttpErrorMiddleware before spider callbacks run. By default that middleware filters out non-2xx responses, so parse() never receives them unless the spider (handle_httpstatus_list), the request (the handle_httpstatus_list meta key), or the project settings (HTTPERROR_ALLOWED_CODES) explicitly allow that status code.

Keep the allowlist narrow and check response.status as soon as the callback runs. Broad switches such as handle_httpstatus_all or HTTPERROR_ALLOW_ALL are useful for debugging, but they also route every error response into normal parsing and can quickly pollute items or exports with failure pages.

Steps to handle HTTP error responses in Scrapy:

  1. Open the spider file that should receive the error response.
    $ vi http_errors_demo.py
  2. Add a focused allowlist and branch on response.status inside the callback.
    import scrapy

    class HttpErrorsDemoSpider(scrapy.Spider):
        name = "http_errors_demo"
        start_urls = [
            "http://app.internal.example:8000/",
            "http://app.internal.example:8000/missing",
        ]
        # Allow 404 responses through HttpErrorMiddleware to parse();
        # every other non-2xx status is still filtered out.
        handle_httpstatus_list = [404]

        def parse(self, response):
            # Branch on the status first so error pages never become items.
            if response.status == 404:
                self.logger.info("Handled HTTP %s for %s", response.status, response.url)
                return

            yield {
                "title": response.css("h1::text").get(),
                "url": response.url,
            }

    Set meta={"handle_httpstatus_list": [404]} on a Request when only one request should receive that status code. Related: How to use request meta in Scrapy
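
    A minimal sketch of the per-request form, reusing the demo host; the spider name and callback name are illustrative. The meta key is checked before the spider attribute and the settings-level allowlist, so it wins for that one request.

    import scrapy

    class PerRequestDemoSpider(scrapy.Spider):
        name = "per_request_demo"

        def start_requests(self):
            # Only this request may deliver a 404 to its callback; all
            # other requests keep the default HttpErrorMiddleware filtering.
            yield scrapy.Request(
                "http://app.internal.example:8000/missing",
                meta={"handle_httpstatus_list": [404]},
                callback=self.parse_missing,
            )

        def parse_missing(self, response):
            if response.status == 404:
                self.logger.info("Handled HTTP %s for %s", response.status, response.url)
                return
            yield {"url": response.url}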

  3. Run the spider and confirm the callback receives the 404 response instead of letting HttpErrorMiddleware drop it.
    $ scrapy runspider http_errors_demo.py -s LOG_LEVEL=INFO -s HTTPCACHE_ENABLED=False
    2026-04-16 06:18:17 [scrapy.core.engine] INFO: Spider opened
    2026-04-16 06:18:17 [http_errors_demo] INFO: Handled HTTP 404 for http://app.internal.example:8000/missing
    2026-04-16 06:18:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 1,
     'item_scraped_count': 1,
    ##### snipped #####
    }
    2026-04-16 06:18:17 [scrapy.core.engine] INFO: Spider closed (finished)

    The spider log line proves the 404 response reached parse(), and the stats confirm Scrapy still counted both the successful page and the handled error response.

    Disable the HTTP cache (HTTPCACHE_ENABLED=False) or clear its directory while testing status handling so cached responses do not hide changes in target behavior or middleware rules.

Notes

  • Set HTTPERROR_ALLOWED_CODES in settings.py when the same allowlist should apply across multiple spiders in one project; a settings sketch follows this list.
  • Use handle_httpstatus_all or HTTPERROR_ALLOW_ALL only for debugging or dedicated error-capture spiders, because they pass every non-2xx response into callbacks.
  • Use an errback for download exceptions such as DNS failures, refused connections, or timeouts, because those failures never produce an HTTP response object for handle_httpstatus_list to inspect; see the errback sketch after this list.
  • Inside a full Scrapy project, the same spider attribute works with scrapy crawl <spider_name> instead of scrapy runspider.
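
A minimal settings.py sketch for the project-wide allowlist; the extra 410 code is illustrative:

    # settings.py
    # Project-wide allowlist: these statuses reach spider callbacks in every
    # spider, unless a spider attribute or request meta overrides the list.
    HTTPERROR_ALLOWED_CODES = [404, 410]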
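
A minimal errback sketch for the download-exception case; the unreachable hostname and spider name are illustrative:

    import scrapy
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    class ErrbackDemoSpider(scrapy.Spider):
        name = "errback_demo"

        def start_requests(self):
            # Hypothetical unreachable host: DNS resolution should fail,
            # so no HTTP response object is ever created.
            yield scrapy.Request(
                "http://nonexistent.internal.example/",
                callback=self.parse,
                errback=self.on_error,
            )

        def parse(self, response):
            yield {"url": response.url}

        def on_error(self, failure):
            # failure wraps the exception; failure.request is the original Request.
            if failure.check(DNSLookupError):
                self.logger.warning("DNS lookup failed for %s", failure.request.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                self.logger.warning("Timed out: %s", failure.request.url)
            else:
                self.logger.warning("Download error: %r", failure)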