Using an HTTP proxy in Scrapy routes outbound requests through a different network path, which is useful for controlled egress, IP-based access rules, and provider-managed proxy pools.

Scrapy applies proxy routing through its built-in HttpProxyMiddleware. The common pattern is to set meta["proxy"] on each Request that should use the proxy, and since Scrapy 2.13 new spider examples should use the async start() entrypoint instead of the older start_requests().

A proxy can slow requests, fail independently of the target site, or expose traffic to another operator. Use only approved proxies, keep credentials out of source control, and verify the routed request against a simple HTTP echo endpoint before using the same setting in a larger crawl.

Steps to use an HTTP proxy in Scrapy:

  1. Open the Scrapy project directory that contains scrapy.cfg.
    $ cd /srv/proxydemo
  2. Confirm the project has not disabled proxy support.
    $ scrapy settings --get HTTPPROXY_ENABLED
    True

    Current Scrapy projects leave this enabled by default, so you only need to restore HTTPPROXY_ENABLED = True when an older project or custom settings turned it off. If the project replaces DOWNLOADER_MIDDLEWARES, make sure scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware is still enabled there as well.
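
    If both pieces do need restoring, a minimal settings.py sketch could look like this; 750 matches the middleware's default priority in DOWNLOADER_MIDDLEWARES_BASE.

    # settings.py: only needed when an older project or custom
    # settings disabled proxy support.
    HTTPPROXY_ENABLED = True

    DOWNLOADER_MIDDLEWARES = {
        # Ensure the built-in proxy middleware stays enabled; 750 is
        # its default priority, and a None value here would disable it.
        "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    }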

  3. Add the proxy URL to the request metadata in the spider.
    import scrapy


    class HeadersSpider(scrapy.Spider):
        name = "headers"

        async def start(self):
            # Route this request through the proxy; the credentials
            # travel in the userinfo part of the proxy URL.
            yield scrapy.Request(
                "http://origin.example.net/headers",
                meta={"proxy": "http://proxy-user:proxy-pass@proxy.example.net:8888"},
            )

        def parse(self, response):
            # Header values arrive as bytes, so decode before logging.
            via_header = [value.decode() for value in response.headers.getlist("Via")]
            self.logger.info("Via header: %s", via_header)
            yield {
                "status": response.status,
                "url": response.url,
            }

    Scrapy reads proxy credentials from the proxy URL itself, so authenticated proxies use the same meta["proxy"] field instead of a separate request option.

    URL-encode reserved characters in the proxy username or password before placing them in the proxy URL; otherwise the middleware may parse the credentials incorrectly.
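
    A short sketch of the encoding step, assuming hypothetical credentials that contain reserved characters:

    from urllib.parse import quote

    # Hypothetical credentials; quote() percent-encodes the reserved
    # characters so the middleware parses them correctly.
    user = quote("proxy-user", safe="")
    password = quote("p@ss:word/2024", safe="")
    proxy_url = f"http://{user}:{password}@proxy.example.net:8888"
    # proxy_url == "http://proxy-user:p%40ss%3Aword%2F2024@proxy.example.net:8888"

    The resulting proxy_url drops into meta["proxy"] exactly as in step 3.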

  4. Run the spider and confirm the response came back through the proxy.
    $ scrapy crawl headers -L INFO
    ##### snipped #####
    [headers] INFO: Via header: ['1.1 proxy.example.net']
    [scrapy.core.engine] INFO: Spider closed (finished)

    The Via header in the sample output shows that the request passed through the proxy before Scrapy received the response.

  5. Open scrapy shell and re-test the same URL through the proxy before applying the pattern to more requests.
    $ scrapy shell 'http://origin.example.net/headers' --nolog
    >>> fetch(response.url, meta={"proxy": "http://proxy.example.net:8888"})
    >>> response.status
    200
    >>> response.headers.getlist("Via")
    [b'1.1 proxy.example.net']

    If your proxy does not add a Via header, confirm routing in the proxy access log or by checking the outbound IP or headers that the target test endpoint reports.
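
    One way to check the outbound IP from the shell, assuming a hypothetical /ip echo path that reports the caller's address as JSON under an "origin" key:

    $ scrapy shell --nolog
    >>> fetch("http://origin.example.net/ip")
    >>> direct_ip = response.json()["origin"]
    >>> fetch("http://origin.example.net/ip", meta={"proxy": "http://proxy.example.net:8888"})
    >>> response.json()["origin"] != direct_ip
    True

    A differing origin value confirms the second request left through the proxy.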

Notes

  • Set meta["proxy"] on every Request that should use the proxy, including follow-up requests created in callbacks; the sketch after these notes shows one way to reapply it.
  • Set meta["proxy"] = None on one request when the process inherited http_proxy or https_proxy from the environment but that specific request must go direct.
  • Current Scrapy uses latin-1 for proxy credentials unless HTTPPROXY_AUTH_ENCODING is changed, so switch that setting to utf-8 when the proxy username or password contains characters outside latin-1.
  • Older projects that must stay compatible with pre-2.13 Scrapy may still use start_requests(), but current Scrapy examples should use start() for new spider entrypoints.
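
The notes above condense into one illustrative spider; the spider name, URLs, and next-page selector are placeholders, not part of the example project.

    import scrapy


    class NotesSpider(scrapy.Spider):
        name = "notes"

        custom_settings = {
            # Only needed when the proxy username or password contains
            # characters outside latin-1.
            "HTTPPROXY_AUTH_ENCODING": "utf-8",
        }

        async def start(self):
            yield scrapy.Request(
                "http://origin.example.net/headers",
                meta={"proxy": "http://proxy.example.net:8888"},
            )
            # Opt this one request out of any http_proxy/https_proxy
            # inherited from the environment.
            yield scrapy.Request(
                "http://origin.example.net/direct",
                meta={"proxy": None},
            )

        def parse(self, response):
            # meta["proxy"] is not inherited by new requests, so reapply
            # it on every follow-up created in a callback.
            for href in response.css("a.next::attr(href)").getall():
                yield response.follow(href, meta={"proxy": response.meta.get("proxy")})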