How to use an HTTP proxy in Scrapy

Using an HTTP proxy in Scrapy routes selected requests through another network path, which is useful when a crawl must leave through an approved egress IP, a provider-managed proxy pool, or a controlled inspection point before traffic reaches the target site.

Scrapy applies proxy routing through its built-in HttpProxyMiddleware. You can set meta["proxy"] on an individual Request, or let the middleware read http_proxy and https_proxy from the process environment. A request-level meta["proxy"] value takes precedence over those environment variables, and no_proxy is ignored for that request.
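
Both mechanisms can be sketched side by side; the hosts below are the same placeholders used in the steps that follow:

    import scrapy

    # Per-request routing: this request uses the proxy even when http_proxy
    # and https_proxy are unset, and no_proxy is not consulted for it.
    explicit = scrapy.Request(
        "http://origin.example.net/ip",
        meta={"proxy": "http://proxy.example.net:8888"},
    )

    # Environment routing: without meta["proxy"], HttpProxyMiddleware picks
    # up http_proxy / https_proxy (and honors no_proxy) when it starts, e.g.:
    #   $ https_proxy="http://proxy.example.net:8888" scrapy crawl proxy_check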

A proxy adds another failure point and exposes request data to another operator, so test it against a simple echo endpoint before using it in a larger crawl. URL-encode reserved characters in proxy credentials, and change HTTPPROXY_AUTH_ENCODING when the proxy username or password needs more than the default latin-1 encoding. Keep in mind that current Scrapy docs note that HttpxDownloadHandler does not support the proxy request meta key.
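
Reserved characters in credentials can be percent-encoded before they go into the proxy URL; this helper is a sketch built on the standard library, not part of Scrapy:

    from urllib.parse import quote

    def build_proxy_url(user, password, host, port):
        # Percent-encode every reserved character so a password such as
        # "p@ss:w0rd" cannot break the authority part of the proxy URL.
        return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

    # build_proxy_url("proxy-user", "p@ss:w0rd", "proxy.example.net", 8888)
    # -> 'http://proxy-user:p%40ss%3Aw0rd@proxy.example.net:8888'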

Steps to use an HTTP proxy in Scrapy:

  1. Change to the Scrapy project directory before reading the active settings or editing spider code.
    $ cd /srv/proxydemo
  2. Confirm the project still has proxy middleware enabled.
    $ scrapy settings --get HTTPPROXY_ENABLED
    True

    Fresh projects leave this enabled by default, but projects that replace DOWNLOADER_MIDDLEWARES must keep scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware active.
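
    A settings.py sketch, assuming a hypothetical project middleware alongside it; 750 is the proxy middleware's priority in DOWNLOADER_MIDDLEWARES_BASE:

    DOWNLOADER_MIDDLEWARES = {
        # Keep the built-in proxy middleware at its stock priority.
        "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
        # Hypothetical project middleware added next to it.
        "proxydemo.middlewares.ProxydemoDownloaderMiddleware": 543,
    }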

  3. Open the spider file that will send the proxied request.
    $ vi proxydemo/spiders/proxy_check.py
  4. Add the proxy URL to the request metadata and log the origin that the target sees.
    import json
    import scrapy
     
     
    class ProxyCheckSpider(scrapy.Spider):
        name = "proxy_check"
     
        async def start(self):
            yield scrapy.Request(
                "http://origin.example.net/ip",
                meta={"proxy": "http://proxy-user:proxy-pass@proxy.example.net:8888"},
                callback=self.parse_ip,
            )
     
        def parse_ip(self, response):
            payload = json.loads(response.text)
            self.logger.info("origin seen by target: %s", payload["origin"])
            yield {"origin": payload["origin"]}

    Scrapy reads proxy credentials from the proxy URL and sends Proxy-Authorization automatically when a username is present. The proxy applies only to requests that carry that meta["proxy"] value; callback-generated follow-up requests do not inherit it, so set it on each request that must stay on the proxy. Set HTTPPROXY_AUTH_ENCODING = "utf-8" when proxy credentials use characters outside the default latin-1 range, and URL-encode reserved characters before placing them in the URL.
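
    When a callback must keep later hops on the same egress, reuse the value from response.meta; a sketch, assuming a hypothetical /headers path and parse_headers callback:

    def parse_ip_then_headers(self, response):
        # meta["proxy"] is not inherited by follow-up requests; pass it along
        # explicitly to keep the next hop on the same proxy.
        yield response.follow(
            "/headers",                             # hypothetical next path
            meta={"proxy": response.meta["proxy"]},
            callback=self.parse_headers,            # hypothetical callback
        )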

  5. Run the spider and confirm the logged origin matches the proxy egress address.
    $ scrapy crawl proxy_check -L INFO
    ##### snipped #####
    [proxy_check] INFO: origin seen by target: 198.51.100.24
    [scrapy.core.engine] INFO: Spider closed (finished)

    Use an echo endpoint that reports the client IP or request headers, and treat the proxy egress address as the success check rather than the crawler host's own address.
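
    The success check can also be made explicit in the callback; a variant of parse_ip, assuming the proxy pool's egress address is known ahead of time:

    EXPECTED_EGRESS = "198.51.100.24"  # assumed egress IP of the proxy pool

    def parse_ip(self, response):
        origin = json.loads(response.text)["origin"]
        if origin != EXPECTED_EGRESS:
            # The target saw another address: the request bypassed the proxy
            # or the pool rotated to an unexpected exit.
            self.logger.warning("origin %s != expected %s", origin, EXPECTED_EGRESS)
        yield {"origin": origin}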

  6. Open the Scrapy shell and repeat one request through the same proxy before applying the pattern to more crawl paths.
    $ scrapy shell --nolog
    >>> import json
    >>> fetch("http://origin.example.net/ip", meta={"proxy": "http://proxy-user:proxy-pass@proxy.example.net:8888"})
    >>> json.loads(response.text)
    {'origin': '198.51.100.24'}

    If the crawler process exports http_proxy or https_proxy and one request must go direct, set meta["proxy"] = None on that request.
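
    A sketch of that bypass inside a spider callback, assuming a hypothetical internal URL that must not go through the proxy:

    def schedule_direct_request(self, response):
        # With http_proxy/https_proxy exported, every request is proxied
        # unless meta["proxy"] is explicitly None for that request.
        yield scrapy.Request(
            "http://internal.example.net/health",  # hypothetical direct target
            meta={"proxy": None},
            callback=self.parse_health,            # hypothetical callback
        )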