Routing Scrapy requests through an HTTP proxy supports controlled egress, geo-specific access, and separation between scraping workloads and the originating network.
Scrapy sends each request through a downloader stack built on Twisted. A proxy is applied per request via the proxy key in Request.meta, and Scrapy’s HttpProxyMiddleware translates that metadata into the correct connection behavior (including CONNECT tunneling for HTTPS targets).
Proxies can observe and potentially modify traffic, and shared proxies can add latency or produce inconsistent results. Only use proxies that are trusted and permitted for the target site, keep proxy credentials out of source control, and keep crawling behavior polite with reasonable concurrency and delays.
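HttpProxyMiddleware also obeys the standard http_proxy, https_proxy, and no_proxy environment variables, so a one-off crawl can be proxied without touching code; the proxy URL and spider name below are placeholders:
$ https_proxy="http://proxy.example.net:8888" scrapy crawl example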
Steps to use an HTTP proxy in Scrapy:
- Prepare a proxy URL in scheme://host:port format.
http://proxy.example.net:8888
http://username:password@proxy.example.net:8888
URL-encode reserved characters in usernames or passwords (such as @, :, or /) to avoid parsing errors.
Most forward proxies use an http:// proxy URL even when requesting HTTPS pages.
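To build a safely encoded proxy URL, Python's urllib.parse.quote can percent-encode the credentials; the username and password below are placeholders:
from urllib.parse import quote

# Percent-encode reserved characters so the proxy URL parses unambiguously.
username = quote("user@example.com", safe="")
password = quote("p@ss:w/rd", safe="")
proxy_url = f"http://{username}:{password}@proxy.example.net:8888"
# -> http://user%40example.com:p%40ss%3Aw%2Frd@proxy.example.net:8888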
- Set the proxy configuration values in settings.py.
- settings.py
PROXY_URL = "http://proxy.example.net:8888"
PROXY_LIST = []
Populate PROXY_LIST with multiple proxy URLs to rotate proxies per request.
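For example, a small rotation pool (both hostnames are placeholders) looks like this:
PROXY_LIST = [
    "http://proxy1.example.net:8888",
    "http://proxy2.example.net:8888",
]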
- Add a downloader middleware that sets request.meta['proxy'] from the project settings.
- middlewares.py
import random
from typing import List, Optional

from scrapy.crawler import Crawler
from scrapy.http import Request


class ProxyMiddleware:
    """Injects a proxy URL into request.meta when proxying is enabled in settings."""

    def __init__(self, proxy_url: Optional[str], proxy_list: List[str]) -> None:
        self._proxy_url = proxy_url.strip() if isinstance(proxy_url, str) else None
        self._proxy_list = [p.strip() for p in proxy_list if isinstance(p, str) and p.strip()]

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> "ProxyMiddleware":
        proxy_url = crawler.settings.get("PROXY_URL")
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        return cls(proxy_url=proxy_url, proxy_list=proxy_list)

    def process_request(self, request: Request, spider) -> None:
        # Respect an explicit per-request opt-out.
        if request.meta.get("dont_proxy"):
            return
        # A proxy already set in the spider takes precedence.
        if request.meta.get("proxy"):
            return
        proxy = self._pick_proxy()
        if not proxy:
            return
        request.meta["proxy"] = proxy

    def _pick_proxy(self) -> Optional[str]:
        # Rotate through PROXY_LIST when populated; otherwise fall back to PROXY_URL.
        if self._proxy_list:
            return random.choice(self._proxy_list)
        return self._proxy_url
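Because the middleware reads its configuration from crawler.settings, both values can also be supplied per run with Scrapy's -s flag; settings.getlist() splits a comma-separated string, so a rotation pool works from the command line too. The spider name and hostnames below are placeholders:
$ scrapy crawl example -s PROXY_URL="http://proxy.example.net:8888"
$ scrapy crawl example -s PROXY_LIST="http://proxy1.example.net:8888,http://proxy2.example.net:8888"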
- Enable the proxy downloader middleware stack in settings.py.
- settings.py
HTTPPROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}
Merge the entries into an existing DOWNLOADER_MIDDLEWARES dictionary instead of overwriting unrelated middleware keys.
Replace myproject with the Scrapy project module name.
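To confirm the merged result, the resolved setting can be printed from the project directory; the exact output format may vary between Scrapy versions:
$ scrapy settings --get DOWNLOADER_MIDDLEWARES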
- Override the proxy for a single request by setting meta['proxy'] in the spider.
- spiders/example_spider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://app.internal.example:8000/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"proxy": "http://proxy.example.net:8888"},
            )
- Skip proxy injection for a request by setting meta['dont_proxy'] to True.
- spiders/example_spider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://app.internal.example:8000/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"dont_proxy": True},
            )
- Override request headers when a target expects them.
- spiders/example_spider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://app.internal.example:8000/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={
                    "User-Agent": "Mozilla/5.0",
                    "Referer": "http://app.internal.example:8000/",
                },
            )
- Start a Scrapy shell session from the project directory.
$ scrapy shell -s HTTPCACHE_ENABLED=False "http://app.internal.example:8000/headers"
Proxies that do not support tunneling may fail on HTTPS targets, often surfacing as timeouts or "Tunnel connection failed" errors.
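A quick way to check tunneling support outside Scrapy is to send an HTTPS request through the same proxy with curl (assuming curl is installed; the hostnames are placeholders):
$ curl -x http://proxy.example.net:8888 -I https://example.com/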
- Fetch the endpoint through the proxy and confirm the proxy adds a Via header.
>>> fetch("http://app.internal.example:8000/headers", meta={"proxy": "http://proxy.example.net:8888"})
>>> import json
>>> json.loads(response.text)["headers"]["Via"]
'1.1 tinyproxy (tinyproxy/1.11.1)'
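As a sanity check, repeating the fetch with dont_proxy set should produce a response without the Via header; this assumes the same headers-echoing test endpoint as above:
>>> fetch("http://app.internal.example:8000/headers", meta={"dont_proxy": True})
>>> "Via" in json.loads(response.text)["headers"]
False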
