Routing Scrapy requests through an HTTP proxy supports controlled egress, geo-specific access, and separation between scraping workloads and the originating network.
Scrapy sends each request through a downloader stack built on Twisted. Proxy use is applied per-request via the proxy value in Request.meta, and Scrapy’s HttpProxyMiddleware translates that metadata into the correct connection behavior (including CONNECT tunneling for HTTPS targets).
Proxies can observe and potentially modify traffic, and shared proxies can add latency or produce unstable results. Only use proxies that are trusted and permitted for the target site, keep proxy credentials out of source control, and keep crawling behavior polite with reasonable concurrency and delays.
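One way to keep credentials out of source control is to assemble the proxy URL from environment variables. A minimal sketch, assuming hypothetical variable names PROXY_HOST, PROXY_USER, and PROXY_PASS (not part of Scrapy itself):

```python
import os
from typing import Optional


def proxy_url_from_env() -> Optional[str]:
    """Build a proxy URL from environment variables, or return None
    when no proxy host is configured."""
    host = os.environ.get("PROXY_HOST")  # e.g. "proxy.example.net:8888"
    if not host:
        return None
    user = os.environ.get("PROXY_USER")
    password = os.environ.get("PROXY_PASS")
    if user and password:
        return f"http://{user}:{password}@{host}"
    return f"http://{host}"
```

The result can be assigned to PROXY_URL in settings.py, so the repository only ever contains the lookup logic, never the secret itself.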
http://proxy.example.net:8888
http://username:password@proxy.example.net:8888
URL-encode reserved characters in usernames or passwords (such as @, :, or /) to avoid parsing errors.
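The standard library's urllib.parse.quote handles this encoding; a short sketch with made-up credentials containing reserved characters:

```python
from urllib.parse import quote

# Passing safe="" percent-encodes every reserved character,
# including "@" (%40), "/" (%2F), and ":" (%3A).
user = quote("scrape@corp", safe="")
password = quote("p@ss/word", safe="")
proxy_url = f"http://{user}:{password}@proxy.example.net:8888"
# proxy_url == "http://scrape%40corp:p%40ss%2Fword@proxy.example.net:8888"
```

Without the encoding, the extra "@" would make the URL ambiguous and the proxy authority would be parsed incorrectly.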
Most forward proxies use an http:// proxy URL even when requesting HTTPS pages.
PROXY_URL = "http://proxy.example.net:8888"
PROXY_LIST = []
Populate PROXY_LIST with multiple proxy URLs to rotate proxies per request.
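For example, a rotating pool might look like the following settings fragment (the hosts are placeholders; per the middleware's logic, a non-empty PROXY_LIST takes precedence over PROXY_URL):

```python
# settings.py
PROXY_URL = None  # ignored once PROXY_LIST is non-empty
PROXY_LIST = [
    "http://proxy1.example.net:8888",
    "http://proxy2.example.net:8888",
    "http://user:pass@proxy3.example.net:8888",
]
```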
import random
from typing import List, Optional

from scrapy.crawler import Crawler
from scrapy.http import Request


class ProxyMiddleware:
    """Injects a proxy URL into request.meta when proxying is enabled in settings."""

    def __init__(self, proxy_url: Optional[str], proxy_list: List[str]) -> None:
        self._proxy_url = proxy_url.strip() if isinstance(proxy_url, str) else None
        self._proxy_list = [
            p.strip() for p in proxy_list if isinstance(p, str) and p.strip()
        ]

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> "ProxyMiddleware":
        proxy_url = crawler.settings.get("PROXY_URL")
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        return cls(proxy_url=proxy_url, proxy_list=proxy_list)

    def process_request(self, request: Request, spider) -> None:
        if request.meta.get("dont_proxy"):
            return
        if request.meta.get("proxy"):
            return
        proxy = self._pick_proxy()
        if not proxy:
            return
        request.meta["proxy"] = proxy

    def _pick_proxy(self) -> Optional[str]:
        if self._proxy_list:
            return random.choice(self._proxy_list)
        return self._proxy_url
HTTPPROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    "myproject.middlewares.ProxyMiddleware": 100,
}
Merge the entries into an existing DOWNLOADER_MIDDLEWARES dictionary instead of overwriting unrelated middleware keys.
Replace myproject with the Scrapy project module name.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://app.internal.example:8000/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"proxy": "http://proxy.example.net:8888"},
            )
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://app.internal.example:8000/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"dont_proxy": True},
            )
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://app.internal.example:8000/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={
                    "User-Agent": "Mozilla/5.0",
                    "Referer": "http://app.internal.example:8000/",
                },
            )
$ scrapy shell -s HTTPCACHE_ENABLED=False "http://app.internal.example:8000/headers"
Proxies that do not support tunneling may fail on HTTPS targets, typically surfacing as timeouts or "Tunnel connection failed" errors.
>>> fetch("http://app.internal.example:8000/headers", meta={"proxy": "http://proxy.example.net:8888"})
>>> import json
>>> json.loads(response.text)["headers"]["Via"]
'1.1 tinyproxy (tinyproxy/1.11.1)'