Cookies preserve session state across requests, enabling access to pages that require authentication, region selection, or other server-side preferences. Supplying the right cookies can turn a blocked crawl into a predictable, repeatable scrape.

Scrapy sends cookies through its built-in CookiesMiddleware, which builds the standard Cookie header for each Request and stores any Set-Cookie values returned by the server. Cookies can be supplied per request with the cookies= argument, and the cookiejar meta key keeps multiple independent sessions separate when scraping more than one account in a single run.
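
When more than one account must stay signed in during a single run, each session can be routed through its own jar; a minimal sketch, where the spider name and session values are placeholders:

    import scrapy


    class TwoAccountsSpider(scrapy.Spider):
        name = "two_accounts"

        def start_requests(self):
            # Hypothetical session ids; each cookiejar id isolates one session.
            sessions = {1: "abc123", 2: "def456"}
            for jar_id, session_id in sessions.items():
                yield scrapy.Request(
                    url="http://app.internal.example:8000/account",
                    cookies={"sessionid": session_id},
                    meta={"cookiejar": jar_id},
                    dont_filter=True,  # same URL twice, so skip the dupe filter
                    callback=self.parse,
                )

        def parse(self, response):
            # The cookiejar key is not inherited automatically; pass it along
            # in meta on any follow-up requests made from here.
            yield {"jar": response.meta["cookiejar"], "url": response.url}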

Session cookies frequently expire, and logging or committing them can expose access to private data. Cookie scope also matters: some sites require a specific Domain or Path attribute, which calls for the extended list-of-dictionaries cookie format instead of a simple name-to-value dictionary (see step 3).

Steps to use cookies in Scrapy:

  1. Open the spider file used for session-protected pages.
    $ vi simplifiedguide/spiders/account.py
  2. Export the session cookie value as an environment variable.
    $ export SCRAPY_SESSIONID='abc123'

    Real session cookies grant account access; avoid committing them to version control or exposing them via shell history, logs, or crash reports.

  3. Create the authenticated request in the spider using the cookie value from the environment.
    import os

    import scrapy


    class AccountSpider(scrapy.Spider):
        name = "account"
        start_urls = ["http://app.internal.example:8000/account"]

        def start_requests(self):
            # Read the session cookie from the environment so the secret
            # never lands in the source tree or version control.
            session_id = os.environ.get("SCRAPY_SESSIONID")
            if not session_id:
                raise RuntimeError("SCRAPY_SESSIONID is not set")
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    cookies={"sessionid": session_id},  # becomes the Cookie header
                    meta={"cookiejar": 1},  # keep this session in its own jar
                    callback=self.parse_account,
                )

        def parse_account(self, response):
            # Fields that only render once the session cookie is accepted.
            yield {
                "account_name": response.css("h1::text").get(),
                "url": response.url,
            }
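
    Because session cookies frequently expire, the parse_account callback above can also guard against a silent fall-through to the login page; a sketch, assuming the site redirects expired sessions to a /login path:

    from scrapy.exceptions import CloseSpider

    def parse_account(self, response):
        # A redirect to the login page means the sessionid was rejected;
        # stop the crawl instead of exporting empty fields.
        if "/login" in response.url:
            raise CloseSpider("session cookie expired; refresh SCRAPY_SESSIONID")
        yield {
            "account_name": response.css("h1::text").get(),
            "url": response.url,
        }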

    Use the list-of-dictionaries cookie format when Domain or Path must be set: cookies=[{"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"}].
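
    In the spider from step 3, the same request would carry the extended form like this; a sketch, where the domain value is an assumption about the target host:

    yield scrapy.Request(
        url=url,
        cookies=[
            {
                "name": "sessionid",
                "value": session_id,
                "domain": "app.internal.example",  # assumed cookie scope
                "path": "/",
            }
        ],
        meta={"cookiejar": 1},
        callback=self.parse_account,
    )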

  4. Run the spider with COOKIES_DEBUG enabled to confirm the Cookie header is sent.
    $ scrapy crawl account -O account.json -s COOKIES_DEBUG=True -s LOG_LEVEL=DEBUG -s HTTPCACHE_ENABLED=False
    2026-01-01 08:48:49 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://app.internal.example:8000/account>
    Cookie: sessionid=abc123
    2026-01-01 08:48:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/account> (referer: None)
    2026-01-01 08:48:49 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: account.json

    COOKIES_DEBUG can print sensitive cookie values into logs; disable it after validation.
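
    One way to keep the flag from lingering is to scope it to the spider and tie it to an explicit opt-in; a sketch, where SCRAPY_DEBUG_COOKIES is a hypothetical variable name used only for local debugging:

    import os

    import scrapy


    class AccountSpider(scrapy.Spider):
        name = "account"
        # COOKIES_DEBUG stays off unless the hypothetical opt-in variable
        # is set to 1 for the current run.
        custom_settings = {
            "COOKIES_DEBUG": os.environ.get("SCRAPY_DEBUG_COOKIES") == "1",
        }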

  5. Inspect the exported data to confirm the protected content is present.
    $ python3 -m json.tool account.json
    [
        {
            "account_name": "Example Account",
            "url": "http://app.internal.example:8000/account"
        }
    ]
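
    For repeated runs, the same check can be scripted so a missing account_name fails loudly; a sketch that matches the fields exported by the spider above:

    import json
    import sys

    with open("account.json") as f:
        items = json.load(f)

    # An empty feed or a null account_name means the protected content was
    # never reached; exit non-zero so cron or CI notices the failure.
    if not items or not items[0].get("account_name"):
        sys.exit("account.json is missing account_name; check the session cookie")
    print(f"OK: {len(items)} item(s); first account: {items[0]['account_name']}")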