CSRF-protected login pages block crawlers until the request includes the same hidden token and session context that a browser sends. A Scrapy spider that handles that token correctly can reach account pages, member-only listings, and other authenticated responses without stalling on repeated redirects back to the login screen.

Scrapy's FormRequest.from_response() builds a POST request from the live login form, so hidden inputs such as a csrf_token and the submit button's name/value pair carry over into the request automatically, while Scrapy's cookie middleware sends back the session cookie set on the login page. That keeps the login flow close to the browser path and avoids hard-coding one-time token values into the spider.

Token names, form selectors, and redirect behavior vary by site, and some logins add JavaScript-generated fields, CAPTCHA, or multi-factor prompts that a plain form POST cannot satisfy. Pass credentials without committing them to source control, and treat repeated login failures as a signal to re-check the live form instead of blindly retrying the same request.

Steps to authenticate with a CSRF login form in Scrapy:

  1. Inspect the login form in scrapy shell so the spider uses the correct form selector and field names.
    $ scrapy shell "https://app.internal.example/login" --nolog
    >>> response.css('form#login-form input::attr(name)').getall()
    ['csrf_token', 'username', 'password', 'submit']
    >>> response.css('form#login-form input[name="csrf_token"]::attr(value)').get()
    'csrf-9c2d1a4b'
  2. Replace authlogin/spiders/account.py with a login-first spider that posts the form built from the live response.
    import scrapy
    from scrapy.exceptions import CloseSpider
    from scrapy.http import FormRequest
     
    class AccountSpider(scrapy.Spider):
        name = "account"
        allowed_domains = ["app.internal.example"]
        login_url = "https://app.internal.example/login"
        account_url = "https://app.internal.example/account"
     
        async def start(self):
            if not getattr(self, "username", None) or not getattr(self, "password", None):
                raise CloseSpider("Pass -a username=... -a password=...")
            yield scrapy.Request(self.login_url, callback=self.parse_login, dont_filter=True)
     
        def parse_login(self, response):
            yield FormRequest.from_response(
                response,
                formcss="form#login-form",
                formdata={
                    "username": self.username,
                    "password": self.password,
                },
                headers={"Referer": response.url},
                callback=self.after_login,
                dont_filter=True,
            )
     
        def after_login(self, response):
            if response.css("form#login-form input[name='csrf_token']"):
                raise CloseSpider("Login failed; still on the login form.")
            yield response.follow(self.account_url, callback=self.parse_account, dont_filter=True)
     
        def parse_account(self, response):
            yield {
                "account_name": response.css("h1::text").get(default="").strip(),
                "url": response.url,
            }

    FormRequest.from_response() reuses hidden fields already present in the form, so the CSRF token does not need to be copied into formdata manually. Use formcss or formid when the page has multiple forms, add dont_click=True if the default submit-button click adds the wrong payload, and move the same initial request into start_requests() when maintaining an older pre-start() spider.

  3. Update the URLs, form selector, and credential field names to match the target application before running the spider.

    Some frameworks name the token field csrfmiddlewaretoken or authenticity_token instead of csrf_token, and the protected page should be a URL that reliably redirects anonymous users back to the login form.

  4. Run the spider with the credentials passed as spider arguments and export the authenticated page data to JSON.
    $ scrapy crawl account -a username="editor@example.com" -a password="correct-horse-battery-staple" -O account.json
    2026-04-16 05:35:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.internal.example/login> (referer: None)
    2026-04-16 05:35:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://app.internal.example/account> from <POST https://app.internal.example/login>
    2026-04-16 05:35:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://app.internal.example/account>
    {'account_name': 'Example Account', 'url': 'https://app.internal.example/account'}
    2026-04-16 05:35:20 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: account.json

    Credentials passed with -a password=… can be visible in shell history and process listings, so use a test account or move secret handling into environment variables or a secret store for real crawls.

  5. Open the exported file and confirm the item came from the authenticated page instead of the login form.
    $ cat account.json
    [
    {"account_name": "Example Account", "url": "https://app.internal.example/account"}
    ]
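
    The same check can be scripted so a scheduled crawl fails loudly when the export came from the login page. verify_export() is a helper introduced here, not part of Scrapy:

```python
import json
from pathlib import Path


def verify_export(path="account.json"):
    """Return True when the export exists and no item points at a login URL."""
    p = Path(path)
    if not p.exists():
        return False
    items = json.loads(p.read_text())
    return bool(items) and all("/login" not in item["url"] for item in items)


if __name__ == "__main__":
    print("ok" if verify_export() else "check failed")
```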