Password-protected pages block a crawl until the spider can establish the same authenticated session that a browser would hold after a successful sign-in. Handling that login step inside Scrapy turns account dashboards, member-only listings, and private detail pages from redirect loops into normal, parseable responses.

A typical Scrapy login flow first requests the login page, then submits the live form with FormRequest.from_response(), which keeps the form action, field names, and any server-filled values aligned with what the site actually serves. Once the site sets the session cookie, later requests in the same crawl reuse that authenticated cookie jar automatically through CookiesMiddleware.
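
CookiesMiddleware is enabled by default, so the login cookie persists with no extra configuration. While a login is being debugged it helps to watch the cookies on the wire; a minimal sketch of the relevant settings (both are standard Scrapy settings, and COOKIES_DEBUG is best switched off once the login works):

    # settings.py
    COOKIES_ENABLED = True  # the default; stores and resends session cookies within the crawl
    COOKIES_DEBUG = True    # log every Cookie/Set-Cookie header while verifying the login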

Plain password logins still vary in selector names, redirect behavior, and what counts as proof of a successful login, and some targets add hidden CSRF fields, JavaScript-generated payloads, CAPTCHAs, or multi-factor prompts that call for a different workflow. Pass credentials without committing them to source control, use a disposable test account when possible, and treat repeated login failures as a cue to re-inspect the live form rather than retrying the same payload blindly.

Steps to authenticate with a password in Scrapy:

  1. Create a new Scrapy project for the authenticated spider.
    $ scrapy startproject accountcrawl
    New Scrapy project 'accountcrawl', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
        /home/user/accountcrawl
    ##### snipped #####
  2. Change into the project directory.
    $ cd accountcrawl
  3. Generate a spider skeleton for the login domain.
    $ scrapy genspider account members.example.com
    Created spider 'account' using template 'basic' in module:
      accountcrawl.spiders.account
  4. Inspect the live login form in scrapy shell so the spider uses the correct selector and field names.
    $ scrapy shell "https://members.example.com/login" --nolog
    >>> response.css('form#login-form input::attr(name)').getall()
    ['email', 'password']
    >>> response.css('form#login-form::attr(action)').get()
    '/login'

    FormRequest.from_response() only works cleanly when the spider targets the right form, so inspect the live page first and use formcss, formid, or formname when more than one form is present. Related: How to submit a form in a Scrapy spider
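
    The same shell session can build the submission and show exactly what would be posted before any spider code exists; a sketch assuming the two-field form above (the printed URL and body are what a form with action="/login" and email/password inputs would typically produce):

    >>> from scrapy.http import FormRequest
    >>> fr = FormRequest.from_response(
    ...     response,
    ...     formcss="form#login-form",
    ...     formdata={"email": "test@example.com", "password": "not-a-real-password"},
    ... )
    >>> fr.url
    'https://members.example.com/login'
    >>> fr.body
    b'email=test%40example.com&password=not-a-real-password'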

  5. Replace accountcrawl/spiders/account.py with a login-first spider that submits the password form and follows an authenticated page.
    import scrapy
    from scrapy.exceptions import CloseSpider
    from scrapy.http import FormRequest


    class AccountSpider(scrapy.Spider):
        name = "account"
        allowed_domains = ["members.example.com"]

        login_url = "https://members.example.com/login"
        account_url = "https://members.example.com/account"

        async def start(self):
            # Credentials arrive as spider arguments: -a email=... -a password=...
            if not getattr(self, "email", None) or not getattr(self, "password", None):
                raise CloseSpider("Pass -a email=... -a password=...")

            # Fetch the login page first so from_response() can read the live form.
            yield scrapy.Request(
                self.login_url,
                callback=self.parse_login,
                dont_filter=True,
            )

        def parse_login(self, response):
            # Submit the real form; action, method, and hidden fields come from the page.
            yield FormRequest.from_response(
                response,
                formcss="form#login-form",
                formdata={
                    "email": self.email,
                    "password": self.password,
                },
                callback=self.after_login,
                dont_filter=True,
            )

        def after_login(self, response):
            # A failed login usually re-renders or redirects back to the login URL.
            if "login" in response.url.lower():
                raise CloseSpider("Login failed; the response stayed on the login page.")

            # The session cookie is in the jar now, so protected pages are reachable.
            yield response.follow(
                self.account_url,
                callback=self.parse_account,
                dont_filter=True,
            )

        def parse_account(self, response):
            # A redirect back to login here means the session was not accepted.
            if "login" in response.url.lower():
                raise CloseSpider("Protected page redirected back to login.")

            yield {
                "account_name": response.css("h1::text").get(default="").strip(),
                "status": response.css(".status::text").get(default="").strip(),
                "url": response.url,
            }

    Current Scrapy releases use async def start() as the primary entry point, and a compatibility start_requests() method is only needed when maintaining spiders for Scrapy versions older than 2.13.
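
    For those older versions, a sketch of the equivalent synchronous entry point, written as a drop-in replacement for the async def start() method above (same class, same attributes):

    def start_requests(self):
        # Scrapy < 2.13: identical logic to start(), just a plain generator.
        if not getattr(self, "email", None) or not getattr(self, "password", None):
            raise CloseSpider("Pass -a email=... -a password=...")
        yield scrapy.Request(
            self.login_url,
            callback=self.parse_login,
            dont_filter=True,
        )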

  6. Update the URLs, form selector, field names, and protected-page selectors so they match the target site exactly.

    If the login page includes a hidden CSRF token or other hidden state, keep using FormRequest.from_response() and move to How to authenticate with a CSRF login form in Scrapy instead of stripping those fields out by hand.
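
    A quick shell check shows whether such hidden state is present before the spider runs (same command as step 4; the token field name in the output is only an example of what a CSRF-protected form might expose):

    $ scrapy shell "https://members.example.com/login" --nolog
    >>> response.css('form#login-form input[type=hidden]::attr(name)').getall()
    ['csrf_token']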

  7. Run the spider with the login credentials passed as spider arguments and overwrite the JSON export file.
    $ scrapy crawl account -a email="editor@example.com" -a password="correct-horse-battery-staple" --overwrite-output account.json
    2026-04-16 06:20:17 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: account.json

    Credentials passed on the command line can be written to shell history or exposed in process listings, so use a test account or move secret loading into environment variables or a secret manager for real crawls.
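
    A minimal sketch of the environment-variable approach (the variable names ACCOUNT_EMAIL and ACCOUNT_PASSWORD are arbitrary; the attributes keep the names the spider already checks, and -a arguments still work as a fallback):

    import os  # at the top of accountcrawl/spiders/account.py

    # Inside AccountSpider: prefer environment variables, fall back to -a arguments.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.email = os.environ.get("ACCOUNT_EMAIL", getattr(self, "email", None))
        self.password = os.environ.get("ACCOUNT_PASSWORD", getattr(self, "password", None))

    $ export ACCOUNT_EMAIL="editor@example.com" ACCOUNT_PASSWORD="correct-horse-battery-staple"
    $ scrapy crawl account --overwrite-output account.json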

  8. Open the exported file and confirm the item came from the protected page instead of the login form.
    $ cat account.json
    [
    {"account_name": "Example Account", "status": "Authenticated area", "url": "https://members.example.com/account"}
    ]
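
    A field-level check with jq catches a run that exported an empty item (assuming jq is installed; the key matches the item yielded in parse_account):

    $ jq '.[0].account_name' account.json
    "Example Account"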