Password-protected pages block crawlers until an authenticated session exists, which prevents scraping account-only content and can cause endless redirects back to the login form. Handling login inside a Scrapy spider enables extraction from member areas, dashboards, and customer portals while keeping requests aligned with normal browser behavior.

Most sites authenticate by accepting credentials via an HTTP form POST and returning a session cookie. Scrapy can submit the same form with FormRequest and maintain the resulting cookies automatically through its cookie middleware, so later requests reuse the logged-in session.

Login flows often include hidden inputs and CSRF tokens, and some sites add CAPTCHAs or multi-factor prompts that cannot be completed with a basic password request. Credentials should be supplied without committing them to source control, and the crawl must respect rate limits and account lockout policies to avoid disabling the account.

Steps to authenticate with a password in Scrapy:

  1. Create a new Scrapy project for the authenticated crawler.
    $ scrapy startproject authlogin
    New Scrapy project 'authlogin', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
        /root/sg-work/authlogin
    ##### snipped #####
  2. Change into the project directory.
    $ cd authlogin
  3. Generate a spider skeleton for the target domain.
    $ scrapy genspider login app.internal.example
    Created spider 'login' using template 'basic' in module:
      authlogin.spiders.login
  4. Inspect the login form input names using scrapy shell.
    $ scrapy shell "http://app.internal.example:8000/login"
    >>> response.css('form input::attr(name)').getall()
    ['csrf_token', 'email', 'password']
    >>> response.css('form::attr(action)').get()
    '/login'
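    The hidden CSRF value can also be read in the same shell session to confirm the token is actually rendered into the form (the returned string varies per session):
    >>> response.css('input[name="csrf_token"]::attr(value)').get()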
  5. Replace the generated spider in authlogin/spiders/login.py with a login-first spider that submits the password form.
    import scrapy
    from scrapy.exceptions import CloseSpider
    from scrapy.http import FormRequest
     
     
    class LoginSpider(scrapy.Spider):
        name = "login"
        allowed_domains = ["app.internal.example"]
     
        login_url = "http://app.internal.example:8000/login"
        protected_url = "http://app.internal.example:8000/account"
     
        custom_settings = {
            "COOKIES_ENABLED": True,
        }
     
        def __init__(self, email=None, password=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Fail fast on missing credentials; CloseSpider is reserved for spider callbacks.
            if not email or not password:
                raise ValueError("Missing spider arguments: -a email=... -a password=...")
            self.email = email
            self.password = password
     
        def start_requests(self):
            yield scrapy.Request(self.login_url, callback=self.parse_login, dont_filter=True)
     
        def parse_login(self, response):
            return FormRequest.from_response(
                response,
                formdata={
                    "email": self.email,
                    "password": self.password,
                },
                callback=self.after_login,
                dont_filter=True,
            )
     
        def after_login(self, response):
            yield scrapy.Request(
                self.protected_url,
                callback=self.parse_protected,
                dont_filter=True,
            )
     
        def parse_protected(self, response):
            if "login" in response.url.lower():
                raise CloseSpider("Login failed (protected URL redirected to login page).")
            yield {
                "url": response.url,
                "page_title": response.css("title::text").get(default="").strip(),
            }
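
    A login failure can also be caught immediately after the form POST, before the protected page is requested. A minimal sketch of an alternative after_login; it assumes a failed login re-renders the form, so the password field is still present in the response:

    def after_login(self, response):
        # If the POST re-rendered the login form, the credentials were rejected.
        if response.css('input[name="password"]'):
            raise CloseSpider("Login failed (login form was re-rendered after POST).")
        yield scrapy.Request(
            self.protected_url,
            callback=self.parse_protected,
            dont_filter=True,
        )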
  6. Update the spider URLs and form field names to match the target site.

    Keys in formdata must match the name attributes of the login form inputs (for example, username versus email), and the protected URL should be a page that reliably requires authentication. FormRequest.from_response resubmits hidden inputs found in the HTML form (commonly the CSRF token), and when multiple forms exist on the page a specific one can be selected with parameters such as formid, formcss, or formnumber, as sketched below.
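
    The id "login-form" in the sketch is an assumption about the page markup, not taken from the target site:

    # Inside parse_login: pin down the login form explicitly instead of relying on
    # the default first form. formcss or formnumber work the same way.
    return FormRequest.from_response(
        response,
        formid="login-form",
        formdata={"email": self.email, "password": self.password},
        callback=self.after_login,
        dont_filter=True,
    )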

  7. Run the spider with credentials passed as spider arguments.
    $ scrapy crawl login -a email="account@example.com" -a password="correct-horse-battery-staple" -O items.json
    2026-01-01 09:41:42 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json

    Passwords passed on the command line may be stored in shell history or visible to other users via process listings.
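
    Credentials can instead be read from environment variables so they are not part of the scrapy command line. A minimal sketch, assuming the variables AUTH_EMAIL and AUTH_PASSWORD are set before the crawl (both names are illustrative):

    # In authlogin/spiders/login.py, fall back to environment variables
    # when the -a arguments are not supplied.
    import os

    def __init__(self, email=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.email = email or os.environ.get("AUTH_EMAIL")
        self.password = password or os.environ.get("AUTH_PASSWORD")
        if not self.email or not self.password:
            raise ValueError("Supply credentials via -a arguments or AUTH_EMAIL/AUTH_PASSWORD.")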

  8. Verify the exported item output includes the expected protected URL and page title.
    $ python -m json.tool items.json
    [
        {
            "url": "http://app.internal.example:8000/account",
            "page_title": "Example Account"
        }
    ]
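
    Because the session cookie persists for the rest of the crawl, parse_protected can also follow further member-only links instead of yielding a single item. A minimal sketch; the bare "a" selector is illustrative and should be narrowed to the links of interest (and should avoid any logout link):

    def parse_protected(self, response):
        if "login" in response.url.lower():
            raise CloseSpider("Login failed (protected URL redirected to login page).")
        yield {
            "url": response.url,
            "page_title": response.css("title::text").get(default="").strip(),
        }
        # Each followed request is sent with the same authenticated session cookie.
        yield from response.follow_all(css="a", callback=self.parse_protected)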