Password-protected pages block crawlers until an authenticated session exists, which prevents scraping account-only content and can cause endless redirects back to the login form. Handling login inside a Scrapy spider enables extraction from member areas, dashboards, and customer portals while keeping requests aligned with normal browser behavior.
Most sites authenticate by accepting credentials via an HTTP form POST and returning a session cookie. Scrapy can submit the same form with FormRequest and maintain the resulting cookies automatically through its cookie middleware, so later requests reuse the logged-in session.
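As a minimal sketch of that flow (the URLs and field names here are placeholders, not taken from any real site), a spider can POST the credentials and then request a member-only page, with the cookie middleware attaching the session cookie from the login response to the follow-up request:

import scrapy
from scrapy.http import FormRequest


class SessionSketch(scrapy.Spider):
    name = "session_sketch"

    def start_requests(self):
        # POST the credentials as form data; the login response sets a session cookie.
        yield FormRequest(
            "https://site.example/login",
            formdata={"username": "user", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The cookie middleware reuses the stored session cookie on this request.
        yield scrapy.Request("https://site.example/members", callback=self.parse_members)

    def parse_members(self, response):
        yield {"title": response.css("title::text").get()}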
Login flows often include hidden inputs and CSRF tokens, and some sites add CAPTCHAs or multi-factor prompts that cannot be completed with a basic password request. Credentials should be supplied without committing them to source control, and the crawl must respect rate limits and account lockout policies to avoid disabling the account.
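When rate limits matter, throttling can be configured in the project's settings.py; the values below are illustrative, not recommendations tied to any particular site:

# settings.py -- illustrative throttling values to keep an authenticated
# crawl slow enough to avoid tripping rate limits or account lockout.
DOWNLOAD_DELAY = 2                    # wait between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain
AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed response times
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0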
Steps to authenticate with a password in Scrapy:
- Create a new Scrapy project for the authenticated crawler.
$ scrapy startproject authlogin
New Scrapy project 'authlogin', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
    /root/sg-work/authlogin
##### snipped #####
- Change into the project directory.
$ cd authlogin
- Generate a spider skeleton for the target domain.
$ scrapy genspider login app.internal.example
Created spider 'login' using template 'basic' in module:
  authlogin.spiders.login
- Inspect the login form input names using scrapy shell.
$ scrapy shell "http://app.internal.example:8000/login"
>>> response.css('form input::attr(name)').getall()
['csrf_token', 'email', 'password']
>>> response.css('form::attr(action)').get()
'/login'
- Replace the generated spider in authlogin/spiders/login.py with a login-first spider that submits the password form.
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["app.internal.example"]
    login_url = "http://app.internal.example:8000/login"
    protected_url = "http://app.internal.example:8000/account"
    custom_settings = {
        "COOKIES_ENABLED": True,
    }

    def __init__(self, email=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not email or not password:
            raise CloseSpider("Missing spider arguments: -a email=... -a password=...")
        self.email = email
        self.password = password

    def start_requests(self):
        # Fetch the login page first so the form (and any CSRF token) can be read.
        yield scrapy.Request(self.login_url, callback=self.parse_login, dont_filter=True)

    def parse_login(self, response):
        # Submit the login form; from_response carries over hidden inputs such as csrf_token.
        return FormRequest.from_response(
            response,
            formdata={
                "email": self.email,
                "password": self.password,
            },
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        # The session cookie set by the login response is reused automatically.
        yield scrapy.Request(
            self.protected_url,
            callback=self.parse_protected,
            dont_filter=True,
        )

    def parse_protected(self, response):
        # A redirect back to the login page means authentication did not succeed.
        if "login" in response.url.lower():
            raise CloseSpider("Login failed (protected URL redirected to login page).")
        yield {
            "url": response.url,
            "page_title": response.css("title::text").get(default="").strip(),
        }
- Update the spider URLs and form field names to match the target site.
Keys in formdata must match the login form input name attributes (for example username vs email), and the protected URL should be a page that reliably requires authentication. FormRequest.from_response keeps hidden inputs found in the HTML form (common for CSRF tokens), and a specific form can be selected with parameters such as formid or formnumber when multiple forms exist on the page.
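As an illustration only (the form id and the username field name below are hypothetical), parse_login could pin the submission to one specific form on the page:

    def parse_login(self, response):
        # Hypothetical variant: select the form with id="login-form" and use
        # "username" instead of "email" as the credential field name.
        return FormRequest.from_response(
            response,
            formid="login-form",          # or formnumber=N to select by position
            formdata={
                "username": self.email,   # keys must match the input name attributes
                "password": self.password,
            },
            callback=self.after_login,
            dont_filter=True,
        )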
- Run the spider with credentials passed as spider arguments.
$ scrapy crawl login -a email="account@example.com" -a password="correct-horse-battery-staple" -O items.json
2026-01-01 09:41:42 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
Passwords passed on the command line may be stored in shell history or visible to other users via process listings.
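One way to reduce that exposure is to let the spider fall back to environment variables when the -a arguments are omitted. The sketch below is an optional variation, not part of the generated project, and the variable names SPIDER_EMAIL and SPIDER_PASSWORD are arbitrary; add import os near the top of authlogin/spiders/login.py and replace __init__ with:

    def __init__(self, email=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Fall back to environment variables (hypothetical names) so the
        # password does not have to appear on the command line.
        self.email = email or os.environ.get("SPIDER_EMAIL")
        self.password = password or os.environ.get("SPIDER_PASSWORD")
        if not self.email or not self.password:
            raise CloseSpider("Provide credentials via -a arguments or SPIDER_EMAIL/SPIDER_PASSWORD.")

The variables can then be exported in the shell session, or loaded from a file kept out of version control, before running scrapy crawl login without the -a arguments.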
- Verify the exported item output includes the expected protected URL and page title.
$ python -m json.tool items.json
[
    {
        "url": "http://app.internal.example:8000/account",
        "page_title": "Example Account"
    }
]
