Password-protected pages block crawlers until an authenticated session exists, which prevents scraping account-only content and can cause endless redirects back to the login form. Handling login inside a Scrapy spider enables extraction from member areas, dashboards, and customer portals while keeping requests aligned with normal browser behavior.
Most sites authenticate by accepting credentials via an HTTP form POST and returning a session cookie. Scrapy can submit the same form with FormRequest and maintain the resulting cookies automatically through its cookie middleware, so later requests reuse the logged-in session.
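To confirm that the session cookie is actually being stored and resent, Scrapy's cookie handling can be surfaced in the logs. This is a minimal settings sketch using two standard Scrapy settings (`COOKIES_ENABLED` is already on by default; `COOKIES_DEBUG` logs every `Set-Cookie` received and every `Cookie` header sent):

```python
# settings.py -- optional settings for inspecting the login session.
# COOKIES_ENABLED is True by default; listed here only for clarity.
COOKIES_ENABLED = True
# Log received Set-Cookie headers and outgoing Cookie headers, which
# makes it easy to verify the session cookie survives across requests.
COOKIES_DEBUG = True
```

With `COOKIES_DEBUG` on, the crawl log shows the session cookie being set after the login POST and attached to every later request.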
Login flows often include hidden inputs and CSRF tokens, and some sites add CAPTCHAs or multi-factor prompts that cannot be completed with a basic password request. Credentials should be supplied without committing them to source control, and the crawl must respect rate limits and account lockout policies to avoid disabling the account.
$ scrapy startproject authlogin
New Scrapy project 'authlogin', using template directory '/usr/lib/python3/dist-packages/scrapy/templates/project', created in:
/root/sg-work/authlogin
##### snipped #####
$ cd authlogin
$ scrapy genspider login app.internal.example
Created spider 'login' using template 'basic' in module:
  authlogin.spiders.login
$ scrapy shell "http://app.internal.example:8000/login"
>>> response.css('form input::attr(name)').getall()
['csrf_token', 'email', 'password']
>>> response.css('form::attr(action)').get()
'/login'
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["app.internal.example"]
    login_url = "http://app.internal.example:8000/login"
    protected_url = "http://app.internal.example:8000/account"
    custom_settings = {
        "COOKIES_ENABLED": True,
    }

    def __init__(self, email=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not email or not password:
            # CloseSpider is only handled inside callbacks; a plain
            # exception is the correct way to abort from __init__.
            raise ValueError("Missing spider arguments: -a email=... -a password=...")
        self.email = email
        self.password = password

    def start_requests(self):
        # Fetch the login page first so the form (and its CSRF token)
        # can be read from the live HTML.
        yield scrapy.Request(self.login_url, callback=self.parse_login, dont_filter=True)

    def parse_login(self, response):
        # from_response carries over hidden inputs such as csrf_token.
        return FormRequest.from_response(
            response,
            formdata={
                "email": self.email,
                "password": self.password,
            },
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        # The session cookie is now stored; request the protected page.
        yield scrapy.Request(
            self.protected_url,
            callback=self.parse_protected,
            dont_filter=True,
        )

    def parse_protected(self, response):
        # A redirect back to the login form means authentication failed.
        if "login" in response.url.lower():
            raise CloseSpider("Login failed (protected URL redirected to login page).")
        yield {
            "url": response.url,
            "page_title": response.css("title::text").get(default="").strip(),
        }
Keys in formdata must match the login form input name attributes (for example username vs email), and the protected URL should be a page that reliably requires authentication. FormRequest.from_response keeps hidden inputs found in the HTML form (common for CSRF tokens), and a specific form can be selected with parameters such as formid or formnumber when multiple forms exist on the page.
$ scrapy crawl login -a email="account@example.com" -a password="correct-horse-battery-staple" -O items.json
2026-01-01 09:41:42 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
Passwords passed on the command line may be stored in shell history or visible to other users via process listings.
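A safer pattern is to read credentials from environment variables and fall back to them when the `-a` arguments are absent. The variable names `AUTHLOGIN_EMAIL` and `AUTHLOGIN_PASSWORD` below are assumptions chosen for this project, not Scrapy conventions:

```python
import os

def load_credentials(env=os.environ):
    """Fetch login credentials from environment variables.

    AUTHLOGIN_EMAIL / AUTHLOGIN_PASSWORD are illustrative names.
    """
    email = env.get("AUTHLOGIN_EMAIL")
    password = env.get("AUTHLOGIN_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set AUTHLOGIN_EMAIL and AUTHLOGIN_PASSWORD")
    return email, password
```

The spider's `__init__` could call this helper when `email` or `password` is None, so the crawl runs as `AUTHLOGIN_EMAIL=... AUTHLOGIN_PASSWORD=... scrapy crawl login` with nothing secret on the command line itself.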
$ python -m json.tool items.json
[
{
"account_name": "Example Account",
"url": "http://app.internal.example:8000/account"
}
]