Submitting an HTML form is often the only way to reach search results, filtered listings, member-only pages, and other responses that do not appear as normal crawlable links. A spider that can post the same fields as the browser can move past the form page and scrape the response that actually contains the target data.
HTML forms send named fields to the form's action URL with either GET or POST, and many include hidden inputs such as CSRF tokens, pagination state, or the clicked submit button's value. Scrapy's FormRequest.from_response() builds the next request from the live form markup, so those fields stay populated while only the search or filter values are overridden.
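For contrast, here is a minimal sketch of building the same kind of request by hand with plain FormRequest, assuming the hidden csrf_token value has already been scraped from the form page; from_response() carries that field over automatically, while the manual version must supply it explicitly:

from scrapy.http import FormRequest

def build_search_request(token, callback):
    # token: hidden csrf_token value scraped from the form page by hand.
    # FormRequest defaults to POST and URL-encodes formdata into the body.
    return FormRequest(
        url="http://app.internal.example:8000/search",  # the form's action URL
        formdata={
            "csrf_token": token,  # must be carried over manually here
            "q": "laptop",
            "category": "all",
        },
        callback=callback,
    )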
The form selector, field names, and result selectors must match the live page, and some workflows still fail when the site requires JavaScript-generated values, CAPTCHA, or other browser-only interactions. Test only forms that are safe to automate, and treat account changes, checkout flows, unsubscribe forms, or other state-changing endpoints as unsafe until the target behavior is confirmed.
Steps to submit a form in a Scrapy spider:
- Inspect the live form in scrapy shell so the spider uses the correct action URL and field names.
$ scrapy shell "http://app.internal.example:8000/search" --nolog
>>> response.css('form#search-form input[name]::attr(name), form#search-form select[name]::attr(name)').getall()
['csrf_token', 'q', 'category']
>>> response.css('form#search-form::attr(action)').get()
'/search'
>>> response.css('form#search-form input[name="csrf_token"]::attr(value)').get()
'csrf-demo-123'
- Replace simplifiedguide/spiders/search.py with a spider that submits the form built from the response.
import scrapy
from scrapy.http import FormRequest

class SearchSpider(scrapy.Spider):
    name = "search"
    start_urls = ["http://app.internal.example:8000/search"]

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formcss="form#search-form",
            formdata={
                "q": "laptop",
                "category": "all",
            },
            callback=self.parse_results,
        )

    def parse_results(self, response):
        for product in response.css(".product"):
            yield {
                "name": product.css(".name::text").get(default="").strip(),
                "price": product.css(".price::text").get(default="").strip(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }
FormRequest.from_response() keeps hidden inputs from the selected form unless they are overridden. Use formcss, formid, formname, or formxpath to target the correct form, add clickdata when a specific submit button value must be sent, and set dont_click=True if the automatic click adds the wrong payload.
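As a sketch of those options, assuming a hypothetical page variant whose form has id="search-form" and two submit buttons named action (neither button exists in the demo form above):

    def parse(self, response):
        # Sketch: choose which submit button's name/value pair is sent,
        # assuming <button name="action" value="search"> and
        # <button name="action" value="reset"> exist in the form.
        yield FormRequest.from_response(
            response,
            formid="search-form",                             # target the form by id
            formdata={"q": "laptop"},
            clickdata={"name": "action", "value": "search"},  # send action=search
            callback=self.parse_results,
        )
        # Pass dont_click=True instead to submit without any button payload.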
Submitting non-idempotent forms can change remote state, so test only against a safe search, filter, or other read-oriented endpoint until the exact request behavior is known.
- Update the formdata keys and the selectors in parse_results() so they match the target site's real name attributes and result markup exactly.
Visible labels and placeholder text do not matter to Scrapy here; the request body is built from the form control name values in the HTML.
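The constructed body can be checked in scrapy shell before running the spider; the exact bytes below are illustrative and depend on the live form's field order and token value:

>>> from scrapy.http import FormRequest
>>> req = FormRequest.from_response(response, formcss="form#search-form", formdata={"q": "laptop"})
>>> req.body
b'csrf_token=csrf-demo-123&q=laptop&category=all'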
- Run the spider and export the submitted-response items to JSON.
$ scrapy crawl search -O results.json
2026-04-16 06:04:35 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: results.json
- Open the exported file and confirm the items came from the form response.
$ cat results.json
[
    {"name": "Laptop Starter", "price": "$499", "url": "http://app.internal.example:8000/products/starter-plan.html"},
    {"name": "Laptop Team", "price": "$899", "url": "http://app.internal.example:8000/products/team-plan.html"}
]
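To reuse the spider for other searches without editing the file, the query can be passed as a spider argument; a minimal sketch, assuming a hypothetical -a query=... argument that falls back to "laptop":

import scrapy
from scrapy.http import FormRequest

class SearchSpider(scrapy.Spider):
    name = "search"
    start_urls = ["http://app.internal.example:8000/search"]

    def parse(self, response):
        # scrapy crawl search -a query=tablet sets self.query on the spider.
        yield FormRequest.from_response(
            response,
            formcss="form#search-form",
            formdata={"q": getattr(self, "query", "laptop")},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        ...  # identical to the version above

Run it as, for example: $ scrapy crawl search -a query=tablet -O results.json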
