CSS selectors provide a readable way to target HTML elements when building Scrapy spiders. Carrying the selectors verified in browser developer tools straight into extraction code keeps scraping logic easy to review and quick to iterate on. Clear selectors also reduce maintenance cost when the target page layout changes.
Scrapy wraps each response in parsel-backed selector helpers exposed as response.css and response.xpath. A CSS selector call returns a SelectorList, and pseudo-elements like ::text or ::attr() extract text or attribute values from the matched nodes. Scoping selectors to a parent element keeps extraction consistent and prevents unrelated nodes from being mixed into the same field.
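The same behavior can be exercised outside a spider with parsel, the library behind response.css. The following is a minimal sketch against a made-up HTML snippet, not a real response:

from parsel import Selector

# Made-up markup mirroring the product listing used in the steps below.
html = """
<article class="product">
  <h2>Starter Plan</h2>
  <a href="/products/starter-plan.html">Details</a>
</article>
"""

sel = Selector(text=html)
products = sel.css("article.product")       # a SelectorList, not a bare node
print(products.css("h2::text").get())       # 'Starter Plan'
print(products.css("a::attr(href)").get())  # '/products/starter-plan.html'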
Selectors rely on the target HTML structure, so stable attributes like IDs and data-* generally outlive class names meant for styling. Scrapy’s CSS selector support is translated to XPath by the underlying selector engine, so browser-only selector features might not behave the same or be available. Validating selectors against real responses before running a large crawl prevents empty fields and reduces retries caused by brittle selectors.
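To see that translation concretely, parsel exposes a css2xpath helper; this quick check assumes parsel is importable in the spider's environment:

from parsel import css2xpath

# Prints the XPath expression the CSS query is translated into;
# the exact output depends on the installed parsel/cssselect versions.
print(css2xpath("article.product h2::text"))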
Related: How to use Scrapy shell
Related: How to scrape an HTML table with Scrapy
Steps to use CSS selectors in Scrapy:
- Test the intended selectors against a representative URL in scrapy shell.
$ scrapy shell 'http://app.internal.example:8000/products/'
In [1]: len(response.css("article.product"))
Out[1]: 3

In [2]: response.css("article.product h2::text").get()
Out[2]: 'Starter Plan'

In [3]: response.css("article.product a::attr(href)").get()
Out[3]: '/products/starter-plan.html'

In [4]: response.urljoin('/products/starter-plan.html')
Out[4]: 'http://app.internal.example:8000/products/starter-plan.html'

::text extracts text nodes and ::attr(href) extracts an attribute value; get() returns the first match while getall() returns a list.
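Where a field repeats across items, getall() collects every match into a list. A possible continuation of the same session, keeping to the values already seen above:

In [5]: titles = response.css("article.product h2::text").getall()

In [6]: len(titles)
Out[6]: 3

In [7]: titles[:2]
Out[7]: ['Starter Plan', 'Team Plan']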
- Open the spider file for the target site.
$ vi example/spiders/products.py
- Add scoped CSS selectors in the parse method using ::text and ::attr() extractions.
def parse(self, response):
    # Scope each selector to one product card so fields stay aligned.
    for product in response.css("article.product"):
        href = product.css("a::attr(href)").get()
        yield {
            "name": product.css("h2::text").get(default="").strip(),
            "price": product.css("span.price::text").get(default="").strip(),
            # urljoin() resolves the relative href against the response URL.
            "url": response.urljoin(href) if href else None,
        }
Calling strip() on a missing selector result raises AttributeError, because get() returns None when nothing matches; use get(default="") or check for None before normalizing text.
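The pitfall is easy to reproduce in isolation. A minimal sketch with a made-up selector that matches nothing:

from parsel import Selector

sel = Selector(text="<article class='product'></article>")
sel.css("h2::text").get()              # None: no h2 in the markup
sel.css("h2::text").get(default="")    # '': safe to call .strip() on
# sel.css("h2::text").get().strip()    # would raise AttributeError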
- Run the spider to export items to a JSON file.
$ scrapy crawl products -O products.json
2026-01-01 06:35:58 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: products.json
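The -O flag overwrites the output file on each run. As an alternative, the same export can be configured in the project settings through the FEEDS setting; this sketch assumes Scrapy 2.4 or later, where the overwrite option is available:

# settings.py
FEEDS = {
    "products.json": {
        "format": "json",
        "overwrite": True,  # mirrors the -O flag's behavior
    },
}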
- Review the JSON output to confirm populated fields, including absolute URLs.
$ head -n 5 products.json
[
{"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
##### snipped #####
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
