CSS selectors are the quickest way to target HTML elements in a Scrapy response when the page structure is already visible in browser developer tools or the returned markup. Testing selectors before adding them to a spider keeps extraction rules shorter and makes layout regressions easier to spot.
Each HTML response exposes response.css(), which returns a SelectorList that can be narrowed further or converted into values with get() and getall(). Scrapy extends browser-style CSS with ::text for text nodes and ::attr(name) for attribute values, so the selector tested in scrapy shell can usually move into parse() with only minor cleanup code.
CSS selectors still operate on the downloaded response body, not a browser-rendered DOM, and Scrapy's pseudo-elements are specific to Scrapy and parsel rather than general browser CSS. A selector can also match an element whose text node is empty, so cleanup code should use defaults before calling methods such as strip().
Related: How to use Scrapy shell
Related: How to scrape an HTML table with Scrapy
Steps to use CSS selectors in Scrapy:
- Open a terminal in the Scrapy project directory or another working directory for a test spider.
$ cd /root/sg-work/example
- Start scrapy shell with a page that has predictable sample markup.
$ scrapy shell 'https://docs.scrapy.org/en/latest/_static/selectors-sample1.html' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
##### snipped #####
>>>
The shell exposes the fetched page as response, so selectors can be tested against the same response object a spider callback receives.
- Confirm the response body matches the expected page before extracting fields.
>>> response.css("title::text").get()
'Example website'
get() returns the first match or None when nothing matches, while getall() always returns a list.
- Extract repeated attribute values or text nodes from the matched elements.
>>> response.css("#images a::attr(href)").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css("#images a::text").getall()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
::attr(href) and ::text are pseudo-elements specific to Scrapy and parsel, so they work in Scrapy selectors but not in standard browser CSS selectors.
- Save the verified selectors in a spider so each matched link yields a cleaned label and an absolute URL.
- selectors_spider.py
import scrapy


class SelectorsSpider(scrapy.Spider):
    name = "selectors"
    start_urls = [
        "https://docs.scrapy.org/en/latest/_static/selectors-sample1.html",
    ]

    def parse(self, response):
        for link in response.css("#images a"):
            href = link.css("::attr(href)").get()
            yield {
                "label": link.css("::text").get(default="").strip(),
                "href": response.urljoin(href) if href else None,
            }
link.css("::text").get() can return None even when the element exists, so use default="" before strip() when text might be empty.
Related: How to create a Scrapy spider
- Run the spider and confirm the extracted items are populated with the expected values.
$ scrapy runspider --nolog --output -:json selectors_spider.py
[
{"label": "Name: My image 1", "href": "http://example.com/image1.html"},
{"label": "Name: My image 2", "href": "http://example.com/image2.html"},
{"label": "Name: My image 3", "href": "http://example.com/image3.html"},
{"label": "Name: My image 4", "href": "http://example.com/image4.html"},
{"label": "Name: My image 5", "href": "http://example.com/image5.html"}
]
response.urljoin() resolves the relative href values against the page base URL, so the saved item output is immediately usable outside the spider.
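The resolution step itself is standard-library behavior: response.urljoin() resolves against the response URL (honoring any base element on an HTML page), which can be sketched with urllib alone using a made-up base URL:

```python
from urllib.parse import urljoin

# Hypothetical base URL; the shell output above suggests the sample page
# resolves relative hrefs against an example.com base.
base = "http://example.com/"

relative = urljoin(base, "image1.html")
rooted = urljoin(base, "/img/a.png")
absolute = urljoin(base, "http://other.example/x")

print(relative)  # http://example.com/image1.html
print(rooted)    # http://example.com/img/a.png
print(absolute)  # already absolute, passes through unchanged
```

Because absolute hrefs pass through unchanged, calling response.urljoin() on every extracted link is safe even when a page mixes relative and absolute URLs.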
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
