CSS selectors are the quickest way to target HTML elements in a Scrapy response when the page structure is already visible in browser developer tools or the returned markup. Testing selectors before adding them to a spider keeps extraction rules shorter and makes layout regressions easier to spot.

Each HTML response exposes response.css(), which returns a SelectorList that can be narrowed further or converted into values with get() and getall(). Scrapy extends browser-style CSS with ::text for text nodes and ::attr(name) for attribute values, so a selector tested in scrapy shell can usually move into parse() with only minor cleanup.
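
As a sketch of how those pieces combine in a callback (the div.card container and the field names are placeholders, not taken from a real page):

    def parse(self, response):
        # css() on a response or on a selector returns a SelectorList,
        # so a match can be narrowed again before extracting values.
        for card in response.css("div.card"):
            yield {
                "title": card.css("h2::text").get(),       # first match or None
                "tags": card.css("a.tag::text").getall(),  # always a list
            }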

CSS selectors still operate on the downloaded response body, not a browser-rendered DOM, and Scrapy's pseudo-elements come from parsel, Scrapy's selector library, rather than general browser CSS. A ::text selector can also come back empty when a matched element holds no text, so cleanup code should supply a default before calling methods such as strip().
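
A minimal guard looks like this, where sel stands for any matched element selector:

    text = sel.css("::text").get()    # None when no text node matched
    label = (text or "").strip()      # same effect as get(default="").strip()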

Steps to use CSS selectors in Scrapy:

  1. Open a terminal in the Scrapy project directory or another working directory for a test spider.
    $ cd /root/sg-work/example
  2. Start scrapy shell with a page that has predictable sample markup.
    $ scrapy shell 'https://docs.scrapy.org/en/latest/_static/selectors-sample1.html' --nolog
    [s] Available Scrapy objects:
    [s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    ##### snipped #####
    >>>

    The shell exposes the fetched page as response, so selectors can be tested against the same response object a spider callback receives.
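
    If the target page changes between tests, the fetch() shortcut listed in the banner reloads response in place instead of restarting the shell:

    >>> fetch("https://docs.scrapy.org/en/latest/_static/selectors-sample1.html")
    >>> response.status
    200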

  3. Confirm the response body matches the expected page before extracting fields.
    >>> response.css("title::text").get()
    'Example website'

    get() returns the first match or None when nothing matches, while getall() always returns a list.
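
    The difference shows up when nothing matches; the sample page has no <h1>, so the same selector returns None from get() and an empty list from getall():

    >>> print(response.css("h1::text").get())
    None
    >>> response.css("h1::text").getall()
    []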

  4. Extract repeated attribute values or text nodes from the matched elements.
    >>> response.css("#images a::attr(href)").getall()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    >>> response.css("#images a::text").getall()
    ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

    ::attr(href) and ::text are pseudo-elements provided by parsel, Scrapy's selector library, so they work in Scrapy selectors but not in standard browser CSS.
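
    The same values can be extracted without pseudo-elements through response.xpath(), which works on the same response object; this assumes #images is a <div>, as it is in the sample markup:

    >>> response.xpath("//div[@id='images']/a/@href").getall()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']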

  5. Save the verified selectors in a spider so each matched link yields a cleaned label and an absolute URL.
    selectors_spider.py
    import scrapy
     
     
    class SelectorsSpider(scrapy.Spider):
        name = "selectors"
        start_urls = [
            "https://docs.scrapy.org/en/latest/_static/selectors-sample1.html",
        ]
     
        def parse(self, response):
            for link in response.css("#images a"):
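                # Relative selectors such as ::attr(href) apply within
                # this <a> element only.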
                href = link.css("::attr(href)").get()
                yield {
                    "label": link.css("::text").get(default="").strip(),
                    "href": response.urljoin(href) if href else None,
                }

    link.css("::text").get() can return None even when the element exists, so use default="" before strip() when text might be empty.

  6. Run the spider and confirm the extracted items are populated with the expected values.
    $ scrapy runspider --nolog --output -:json selectors_spider.py
    [
    {"label": "Name: My image 1", "href": "http://example.com/image1.html"},
    {"label": "Name: My image 2", "href": "http://example.com/image2.html"},
    {"label": "Name: My image 3", "href": "http://example.com/image3.html"},
    {"label": "Name: My image 4", "href": "http://example.com/image4.html"},
    {"label": "Name: My image 5", "href": "http://example.com/image5.html"}
    ]

    response.urljoin() resolves the relative href values against the page base URL. The sample page declares <base href='http://example.com/'>, which is why the items point at example.com rather than docs.scrapy.org and are immediately usable outside the spider.
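
    Back in scrapy shell, the same resolution can be checked directly:

    >>> response.urljoin("image1.html")
    'http://example.com/image1.html'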