Real estate catalog pages rarely expose the full property record on the search results page, so scraping only the visible cards leaves out fields such as address, bedroom count, agent details, or listing text. A listing-to-detail spider fixes that by treating the results page as a queue of property URLs and the detail page as the source of the final item data.

A Scrapy spider implements this pattern by selecting each property card on the results page, extracting the detail link, and yielding response.follow() requests with a second callback for the property page. Those follow-up requests can also carry list-page values via cb_kwargs, which is the recommended way to merge teaser fields such as the title or summary price into the final detail item.
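The merge semantics can be sketched outside Scrapy: detail-page values win, and list-page preview values fill the gaps. The `merge_listing` helper and its field names are illustrative, not part of the Scrapy API; they mirror what `response.css(...).get(default=preview_value)` does in the spider further down.

```python
def merge_listing(preview: dict, detail: dict) -> dict:
    """Prefer detail-page values; fall back to list-page preview fields.

    A detail value of None (selector matched nothing) falls back to the
    teaser value, just like .get(default=preview_value) in the spider.
    """
    merged = dict(preview)
    merged.update({k: v for k, v in detail.items() if v is not None})
    return merged


preview = {"title": "Lakeside Cabin", "price": "$420,000"}
detail = {"title": "Lakeside Cabin", "price": None, "city": "Pine Lake"}
print(merge_listing(preview, detail))
# {'title': 'Lakeside Cabin', 'price': '$420,000', 'city': 'Pine Lake'}
```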

Real estate sites often rotate CSS classes, mix promoted cards in with real listings, and enforce rate limits or robots.txt rules that can silently stall a crawl without any visible error. Test selectors in scrapy shell first, keep throttling conservative, and use a stable identifier such as the property URL or the site's listing ID so repeated crawls can deduplicate the same home across runs.

Steps to scrape real estate listings with detail pages in Scrapy:

  1. Create a new Scrapy project for the real estate spider.
    $ scrapy startproject real_estate
    New Scrapy project 'real_estate', using template directory '##### snipped #####', created in:
        /root/sg-work/real_estate
    
    You can start your first spider with:
        cd real_estate
        scrapy genspider example example.com
  2. Change to the new project directory.
    $ cd real_estate
  3. Generate a basic spider for the property site.
    $ scrapy genspider homes app.internal.example
    Created spider 'homes' using template 'basic' in module:
      real_estate.spiders.homes
  4. Open scrapy shell against a results page and confirm the selector returns property detail URLs.
    $ scrapy shell 'http://app.internal.example:8000/real-estate/'
    [s] Available Scrapy objects:
    [s]   response   <200 http://app.internal.example:8000/real-estate/>
    ##### snipped #####
    >>> response.css("article.listing a.detail::attr(href)").getall()
    ['/real-estate/downtown-loft.html', '/real-estate/lakeside-cabin.html']
  5. Open scrapy shell against one property page and confirm the detail-only fields.
    $ scrapy shell 'http://app.internal.example:8000/real-estate/lakeside-cabin.html'
    [s] Available Scrapy objects:
    [s]   response   <200 http://app.internal.example:8000/real-estate/lakeside-cabin.html>
    ##### snipped #####
    >>> response.css("h1::text").get()
    'Lakeside Cabin'
    >>> response.css(".city::text").get()
    'Pine Lake'
    >>> response.css(".beds::text").get()
    '3'

    Keep selectors anchored to headings, labels, semantic containers, or stable data attributes instead of short-lived class hashes.

  6. Replace the generated spider with a listing-to-detail spider that follows property links, passes preview values with cb_kwargs, and queues pagination from the results page.
    import scrapy
     
     
    class HomesSpider(scrapy.Spider):
        name = "homes"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/real-estate/"]
     
        custom_settings = {
            "ROBOTSTXT_OBEY": True,
            "DOWNLOAD_DELAY": 1.0,
            "AUTOTHROTTLE_ENABLED": True,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        }
     
        def parse(self, response):
            for card in response.css("article.listing"):
                href = card.css("a.detail::attr(href)").get()
                preview_price = card.css("span.price::text").get(default="").strip()
                preview_title = card.css("h2::text").get(default="").strip()
     
                if href:
                    yield response.follow(
                        href,
                        callback=self.parse_listing,
                        cb_kwargs={
                            "preview_price": preview_price,
                            "preview_title": preview_title,
                        },
                    )
     
            next_href = response.css("a.next::attr(href)").get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)
     
        def parse_listing(self, response, preview_price, preview_title):
            yield {
                "listing_id": response.url.rstrip("/").split("/")[-1].replace(".html", ""),
                "title": response.css("h1::text").get(default=preview_title).strip(),
                "price": response.css(".price::text").get(default=preview_price).strip(),
                "city": response.css(".city::text").get(default="").strip(),
                "beds": response.css(".beds::text").get(default="").strip(),
                "url": response.url,
            }

    Promoted cards, missing detail URLs, or duplicated property links can quietly poison the dataset, so ignore cards without a real property URL and keep one stable dedup field such as listing_id or url.
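    Cross-run deduplication against that stable field can be sketched as a small helper. The function names are assumptions, not Scrapy built-ins; `listing_id_from_url` mirrors the listing_id expression in parse_listing(), except that removesuffix() is used instead of replace() so only a trailing ".html" is stripped.

    ```python
    def listing_id_from_url(url: str) -> str:
        """Derive the stable per-property identifier: the last path segment
        with any trailing slash and .html suffix removed."""
        return url.rstrip("/").split("/")[-1].removesuffix(".html")


    def dedupe(items: list[dict]) -> list[dict]:
        """Keep the first item seen for each listing_id when merging crawl runs."""
        seen: set[str] = set()
        unique = []
        for item in items:
            key = item.get("listing_id") or listing_id_from_url(item["url"])
            if key not in seen:
                seen.add(key)
                unique.append(item)
        return unique
    ```

    The same seen-set logic also fits naturally into an item pipeline if deduplication should happen during the crawl rather than after export.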

  7. Run the spider and overwrite the previous JSON export on each crawl.
    $ scrapy crawl homes -O homes.json
    ##### snipped #####
    2026-04-16 06:19:03 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: homes.json
    2026-04-16 06:19:03 [scrapy.core.engine] INFO: Spider closed (finished)

    -O overwrites the local export file, which keeps repeated test runs from appending stale items.

  8. Open the exported JSON and confirm each item contains fields assembled from the list page plus the property detail page.
    $ cat homes.json
    [
    {"listing_id": "lakeside-cabin", "title": "Lakeside Cabin", "price": "$420,000", "city": "Pine Lake", "beds": "3", "url": "http://app.internal.example:8000/real-estate/lakeside-cabin.html"},
    {"listing_id": "downtown-loft", "title": "Downtown Loft", "price": "$685,000", "city": "River City", "beds": "2", "url": "http://app.internal.example:8000/real-estate/downtown-loft.html"},
    {"listing_id": "garden-bungalow", "title": "Garden Bungalow", "price": "$510,000", "city": "Westfield", "beds": "4", "url": "http://app.internal.example:8000/real-estate/garden-bungalow.html"}
    ]

    Keeping the test export small and readable makes it easier to spot partial items before the spider is pointed at a larger results set.
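    That spot check can also be scripted. The `check_items` helper and its required-field list are illustrative, assuming the JSON shape shown above:

    ```python
    import json

    REQUIRED = ("listing_id", "title", "price", "city", "beds", "url")


    def check_items(items):
        """Return one message per partial item; an empty list means every item is complete."""
        problems = []
        for item in items:
            missing = [field for field in REQUIRED if not item.get(field)]
            if missing:
                problems.append(f"{item.get('url', '?')} missing: {', '.join(missing)}")
        return problems


    # Typical use after a crawl:
    # problems = check_items(json.load(open("homes.json")))
    ```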

Notes

  • Scrapy recommends cb_kwargs for passing user data between callbacks, while meta is better reserved for request metadata used by middlewares or extensions.
  • response.follow() accepts relative property and pagination links directly, so a separate response.urljoin() call is not required for normal HTML href values.
  • Add fields such as address, postal code, agent name, or listing text in parse_listing() only after confirming they exist in the HTML returned to Scrapy.
  • Use How to scrape a JavaScript-rendered page with Scrapy using Playwright or How to use Selenium with Scrapy when the property data is injected after page load and the raw response does not contain the needed markup.
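  The relative-link resolution that response.follow() performs can be illustrated with the standard library's urljoin, which is what a manual scrapy.Request construction would otherwise need (the base URL matches the results page used above):

  ```python
  from urllib.parse import urljoin

  base = "http://app.internal.example:8000/real-estate/"

  # response.follow(href) resolves relative hrefs against response.url like this:
  print(urljoin(base, "lakeside-cabin.html"))
  # http://app.internal.example:8000/real-estate/lakeside-cabin.html
  print(urljoin(base, "/real-estate/downtown-loft.html"))
  # http://app.internal.example:8000/real-estate/downtown-loft.html
  print(urljoin(base, "?page=2"))
  # http://app.internal.example:8000/real-estate/?page=2
  ```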