Real estate websites often split information between search results cards and a dedicated property page, which makes it easy to miss key fields when collecting data from only one view. Capturing both the list-level snapshot and the detail-level attributes produces a dataset that supports price tracking, inventory comparisons, and change detection.
Scrapy fits this pattern by extracting property detail URLs from a listing results page, scheduling follow-up requests for each detail page, and yielding structured items from the detail responses. Pagination links from the results page can be followed to cover an entire search area while keeping requests scoped to the intended domain.
Selectors often need regular maintenance because property sites change HTML markup, rotate CSS classes, or insert sponsored cards that resemble listings. Many sites also enforce robots.txt, terms of use, and rate limits; aggressive crawling can trigger captchas or blocks, and collecting personal information may carry compliance obligations. Prefer stable identifiers (URL or listing ID) for de-duplication and treat extracted values as a point-in-time snapshot.
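The de-duplication advice above can be sketched with a small helper that keys records on a stable identifier. This is a minimal illustration, not part of the spider below; the field names mirror the spider's output, and the sample rows are made up.

```python
# Minimal de-duplication sketch: key each scraped record by a stable
# identifier (here the listing_id slug) rather than mutable fields
# like price, which change between crawls.
def dedupe(records, key="listing_id"):
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

rows = [
    {"listing_id": "lakeside-cabin", "price": "$420,000"},
    {"listing_id": "lakeside-cabin", "price": "$415,000"},  # re-crawled later
]
print(dedupe(rows))  # keeps only the first snapshot per listing_id
```

In a real pipeline the same idea usually lives in a Scrapy item pipeline or a database unique constraint, but the keying decision is identical.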
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
$ scrapy genspider homes app.internal.example
Created spider 'homes' using template 'basic' in module:
  real_estate.spiders.homes
$ scrapy shell "http://app.internal.example:8000/real-estate/"
>>> response.css("article.listing a::attr(href)").getall()[:3]
['/real-estate/lakeside-cabin.html', '/real-estate/downtown-loft.html']
Selector output should contain on-site detail paths rather than ad redirects.
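One way to enforce that check before scheduling follow-up requests is a small filter over the extracted hrefs. The helper below is a hypothetical sketch: it keeps only relative paths under an assumed `/real-estate/` prefix and drops absolute URLs, which covers ad redirects pointing off-site (at the cost of also dropping absolute same-site links).

```python
from urllib.parse import urlparse

# Hypothetical guard: keep relative on-site detail paths, drop
# absolute URLs such as ad redirects before following them.
def on_site_detail_paths(hrefs, prefix="/real-estate/"):
    kept = []
    for href in hrefs:
        parsed = urlparse(href)
        if parsed.netloc:  # absolute URL -> off-site or redirect link
            continue
        if parsed.path.startswith(prefix):
            kept.append(parsed.path)
    return kept

print(on_site_detail_paths([
    "/real-estate/lakeside-cabin.html",
    "https://ads.example/click?i=9",  # ad redirect, dropped
]))
# ['/real-estate/lakeside-cabin.html']
```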
$ scrapy shell "http://app.internal.example:8000/real-estate/lakeside-cabin.html"
>>> response.css(".price::text").get()
'$420,000'
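The `.price` text arrives as a display string like `'$420,000'`, so downstream price tracking usually wants it normalized to a number. A minimal sketch, assuming prices are plain integers with currency symbols and thousands separators:

```python
import re

# Normalize a display price such as '$420,000' to an integer.
# Returns None when no digits are present (e.g. 'Contact agent').
def parse_price(text):
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None

print(parse_price("$420,000"))      # 420000
print(parse_price("Contact agent")) # None
```

Keeping the raw string alongside the parsed value is often worthwhile, since formats vary across sites and over time.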
>>> response.css("h1::text").get()
'Lakeside Cabin'
Prefer stable selectors (semantic attributes, headings, labels) over auto-generated class names when available.
import scrapy


class HomesSpider(scrapy.Spider):
    name = "homes"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/real-estate/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Follow each listing card to its detail page.
        for href in response.css("article.listing a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_listing)
        # Follow pagination until the results run out.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_listing(self, response):
        yield {
            "listing_id": response.url.rstrip("/").split("/")[-1].replace(".html", ""),
            "price": response.css(".price::text").get(default="").strip(),
            "title": response.css("h1::text").get(default="").strip(),
            "city": response.css(".city::text").get(default="").strip(),
            "url": response.url,
        }
Ignoring site terms or crawling too aggressively can trigger CAPTCHA challenges, IP blocks, or account lockouts.
$ scrapy crawl homes -O homes.json
2026-01-01 09:47:55 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: homes.json
Use -O to overwrite an existing output file, or -o to append; note that appending to a .json feed produces invalid JSON across runs, so prefer a .jsonl feed when appending.
$ python -c "import json; print(len(json.load(open('homes.json'))))"
2
$ python -c "import json; print(sorted(json.load(open('homes.json'))[0].keys()))"
['city', 'listing_id', 'price', 'title', 'url']
$ python -c "import json; print(json.load(open('homes.json'))[0])"
{'listing_id': 'downtown-loft', 'price': '$685,000', 'title': 'Downtown Loft', 'city': 'River City', 'url': 'http://app.internal.example:8000/real-estate/downtown-loft.html'}
Keeping listing_id or url stable across runs simplifies de-duplication and change tracking.
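That change tracking can be sketched by diffing two snapshots keyed on `listing_id`. The function below is illustrative: the inline lists stand in for two `homes.json` files from different runs, and the field names mirror the spider output above.

```python
# Diff two crawl snapshots keyed by a stable identifier: report
# listings that appeared, disappeared, or changed a tracked field.
def diff_snapshots(old, new, key="listing_id", field="price"):
    old_by_id = {r[key]: r for r in old}
    new_by_id = {r[key]: r for r in new}
    changed = {
        k: (old_by_id[k][field], new_by_id[k][field])
        for k in set(old_by_id) & set(new_by_id)
        if old_by_id[k][field] != new_by_id[k][field]
    }
    return {
        "added": sorted(set(new_by_id) - set(old_by_id)),
        "removed": sorted(set(old_by_id) - set(new_by_id)),
        "changed": changed,
    }

old = [{"listing_id": "downtown-loft", "price": "$685,000"}]
new = [{"listing_id": "downtown-loft", "price": "$679,000"},
       {"listing_id": "lakeside-cabin", "price": "$420,000"}]
print(diff_snapshots(old, new))
```

With real exports, load each file via `json.load` and pass the two lists in; because records are keyed by `listing_id`, price changes and new or removed inventory fall out directly.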