Real estate websites often split information between search results cards and a dedicated property page, which makes it easy to miss key fields when collecting data from only one view. Capturing both the list-level snapshot and the detail-level attributes produces a dataset that supports price tracking, inventory comparisons, and change detection.
Scrapy fits this pattern by extracting property detail URLs from a listing results page, scheduling follow-up requests for each detail page, and yielding structured items from the detail responses. Pagination links from the results page can be followed to cover an entire search area while keeping requests scoped to the intended domain.
Selectors often need regular maintenance because property sites change HTML markup, rotate CSS classes, or insert sponsored cards that resemble listings. Many sites also enforce robots.txt, terms of use, and rate limits; aggressive crawling can trigger captchas or blocks, and collecting personal information may carry compliance obligations. Prefer stable identifiers (URL or listing ID) for de-duplication and treat extracted values as a point-in-time snapshot.
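Because each run is only a point-in-time snapshot, attaching a capture timestamp to every item makes later change detection straightforward; the scraped_at field below is an illustrative naming choice rather than a Scrapy convention.

from datetime import datetime, timezone

def stamp(item: dict) -> dict:
    # Record when the listing was observed so later runs can be
    # compared for price or inventory changes.
    item["scraped_at"] = datetime.now(timezone.utc).isoformat()
    return item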
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape real estate listings with detail pages in Scrapy:
- Generate a spider for the property listing domain.
$ scrapy genspider homes app.internal.example
Created spider 'homes' using template 'basic' in module:
  real_estate.spiders.homes
- Probe a listing results page in scrapy shell to confirm the selector that returns property detail URLs.
$ scrapy shell "http://app.internal.example:8000/real-estate/"
>>> response.css("article.listing a::attr(href)").getall()[:3]
['/real-estate/lakeside-cabin.html', '/real-estate/downtown-loft.html']
Selector output should contain on-site detail paths rather than ad redirects.
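If ad redirects or promoted cards creep into the results, the hrefs can be filtered to on-site detail paths while still in the shell; the /real-estate/ prefix check below is specific to this demo site and would need adjusting for other markup.

>>> hrefs = response.css("article.listing a::attr(href)").getall()
>>> [h for h in hrefs if h.startswith("/real-estate/")]
['/real-estate/lakeside-cabin.html', '/real-estate/downtown-loft.html']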
- Probe a property detail page in scrapy shell to confirm selectors for the fields to extract.
$ scrapy shell "http://app.internal.example:8000/real-estate/lakeside-cabin.html"
>>> response.css(".price::text").get()
'$420,000'
>>> response.css("h1::text").get()
'Lakeside Cabin'
Prefer stable selectors (semantic attributes, headings, labels) over auto-generated class names when available.
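When only auto-generated class names are available, a small fallback helper can try a preferred selector first and fall back to alternatives; first_text below is a hypothetical helper, and the .title::text fallback is illustrative rather than taken from this page.

def first_text(response, *queries):
    # Return the first non-empty text match among candidate CSS queries,
    # so a single markup change does not silently break extraction.
    for query in queries:
        value = response.css(query).get()
        if value and value.strip():
            return value.strip()
    return ""

# Example: prefer the heading, fall back to a class-based selector.
# first_text(response, "h1::text", ".title::text")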
- Replace the generated spider with a listing-to-detail spider that follows card links and pagination links while yielding structured items.
import scrapy


class HomesSpider(scrapy.Spider):
    name = "homes"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/real-estate/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Follow every listing card to its property detail page.
        for href in response.css("article.listing a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_listing)
        # Continue through paginated results until no next link remains.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_listing(self, response):
        # Derive a stable identifier from the detail URL's final path segment.
        yield {
            "listing_id": response.url.rstrip("/").split("/")[-1].replace(".html", ""),
            "price": response.css(".price::text").get(default="").strip(),
            "title": response.css("h1::text").get(default="").strip(),
            "city": response.css(".city::text").get(default="").strip(),
            "url": response.url,
        }
Ignoring site terms or crawling too aggressively can trigger CAPTCHA challenges, IP blocks, or account lockouts.
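The spider above yields price as a display string such as '$420,000'; a small item pipeline can add a numeric companion field for comparisons. PriceNormalizationPipeline and price_amount are illustrative names, and the pipeline must be registered in ITEM_PIPELINES to take effect.

import re

class PriceNormalizationPipeline:
    # Illustrative pipeline: turns "$420,000" into the integer 420000
    # while leaving the original display string untouched.
    def process_item(self, item, spider):
        digits = re.sub(r"[^\d]", "", item.get("price", ""))
        item["price_amount"] = int(digits) if digits else None
        return item

# settings.py (assumed project layout):
# ITEM_PIPELINES = {"real_estate.pipelines.PriceNormalizationPipeline": 300}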
- Run the spider with JSON feed export enabled.
$ scrapy crawl homes -O homes.json
2026-01-01 09:47:55 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: homes.json
Use -O to overwrite an existing output file, or -o to append.
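The same export can be configured once in settings.py through Scrapy's FEEDS setting (available since Scrapy 2.1) instead of passing a flag on every run; overwrite set to True mirrors -O, and False mirrors the append behavior of -o.

# settings.py: equivalent of running `scrapy crawl homes -O homes.json`
FEEDS = {
    "homes.json": {
        "format": "json",
        "overwrite": True,  # False appends, matching -o
    },
}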
- Validate the exported JSON for item count and expected keys.
$ python -c "import json; print(len(json.load(open('homes.json'))))"
2
$ python -c "import json; print(sorted(json.load(open('homes.json'))[0].keys()))"
['city', 'listing_id', 'price', 'title', 'url']
$ python -c "import json; print(json.load(open('homes.json'))[0])"
{'listing_id': 'downtown-loft', 'price': '$685,000', 'title': 'Downtown Loft', 'city': 'River City', 'url': 'http://app.internal.example:8000/real-estate/downtown-loft.html'}
Keeping listing_id or url stable across runs simplifies de-duplication and change tracking.
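For de-duplication within a single crawl, a seen-set pipeline keyed on listing_id can drop repeats before they reach the feed; DropItem is Scrapy's standard way to discard an item, while the class name below is an illustrative choice.

from scrapy.exceptions import DropItem

class DuplicateListingPipeline:
    # Illustrative pipeline: drops any item whose listing_id has
    # already been seen during the current crawl.
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        listing_id = item.get("listing_id")
        if listing_id in self.seen_ids:
            raise DropItem(f"Duplicate listing: {listing_id}")
        self.seen_ids.add(listing_id)
        return item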
