Real estate results pages usually expose only teaser data such as the property title, a thumbnail, the asking price, and the link to the full listing. The address, bedroom count, and other durable fields often live only on the detail page, so a spider that stops at the results cards exports incomplete records.
Current Scrapy releases fit this pattern well because scrapy shell can confirm the list-page and detail-page selectors before any crawl starts, response.follow() accepts the relative detail links directly, and cb_kwargs can carry teaser values into the detail callback when the final page omits or rewrites them.
A freshly generated Scrapy project ships polite defaults such as ROBOTSTXT_OBEY = True and UTF-8 feed export, and pairing them with CONCURRENT_REQUESTS_PER_DOMAIN = 1 and DOWNLOAD_DELAY = 1 keeps the crawl gentle, but real estate sites still change markup and can load fields only after JavaScript runs. If the property data is missing from the downloaded HTML, switch to an API or a browser-rendered workflow instead of widening selectors until unrelated page elements match.
Related: How to scrape paginated pages with Scrapy
Related: How to use Scrapy shell
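Before writing any selectors, confirm the target fields are server-rendered. scrapy fetch prints exactly the HTML that Scrapy will download, so piping it through grep is a quick sanity check; the URL is this guide's example site, and class="beds" matches the detail-page markup assumed throughout:
$ scrapy fetch --nolog 'https://property.example/real-estate/lakeside-cabin.html' | grep -c 'class="beds"'
A count of zero means the value arrives after page load, and no selector will ever see it.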
$ scrapy startproject real_estate
New Scrapy project 'real_estate', using template directory '##### snipped #####', created in:
/home/user/real_estate
You can start your first spider with:
cd real_estate
scrapy genspider example example.com
$ cd real_estate
$ scrapy genspider homes property.example
Created spider 'homes' using template 'basic' in module:
  real_estate.spiders.homes
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
FEED_EXPORT_ENCODING = "utf-8"
Current Scrapy project templates already enable ROBOTSTXT_OBEY and the UTF-8 feed encoding; the two throttling settings ship commented out with different sample values, so set them explicitly in settings.py. Enable AutoThrottle (AUTOTHROTTLE_ENABLED) only when the target site needs adaptive backoff.
Related: How to enable AutoThrottle in Scrapy
Related: How to set a download delay in Scrapy
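If you do enable it, a minimal AutoThrottle block in settings.py looks like the sketch below; the delay values are illustrative starting points, not recommendations for any particular site:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0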
$ scrapy shell 'https://property.example/real-estate/' --nolog
[s] Available Scrapy objects:
[s] response <200 https://property.example/real-estate/>
##### snipped #####
>>> response.css("article.listing a.detail::attr(href)").getall()
['/real-estate/lakeside-cabin.html', '/real-estate/downtown-loft.html']
Keep the selector anchored to the listing container so banners, filter controls, and promoted widgets do not widen the crawl. Related: How to use CSS selectors in Scrapy
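For contrast, a bare anchor selector in the same shell session would also return navigation and promo hrefs; the extra paths below are hypothetical noise, not real output from the example site:
>>> response.css("a::attr(href)").getall()  # hypothetical: matches every link on the page
['/login', '/real-estate/lakeside-cabin.html', '/real-estate/downtown-loft.html', '/ads/featured-agent']
Every stray href becomes a wasted, and possibly unwanted, request once the spider starts following links.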
$ scrapy shell 'https://property.example/real-estate/lakeside-cabin.html' --nolog
[s] Available Scrapy objects:
[s] response <200 https://property.example/real-estate/lakeside-cabin.html>
##### snipped #####
>>> response.css("h1::text").get()
'Lakeside Cabin'
>>> response.css(".city::text").get()
'Pine Lake'
>>> response.css(".beds::text").get()
'3'
Test the selectors against the HTML that Scrapy downloads, not the browser DOM after scripts run. Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
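The shell's built-in view() helper makes that difference concrete: it opens the HTML that Scrapy downloaded, with no scripts executed, in your browser.
>>> view(response)
Any field visible on the live site but missing from that saved copy is injected by JavaScript, and a plain selector will never match it.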
import scrapy


class HomesSpider(scrapy.Spider):
    name = "homes"
    allowed_domains = ["property.example"]
    start_urls = ["https://property.example/real-estate/"]

    def parse(self, response):
        # Each card carries teaser values that the detail page may omit or rewrite.
        for card in response.css("article.listing"):
            href = card.css("a.detail::attr(href)").get()
            preview_price = card.css("span.price::text").get(default="").strip()
            preview_title = card.css("h2::text").get(default="").strip()
            if href:
                yield response.follow(
                    href,
                    callback=self.parse_listing,
                    cb_kwargs={
                        "preview_price": preview_price,
                        "preview_title": preview_title,
                    },
                )
        # Follow pagination with the same callback so every results page is parsed.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_listing(self, response, preview_price, preview_title):
        # Prefer the detail-page values; fall back to the teaser data from the card.
        yield {
            "listing_id": response.url.rstrip("/").split("/")[-1].replace(".html", ""),
            "title": response.css("h1::text").get(default=preview_title).strip(),
            "price": response.css(".price::text").get(default=preview_price).strip(),
            "city": response.css(".city::text").get(default="").strip(),
            "beds": response.css(".beds::text").get(default="").strip(),
            "url": response.url,
        }
Skip cards without a real property URL and keep one stable dedup field such as listing_id or url; otherwise promoted blocks or repeated listings can quietly poison the export.
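A small item pipeline can enforce that dedup field. The sketch below assumes the listing_id key produced by the spider above; the class name is illustrative, and the pipeline must be registered in ITEM_PIPELINES before it runs:
# pipelines.py
from scrapy.exceptions import DropItem

class DedupListingsPipeline:
    def __init__(self):
        self.seen_ids = set()  # listing_ids already exported this run

    def process_item(self, item, spider):
        listing_id = item.get("listing_id")
        if listing_id in self.seen_ids:
            # Promoted blocks often repeat a listing; keep only the first copy.
            raise DropItem(f"duplicate listing: {listing_id}")
        self.seen_ids.add(listing_id)
        return item
Register it in settings.py with ITEM_PIPELINES = {"real_estate.pipelines.DedupListingsPipeline": 300}.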
$ scrapy crawl homes -O homes.json
2026-04-22 05:54:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://property.example/real-estate/> (referer: None)
2026-04-22 05:54:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://property.example/real-estate/page-2.html> (referer: https://property.example/real-estate/)
2026-04-22 05:54:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://property.example/real-estate/downtown-loft.html>
{'listing_id': 'downtown-loft', 'title': 'Downtown Loft', 'price': '$685,000', 'city': 'River City', 'beds': '2', 'url': 'https://property.example/real-estate/downtown-loft.html'}
##### snipped #####
2026-04-22 05:54:19 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: homes.json
2026-04-22 05:54:19 [scrapy.core.engine] INFO: Spider closed (finished)
-O is the short form of --overwrite-output, so each test run replaces the previous local export instead of appending stale records.
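The lowercase -o flag appends instead, which turns a plain JSON file into invalid JSON after the second run; if you need append semantics, point it at a JSON Lines feed:
$ scrapy crawl homes -o homes.jsonl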
$ cat homes.json
[
{"listing_id": "downtown-loft", "title": "Downtown Loft", "price": "$685,000", "city": "River City", "beds": "2", "url": "https://property.example/real-estate/downtown-loft.html"},
{"listing_id": "lakeside-cabin", "title": "Lakeside Cabin", "price": "$420,000", "city": "Pine Lake", "beds": "3", "url": "https://property.example/real-estate/lakeside-cabin.html"},
{"listing_id": "garden-bungalow", "title": "Garden Bungalow", "price": "$510,000", "city": "Westfield", "beds": "4", "url": "https://property.example/real-estate/garden-bungalow.html"}
]
If city, beds, or other detail fields stay empty even though the browser shows them, inspect the page's network requests or move to a rendered workflow. Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
Related: How to use Selenium with Scrapy
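As a rough sketch of the rendered route using the separate scrapy-playwright package (property.example again stands in for the real site), the plugin is wired up in settings.py and each request opts in through meta:
# settings.py additions for scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, mark requests that need a rendered DOM:
def start_requests(self):
    yield scrapy.Request(
        "https://property.example/real-estate/",
        meta={"playwright": True},  # download via a headless browser
    )
Only the requests that carry the playwright flag pay the rendering cost; everything else keeps the fast plain-HTTP path.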