Real estate websites often split information between search results cards and a dedicated property page, which makes it easy to miss key fields when collecting data from only one view. Capturing both the list-level snapshot and the detail-level attributes produces a dataset that supports price tracking, inventory comparisons, and change detection.
Scrapy fits this pattern by extracting property detail URLs from a listing results page, scheduling follow-up requests for each detail page, and yielding structured items from the detail responses. Pagination links from the results page can be followed to cover an entire search area while keeping requests scoped to the intended domain.
Selectors often need regular maintenance because property sites change HTML markup, rotate CSS classes, or insert sponsored cards that resemble listings. Many sites also enforce robots.txt, terms of use, and rate limits; aggressive crawling can trigger captchas or blocks, and collecting personal information may carry compliance obligations. Prefer stable identifiers (URL or listing ID) for de-duplication and treat extracted values as a point-in-time snapshot.
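The de-duplication advice above can be sketched with a small helper that keys records on a stable identifier. This is a minimal illustration, not part of the spider below; the field names mirror the spider's output, and the sample rows are made up.

```python
# Minimal de-duplication sketch: key each scraped record by a stable
# identifier (here the listing_id slug) rather than mutable fields
# like price, which change between crawls.
def dedupe(records, key="listing_id"):
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

rows = [
    {"listing_id": "lakeside-cabin", "price": "$420,000"},
    {"listing_id": "lakeside-cabin", "price": "$415,000"},  # re-crawled later
]
print(dedupe(rows))  # keeps only the first snapshot per listing_id
```

In a real pipeline the same idea usually lives in a Scrapy item pipeline or a database unique constraint, but the keying decision is identical.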
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
$ scrapy genspider homes app.internal.example
Created spider 'homes' using template 'basic' in module:
  real_estate.spiders.homes
$ scrapy shell "http://app.internal.example:8000/real-estate/"
>>> response.css("article.listing a::attr(href)").getall()[:3]
['/real-estate/lakeside-cabin.html', '/real-estate/downtown-loft.html']
Selector output should contain on-site detail paths rather than ad redirects.
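One way to enforce that check before scheduling follow-up requests is a small filter over the extracted hrefs. The helper below is a hypothetical sketch: it keeps only relative paths under an assumed `/real-estate/` prefix and drops absolute URLs, which covers ad redirects pointing off-site (at the cost of also dropping absolute same-site links).

```python
from urllib.parse import urlparse

# Hypothetical guard: keep relative on-site detail paths, drop
# absolute URLs such as ad redirects before following them.
def on_site_detail_paths(hrefs, prefix="/real-estate/"):
    kept = []
    for href in hrefs:
        parsed = urlparse(href)
        if parsed.netloc:  # absolute URL -> off-site or redirect link
            continue
        if parsed.path.startswith(prefix):
            kept.append(parsed.path)
    return kept

print(on_site_detail_paths([
    "/real-estate/lakeside-cabin.html",
    "https://ads.example/click?i=9",  # ad redirect, dropped
]))
# ['/real-estate/lakeside-cabin.html']
```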
$ scrapy shell "http://app.internal.example:8000/real-estate/lakeside-cabin.html"
>>> response.css(".price::text").get()
'$420,000'
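The `.price` text arrives as a display string like `'$420,000'`, so downstream price tracking usually wants it normalized to a number. A minimal sketch, assuming prices are plain integers with currency symbols and thousands separators:

```python
import re

# Normalize a display price such as '$420,000' to an integer.
# Returns None when no digits are present (e.g. 'Contact agent').
def parse_price(text):
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None

print(parse_price("$420,000"))      # 420000
print(parse_price("Contact agent")) # None
```

Keeping the raw string alongside the parsed value is often worthwhile, since formats vary across sites and over time.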
>>> response.css("h1::text").get()
'Lakeside Cabin'
Prefer stable selectors (semantic attributes, headings, labels) over auto-generated class names when available.
import scrapy


class HomesSpider(scrapy.Spider):
    name = "homes"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/real-estate/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 1.0,
        "AUTOTHROTTLE_ENABLED": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Follow each listing card to its detail page.
        for href in response.css("article.listing a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_listing)
        # Follow pagination until the results run out.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_listing(self, response):
        yield {
            "listing_id": response.url.rstrip("/").split("/")[-1].replace(".html", ""),
            "price": response.css(".price::text").get(default="").strip(),
            "title": response.css("h1::text").get(default="").strip(),
            "city": response.css(".city::text").get(default="").strip(),
            "url": response.url,
        }
Ignoring site terms or crawling too aggressively can trigger CAPTCHA challenges, IP blocks, or account lockouts.
$ scrapy crawl homes -O homes.json
2026-01-01 09:47:55 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: homes.json
Use -O to overwrite an existing output file, or -o to append; note that appending to a .json feed produces invalid JSON across runs, so prefer a .jsonl feed when appending.
$ python -c "import json; print(len(json.load(open('homes.json'))))"
2
$ python -c "import json; print(sorted(json.load(open('homes.json'))[0].keys()))"
['city', 'listing_id', 'price', 'title', 'url']
$ python -c "import json; print(json.load(open('homes.json'))[0])"
{'listing_id': 'downtown-loft', 'price': '$685,000', 'title': 'Downtown Loft', 'city': 'River City', 'url': 'http://app.internal.example:8000/real-estate/downtown-loft.html'}
Keeping listing_id or url stable across runs simplifies de-duplication and change tracking.
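That change tracking can be sketched by diffing two snapshots keyed on `listing_id`. The function below is illustrative: the inline lists stand in for two `homes.json` files from different runs, and the field names mirror the spider output above.

```python
# Diff two crawl snapshots keyed by a stable identifier: report
# listings that appeared, disappeared, or changed a tracked field.
def diff_snapshots(old, new, key="listing_id", field="price"):
    old_by_id = {r[key]: r for r in old}
    new_by_id = {r[key]: r for r in new}
    changed = {
        k: (old_by_id[k][field], new_by_id[k][field])
        for k in set(old_by_id) & set(new_by_id)
        if old_by_id[k][field] != new_by_id[k][field]
    }
    return {
        "added": sorted(set(new_by_id) - set(old_by_id)),
        "removed": sorted(set(old_by_id) - set(new_by_id)),
        "changed": changed,
    }

old = [{"listing_id": "downtown-loft", "price": "$685,000"}]
new = [{"listing_id": "downtown-loft", "price": "$679,000"},
       {"listing_id": "lakeside-cabin", "price": "$420,000"}]
print(diff_snapshots(old, new))
```

With real exports, load each file via `json.load` and pass the two lists in; because records are keyed by `listing_id`, price changes and new or removed inventory fall out directly.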