Job boards commonly surface only a title, company, and a few tags on the listings index, leaving the full role description and requirements on a separate detail page. Crawling both pages produces complete records that remain useful for search, alerts, and analysis.
Scrapy fits this pattern by extracting each detail URL from the listing response and scheduling follow-up requests via response.follow(). Each detail response is parsed by a dedicated callback that yields one structured item, while the listing callback continues walking pagination until the next-page link disappears.
Markup and URL structures change frequently on job boards, and many sites use relative links or multiple hostnames (for example www and jobs). Keep the crawl constrained to the intended domain, normalize extracted text to remove noisy whitespace, throttle requests to avoid rate limiting, and expect JavaScript-rendered descriptions to need fetching from the site's underlying API endpoint rather than parsing the page HTML.
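When descriptions are rendered client-side, the detail stage can request the JSON document the page's JavaScript would otherwise fetch. The sketch below assumes a hypothetical /api/jobs/<slug> endpoint and a data-slug attribute on each card; the real request URL and payload shape have to be taken from the browser's network tab.

import scrapy


class ApiJobsSpider(scrapy.Spider):
    """Sketch for boards whose descriptions are rendered client-side.

    The /api/jobs/<slug> endpoint and the data-slug attribute are
    hypothetical; find the real request in the browser's network tab.
    """

    name = "jobs_api"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/jobs/"]

    def parse(self, response):
        for card in response.css("article.job"):
            slug = card.attrib.get("data-slug")  # hypothetical attribute
            if slug:
                # Request the JSON the page would load to render the description.
                yield response.follow(f"/api/jobs/{slug}", callback=self.parse_job_api)

    def parse_job_api(self, response):
        data = response.json()  # parses the JSON body (Scrapy >= 2.2)
        yield {
            "title": data.get("title", ""),
            "description": data.get("description", ""),
            "url": response.url,
        }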
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape job listings with detail pages in Scrapy:
- Generate a spider for the job board domain.
$ scrapy genspider jobs app.internal.example
Created spider 'jobs' using template 'basic' in module: job_board.spiders.jobs
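The basic template produces a minimal skeleton roughly like the following (exact quoting and URL scheme vary by Scrapy version); the next steps replace it entirely.

import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    def parse(self, response):
        pass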
- Open the spider file for editing.
$ vi job_board/spiders/jobs.py
- Replace the spider code with a two-stage crawl that follows each job card link.
import scrapy


def join_clean_text(texts):
    # Strip surrounding whitespace and drop empty fragments before joining.
    parts = []
    for text in texts:
        cleaned = text.strip()
        if cleaned:
            parts.append(cleaned)
    return " ".join(parts)


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/jobs/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Stage 1: follow each job card to its detail page.
        for card in response.css("article.job"):
            detail_href = card.css("a::attr(href)").get()
            if not detail_href:
                continue
            yield response.follow(detail_href, callback=self.parse_job)

        # Keep walking pagination until the next-page link disappears.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response):
        # Stage 2: build one structured record from the detail page.
        description = join_clean_text(
            response.css("div.job-description ::text").getall()
        )
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "team": response.css(".team::text").get(default="").strip(),
            "location": response.css(".location::text").get(default="").strip(),
            "description": description,
            "url": response.url,
        }
response.follow() resolves relative href values against the current page URL.
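The resolution follows standard URL joining against the listing page URL, which the standard library can illustrate on its own:

from urllib.parse import urljoin

# The same joining response.follow() applies to a relative href,
# shown here against the listing page URL used in this example.
listing_url = "http://app.internal.example:8000/jobs/"
print(urljoin(listing_url, "site-reliability-engineer.html"))
# http://app.internal.example:8000/jobs/site-reliability-engineer.html
print(urljoin(listing_url, "/jobs/2/"))
# http://app.internal.example:8000/jobs/2/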
- Update allowed_domains to match the target job board hostname.
Include the exact host used in start_urls, including subdomains such as jobs.example.com.
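For example, assuming the board serves listings from jobs.example.com, either of the following keeps the crawl on-site; Scrapy's offsite filtering also accepts subdomains of any listed domain, so the bare registered domain covers www and jobs at once.

# Exact host used in start_urls:
allowed_domains = ["jobs.example.com"]

# Or the registered domain, which also permits www.example.com,
# jobs.example.com, and any other subdomain:
allowed_domains = ["example.com"]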
- Update start_urls to the job listing index page.
- Update the CSS selectors for listing cards, pagination, and detail fields.
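The scrapy shell is a convenient way to confirm the new selectors against a live page before running the full crawl. The session below uses the default selectors from the spider above; each expression should return the expected value for the target site.

$ scrapy shell "http://app.internal.example:8000/jobs/"
>>> response.css("article.job a::attr(href)").getall()
>>> response.css("a.next::attr(href)").get()
>>> fetch("http://app.internal.example:8000/jobs/site-reliability-engineer.html")
>>> response.css("h1::text").get()
>>> response.css("div.job-description ::text").getall()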
- Run the spider with JSON feed export enabled.
$ scrapy crawl jobs -O jobs.json
2026-01-01 09:46:44 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: jobs.json
Aggressive crawl rates can trigger temporary blocks, CAPTCHAs, or IP bans on job boards.
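Beyond the per-spider DOWNLOAD_DELAY already set in custom_settings, Scrapy's AutoThrottle extension adapts the request rate to observed server latency. A conservative sketch for the project's settings file (job_board/settings.py in this project layout):

# job_board/settings.py -- conservative politeness settings, tune per site.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

ROBOTSTXT_OBEY = True
RETRY_TIMES = 2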
- Print one exported record to confirm the detail fields were scraped.
$ python3 - <<'PY'
import json

with open("jobs.json", "r", encoding="utf-8") as f:
    items = json.load(f)

item = items[0]
print(item.get("title", ""))
print(item.get("team", ""))
print(item.get("location", ""))
print(item.get("url", ""))
print(len(item.get("description", "")))
PY
Site Reliability Engineer
Infrastructure
Remote
http://app.internal.example:8000/jobs/site-reliability-engineer.html
57
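To check the whole feed rather than a single record, a short script can flag entries whose detail fields came back empty, which usually signals a selector that no longer matches the detail page markup.

import json

with open("jobs.json", "r", encoding="utf-8") as f:
    items = json.load(f)

required = ("title", "location", "description", "url")
for index, item in enumerate(items):
    # Report any record with a missing or empty required field.
    missing = [field for field in required if not item.get(field)]
    if missing:
        print(f"record {index} missing: {', '.join(missing)}")

print(f"checked {len(items)} records")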
