Job boards commonly surface only a title, company, and a few tags on the listings index, leaving the full role description and requirements on a separate detail page. Crawling both pages produces complete records that remain useful for search, alerts, and analysis.
Scrapy fits this pattern by extracting each detail URL from the listing response and scheduling follow-up requests via response.follow(). Each detail response is parsed by a dedicated callback that yields one structured item, while the listing callback continues walking pagination until the next-page link disappears.
Markup and URL structures change frequently on job boards, and many sites use relative links or multiple hostnames (for example www and jobs). Keep the crawl constrained to the intended domain, normalize extracted text to strip noisy whitespace, throttle requests to avoid rate limiting, and expect JavaScript-rendered descriptions to come from an underlying API endpoint rather than the initial HTML.
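When a board renders descriptions client-side, the useful data usually arrives from a JSON endpoint the page calls, so the extraction step becomes JSON parsing instead of CSS selection. A minimal sketch follows; the field names (`title`, `description`) are assumptions for illustration, not any real board's schema:

```python
import json

def parse_api_jobs(body):
    """Extract title/description pairs from a JSON jobs payload.
    Field names are hypothetical, chosen only for this sketch."""
    return [
        {"title": job.get("title", ""), "description": job.get("description", "")}
        for job in json.loads(body)
    ]

# Sample payload shaped like a hypothetical internal jobs API response.
sample = '[{"title": "SRE", "description": "Run the on-call rotation."}]'
print(parse_api_jobs(sample))
```

In a Scrapy callback the same function would receive `response.text` from a request scheduled against the API URL, which you can usually spot in the browser's network tab.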
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
$ scrapy genspider jobs app.internal.example
Created spider 'jobs' using template 'basic' in module:
  job_board.spiders.jobs
$ vi job_board/spiders/jobs.py
import scrapy


def join_clean_text(texts):
    """Strip each text node and join the non-empty pieces with spaces."""
    parts = []
    for text in texts:
        cleaned = text.strip()
        if cleaned:
            parts.append(cleaned)
    return " ".join(parts)


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/jobs/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Schedule a detail request for every job card on the listing page.
        for card in response.css("article.job"):
            detail_href = card.css("a::attr(href)").get()
            if not detail_href:
                continue
            yield response.follow(detail_href, callback=self.parse_job)

        # Keep walking pagination until the next-page link disappears.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response):
        description = join_clean_text(
            response.css("div.job-description ::text").getall()
        )
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "team": response.css(".team::text").get(default="").strip(),
            "location": response.css(".location::text").get(default="").strip(),
            "description": description,
            "url": response.url,
        }
response.follow() resolves relative href values against the current page URL.
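The resolution response.follow() performs matches standard URL joining, as in urllib.parse.urljoin. A quick illustration, using a hypothetical listing URL from this tutorial's site:

```python
from urllib.parse import urljoin

# A relative href on a listing page resolves against the page's own URL,
# the same way response.follow() resolves it before scheduling a request.
page_url = "http://app.internal.example:8000/jobs/?page=2"  # hypothetical listing page
print(urljoin(page_url, "site-reliability-engineer.html"))
# → http://app.internal.example:8000/jobs/site-reliability-engineer.html
```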
Include the exact host used in start_urls, including subdomains such as jobs.example.com.
$ scrapy crawl jobs -O jobs.json
2026-01-01 09:46:44 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: jobs.json
Aggressive crawl rates can trigger temporary blocks, CAPTCHAs, or IP bans on job boards.
$ python3 - <<'PY'
import json
with open("jobs.json", "r", encoding="utf-8") as f:
items = json.load(f)
item = items[0]
print(item.get("title", ""))
print(item.get("team", ""))
print(item.get("location", ""))
print(item.get("url", ""))
print(len(item.get("description", "")))
PY
Site Reliability Engineer
Infrastructure
Remote
http://app.internal.example:8000/jobs/site-reliability-engineer.html
57