Job boards usually expose only teaser fields such as the role title, location, and summary on the listings page, while the full description, team, and employment type live on the detail page. A spider that stops at the index exports incomplete records and often repeats only the teaser data.
Current Scrapy releases fit this pattern well because scrapy shell can confirm the list-page and detail-page selectors before any crawl starts, response.follow() resolves the relative detail links directly, and cb_kwargs can carry listing-page values into the detail callback when the final page omits a field such as location.
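To preview that hand-off in isolation, here is a minimal sketch; the spider name and selectors are illustrative placeholders, and the full spider later in this guide adds pagination and fallback handling.
import scrapy


class PreviewSpider(scrapy.Spider):
    # Minimal sketch of the listing-to-detail hand-off only.
    name = "preview"
    start_urls = ["https://careers.example.com/jobs/"]

    def parse(self, response):
        for card in response.css("article.job"):
            href = card.css("a.detail::attr(href)").get()
            if href:
                # response.follow() resolves the relative href; cb_kwargs
                # forwards the teaser location to the detail callback.
                yield response.follow(
                    href,
                    callback=self.parse_job,
                    cb_kwargs={"card_location": card.css("p.location::text").get()},
                )

    def parse_job(self, response, card_location):
        yield {"title": response.css("h1::text").get(), "location": card_location}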
The current scrapy startproject template seeds ROBOTSTXT_OBEY = True and FEED_EXPORT_ENCODING = "utf-8" out of the box, and uncommenting its throttling stubs to set CONCURRENT_REQUESTS_PER_DOMAIN = 1 and DOWNLOAD_DELAY = 1 keeps the crawl polite, but job boards still mix ads, off-site application links, and JavaScript-only sections. Keep selectors anchored to the job card, widen allowed_domains only when the detail pages truly live elsewhere, and switch to an API or rendered workflow when the description is missing from the HTML that Scrapy actually downloads.
Related: How to scrape paginated pages with Scrapy
Related: How to use Scrapy shell
$ scrapy startproject job_board
New Scrapy project 'job_board', using template directory '##### snipped #####', created in:
/home/user/job_board
You can start your first spider with:
cd /home/user/job_board
scrapy genspider example example.com
$ cd /home/user/job_board
$ scrapy genspider jobs careers.example.com
Created spider 'jobs' using template 'basic' in module:
  job_board.spiders.jobs
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
FEED_EXPORT_ENCODING = "utf-8"
Current scrapy startproject templates already write ROBOTSTXT_OBEY and FEED_EXPORT_ENCODING; the concurrency and delay lines ship commented out, so set those two yourself. Raise concurrency or lower delay only after the selectors stay tight and the target site tolerates it.
Related: How to enable AutoThrottle in Scrapy
Related: How to set a download delay in Scrapy
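When the site does tolerate more traffic, AutoThrottle is usually a safer lever than hand-editing concurrency, because it adapts the delay to observed response latencies. A minimal settings.py sketch, with illustrative values:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # back off this far when the site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote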
$ scrapy shell 'https://careers.example.com/jobs/' --nolog
[s] Available Scrapy objects:
[s] response <200 https://careers.example.com/jobs/>
##### snipped #####
>>> response.css("article.job a.detail::attr(href)").getall()
['/jobs/site-reliability-engineer.html', '/jobs/platform-engineer.html']
>>> response.css("a.next::attr(href)").get()
'/jobs/page2.html'
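Both the detail and pagination hrefs are relative; response.follow() resolves them against the page URL the same way response.urljoin() does, which the same shell session can confirm:
>>> response.urljoin("/jobs/page2.html")
'https://careers.example.com/jobs/page2.html'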
Keep the selector anchored to the job card so banners, filter controls, and footer links do not widen the crawl.
Related: How to use CSS selectors in Scrapy
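A quick sanity check: the anchored selector should match exactly the job cards and nothing else. The count of 2 matches the two detail links above, whereas an unanchored selector such as a::attr(href) would also sweep in navigation, filter, and footer links.
>>> len(response.css("article.job a.detail::attr(href)").getall())
2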
$ scrapy shell 'https://careers.example.com/jobs/site-reliability-engineer.html' --nolog
[s] Available Scrapy objects:
[s] response <200 https://careers.example.com/jobs/site-reliability-engineer.html>
##### snipped #####
>>> response.css("h1::text").get()
'Site Reliability Engineer'
>>> response.css("div.job-description p::text").getall()
['Own service reliability.', 'Improve deployment safety.']
Test selectors against the HTML that Scrapy downloads, not the browser DOM after scripts run.
Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
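A substring test against response.text confirms the description really exists in the raw download; the shell's view(response) helper likewise opens the downloaded HTML in a browser for a visual check.
>>> "Own service reliability." in response.text
True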
import scrapy


def join_text(values):
    # Collapse the description paragraphs into a single clean string.
    return " ".join(value.strip() for value in values if value.strip())


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["careers.example.com"]
    start_urls = ["https://careers.example.com/jobs/"]

    def parse(self, response):
        for card in response.css("article.job"):
            href = card.css("a.detail::attr(href)").get()
            # Carry the teaser location along in case the detail page omits it.
            card_location = card.css("p.location::text").get(default="").strip()
            if href:
                yield response.follow(
                    href,
                    callback=self.parse_job,
                    cb_kwargs={"card_location": card_location},
                )
        # Feed pagination back through this same callback.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response, card_location):
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "team": response.css("p.team::text").get(default="").strip(),
            "location": response.css("p.location::text").get(default=card_location).strip(),
            "employment_type": response.css("p.employment-type::text").get(default="").strip(),
            "description": join_text(response.css("div.job-description p::text").getall()),
            "url": response.url,
        }
If the detail pages resolve under another host, add that host to allowed_domains; otherwise OffsiteMiddleware drops those requests before the detail callback ever runs.
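A sketch of the widened list, where jobs.example-ats.com stands in for whatever hypothetical applicant-tracking host actually serves the detail pages:
allowed_domains = ["careers.example.com", "jobs.example-ats.com"]  # second host is hypothetical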
$ scrapy crawl jobs -O jobs.json
2026-04-22 06:41:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.example.com/jobs/page2.html> (referer: https://careers.example.com/jobs/)
2026-04-22 06:41:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.example.com/jobs/platform-engineer.html>
{'title': 'Platform Engineer', 'team': 'Platform', 'location': 'Remote', 'employment_type': 'Full-time', 'description': 'Build internal tooling. Support CI pipelines.', 'url': 'https://careers.example.com/jobs/platform-engineer.html'}
##### snipped #####
2026-04-22 06:41:04 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: jobs.json
2026-04-22 06:41:04 [scrapy.core.engine] INFO: Spider closed (finished)
-O is the short form of --overwrite-output, so each test run replaces the previous local export instead of appending stale records.
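The lowercase -o appends instead of overwriting; because appending to a JSON array produces invalid JSON, pair -o with a line-oriented format such as JSON Lines (the .jl extension Scrapy recognizes):
$ scrapy crawl jobs -o jobs.jl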
$ cat jobs.json
[
{"title": "Platform Engineer", "team": "Platform", "location": "Remote", "employment_type": "Full-time", "description": "Build internal tooling. Support CI pipelines.", "url": "https://careers.example.com/jobs/platform-engineer.html"},
{"title": "Site Reliability Engineer", "team": "Platform", "location": "Kuala Lumpur", "employment_type": "Full-time", "description": "Own service reliability. Improve deployment safety.", "url": "https://careers.example.com/jobs/site-reliability-engineer.html"},
{"title": "Data Engineer", "team": "Data", "location": "Singapore", "employment_type": "Contract", "description": "Build ingestion pipelines. Maintain warehouse transforms.", "url": "https://careers.example.com/jobs/data-engineer.html"}
]
If location stays blank on detail pages that omit it, keep the listing-page value in cb_kwargs as shown above. If description stays empty even though the browser shows text, move to an API or rendered workflow.
Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
Related: How to use Selenium with Scrapy
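Before switching workflows, it helps to know how often the description is actually missing. A sketch of a parse_job variant for the spider above (it reuses its join_text helper; the log wording is illustrative):
def parse_job(self, response, card_location):
    description = join_text(response.css("div.job-description p::text").getall())
    if not description:
        # An empty description in the downloaded HTML usually means the
        # section is rendered by JavaScript the browser runs but Scrapy skips.
        self.logger.warning("No description in downloaded HTML: %s", response.url)
    yield {
        "title": response.css("h1::text").get(default="").strip(),
        "location": response.css("p.location::text").get(default=card_location).strip(),
        "description": description,
        "url": response.url,
    }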