Job boards usually expose only teaser fields such as the role title, location, and summary on the listings page, while the full description, team, and employment type live on the detail page. A spider that stops at the index exports incomplete records and often repeats only the teaser data.
Current Scrapy releases fit this pattern well because scrapy shell can confirm the list-page and detail-page selectors before any crawl starts, response.follow() resolves the relative detail links directly, and cb_kwargs can carry listing-page values into the detail callback when the final page omits a field such as location.
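The relative-link resolution that response.follow() performs matches the standard library's urljoin, which makes it easy to sanity-check a detail href offline before crawling. A quick sketch with placeholder URLs:

```python
from urllib.parse import urljoin

# Listings-page URL and a relative detail href as they might appear
# in a crawl; both values are illustrative placeholders.
listing_url = "https://careers.example.com/jobs/"
detail_href = "/jobs/site-reliability-engineer.html"

# response.follow(href) requests the same absolute URL that urljoin
# produces from the response URL and the relative href.
absolute_url = urljoin(listing_url, detail_href)
print(absolute_url)
# https://careers.example.com/jobs/site-reliability-engineer.html
```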
The current scrapy startproject template already seeds polite defaults such as ROBOTSTXT_OBEY = True, CONCURRENT_REQUESTS_PER_DOMAIN = 1, DOWNLOAD_DELAY = 1, and FEED_EXPORT_ENCODING = "utf-8", but job boards still mix ads, off-site application links, and JavaScript-only sections. Keep selectors anchored to the job card, widen allowed_domains only when the detail pages truly live elsewhere, and switch to an API or rendered workflow when the description is missing from the HTML that Scrapy actually downloads.
Related: How to scrape paginated pages with Scrapy
Related: How to use Scrapy shell
Steps to scrape job listings with detail pages in Scrapy:
- Create a new Scrapy project for the list-to-detail crawl.
$ scrapy startproject job_board
New Scrapy project 'job_board', using template directory '##### snipped #####', created in:
    /home/user/job_board

You can start your first spider with:
    cd /home/user/job_board
    scrapy genspider example example.com
- Change to the new project directory.
$ cd /home/user/job_board
- Generate a basic spider for the job-board host.
$ scrapy genspider jobs careers.example.com
Created spider 'jobs' using template 'basic' in module:
  job_board.spiders.jobs
- Review the generated crawl settings before raising request rates.
- job_board/settings.py
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
FEED_EXPORT_ENCODING = "utf-8"
Current scrapy startproject templates already write these values. Raise concurrency or shorten the delay only once the selectors are confirmed tight and the target site tolerates the extra load.
Related: How to enable AutoThrottle in Scrapy
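Instead of a fixed delay, AutoThrottle can adapt the delay to observed server latency. A minimal settings sketch; the values here are illustrative starting points, not tuned for any particular site:

```python
# job_board/settings.py -- illustrative AutoThrottle values only.
AUTOTHROTTLE_ENABLED = True            # adapt delay to server latency
AUTOTHROTTLE_START_DELAY = 1           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # cap when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # avg parallel requests per remote
```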
Related: How to set a download delay in Scrapy
- Start scrapy shell against the listings page and confirm the detail-link and pagination selectors.
$ scrapy shell 'https://careers.example.com/jobs/' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://careers.example.com/jobs/>
##### snipped #####
>>> response.css("article.job a.detail::attr(href)").getall()
['/jobs/site-reliability-engineer.html', '/jobs/platform-engineer.html']
>>> response.css("a.next::attr(href)").get()
'/jobs/page2.html'
Keep the selector anchored to the job card so banners, filter controls, and footer links do not widen the crawl.
Related: How to use CSS selectors in Scrapy
- Start scrapy shell against one job detail page and confirm the fields that only exist after following the listing URL.
$ scrapy shell 'https://careers.example.com/jobs/site-reliability-engineer.html' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://careers.example.com/jobs/site-reliability-engineer.html>
##### snipped #####
>>> response.css("h1::text").get()
'Site Reliability Engineer'
>>> response.css("div.job-description p::text").getall()
['Own service reliability.', 'Improve deployment safety.']
Test selectors against the HTML that Scrapy downloads, not the browser DOM after scripts run.
Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
- Replace job_board/spiders/jobs.py with a spider that follows each job card into its detail page, carries the listing location with cb_kwargs, and queues the next results page when present.
- job_board/spiders/jobs.py
import scrapy


def join_text(values):
    return " ".join(value.strip() for value in values if value.strip())


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["careers.example.com"]
    start_urls = ["https://careers.example.com/jobs/"]

    def parse(self, response):
        for card in response.css("article.job"):
            href = card.css("a.detail::attr(href)").get()
            card_location = card.css("p.location::text").get(default="").strip()
            if href:
                yield response.follow(
                    href,
                    callback=self.parse_job,
                    cb_kwargs={"card_location": card_location},
                )
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response, card_location):
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "team": response.css("p.team::text").get(default="").strip(),
            "location": response.css("p.location::text")
            .get(default=card_location)
            .strip(),
            "employment_type": response.css("p.employment-type::text")
            .get(default="")
            .strip(),
            "description": join_text(
                response.css("div.job-description p::text").getall()
            ),
            "url": response.url,
        }
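The join_text helper in the spider collapses the fragment list that getall() returns into one clean string. A standalone check of that behavior, using invented fragments with the stray whitespace nodes that ::text typically yields:

```python
def join_text(values):
    # Same helper as in the spider: strip each fragment, drop
    # whitespace-only entries, then join with single spaces.
    return " ".join(value.strip() for value in values if value.strip())

# Fragments as ::text might return them; values invented for the demo.
fragments = ["  Own service reliability. ", "\n", "Improve deployment safety."]
print(join_text(fragments))
# Own service reliability. Improve deployment safety.
```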
If the detail pages resolve under another host, add that host to allowed_domains; otherwise the offsite middleware drops those requests before the detail callback ever runs.
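For example, if the detail links resolve to a separate applicant-tracking host, list both hosts in the spider. The second hostname here is hypothetical:

```python
# job_board/spiders/jobs.py -- the ATS hostname is hypothetical;
# without it, the offsite middleware silently filters those requests.
allowed_domains = ["careers.example.com", "apply.example-ats.com"]
```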
- Run the spider and overwrite the previous JSON export on each test crawl.
$ scrapy crawl jobs -O jobs.json
2026-04-22 06:41:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.example.com/jobs/page2.html> (referer: https://careers.example.com/jobs/)
2026-04-22 06:41:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.example.com/jobs/platform-engineer.html>
{'title': 'Platform Engineer', 'team': 'Platform', 'location': 'Remote', 'employment_type': 'Full-time', 'description': 'Build internal tooling. Support CI pipelines.', 'url': 'https://careers.example.com/jobs/platform-engineer.html'}
##### snipped #####
2026-04-22 06:41:04 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: jobs.json
2026-04-22 06:41:04 [scrapy.core.engine] INFO: Spider closed (finished)
-O is the short form of --overwrite-output, so each test run replaces the previous local export instead of appending stale records.
- Open the exported JSON and confirm that each record includes the fields collected from the detail page.
$ cat jobs.json
[
{"title": "Platform Engineer", "team": "Platform", "location": "Remote", "employment_type": "Full-time", "description": "Build internal tooling. Support CI pipelines.", "url": "https://careers.example.com/jobs/platform-engineer.html"},
{"title": "Site Reliability Engineer", "team": "Platform", "location": "Kuala Lumpur", "employment_type": "Full-time", "description": "Own service reliability. Improve deployment safety.", "url": "https://careers.example.com/jobs/site-reliability-engineer.html"},
{"title": "Data Engineer", "team": "Data", "location": "Singapore", "employment_type": "Contract", "description": "Build ingestion pipelines. Maintain warehouse transforms.", "url": "https://careers.example.com/jobs/data-engineer.html"}
]
If location stays blank on detail pages that omit it, keep the listing-page value in cb_kwargs as shown above. If description stays empty even though the browser shows text, move to an API or rendered workflow.
Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
Related: How to use Selenium with Scrapy
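Completeness of the export can also be checked with a short standard-library script. This sketch assumes the field names from the spider above and uses a tiny inline sample in place of the real jobs.json:

```python
import json

# Fields that only a successful detail-page visit can fill.
REQUIRED = {"title", "team", "location", "employment_type", "description", "url"}

def missing_fields(records):
    """Return (url, sorted missing fields) for each incomplete record."""
    return [
        (record.get("url", "?"), sorted(REQUIRED - record.keys()))
        for record in records
        if not REQUIRED <= record.keys()
    ]

# In practice: records = json.load(open("jobs.json")); the inline
# sample below stands in for the export, with one incomplete record.
records = json.loads("""[
  {"title": "Platform Engineer", "team": "Platform", "location": "Remote",
   "employment_type": "Full-time", "description": "Build internal tooling.",
   "url": "https://careers.example.com/jobs/platform-engineer.html"},
  {"title": "Data Engineer",
   "url": "https://careers.example.com/jobs/data-engineer.html"}
]""")
print(missing_fields(records))  # flags the incomplete second record
```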
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
