Job boards often split visible listing cards from the full role description, requirements, and metadata on a separate page, so scraping only the index leaves the exported data incomplete. Following each listing into its detail page produces records that are useful for search, alerts, and later analysis.
In Scrapy, the listing callback extracts job-detail URLs and pagination URLs from each Response, then schedules new requests with response.follow(). A second callback parses the detail page and yields one structured item, while the listing callback keeps moving through the jobs index until no next-page link remains.
Job boards change markup often, mix relative and absolute links, and sometimes place sponsored or off-site cards in the same grid as real openings. Confirm selectors in scrapy shell before hard-coding them, keep the crawl limited to the intended domain, and throttle requests so the run does not trigger rate limits or block pages that would leave the export incomplete.
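One way to throttle the crawl is Scrapy's AutoThrottle extension, which adapts the download delay to the server's observed latency. A minimal settings.py fragment with illustrative values (tune them per site):

```python
# settings.py -- illustrative throttling values; adjust for the target site.
AUTOTHROTTLE_ENABLED = True          # adapt the delay to measured server latency
AUTOTHROTTLE_START_DELAY = 1.0       # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0        # cap the delay when the server slows down
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # limit parallel requests per host
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
```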
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape job listings with detail pages in Scrapy:
- Create a new Scrapy project for the job-board crawler.
$ scrapy startproject job_board
New Scrapy project 'job_board', using template directory '##### snipped #####', created in:
    /root/sg-work/job_board

You can start your first spider with:
    cd job_board
    scrapy genspider example example.com
- Change to the new project directory.
$ cd job_board
- Generate a spider for the target job-board host.
$ scrapy genspider jobs app.internal.example
Created spider 'jobs' using template 'basic' in module:
  job_board.spiders.jobs
- Probe the listings page in scrapy shell to confirm the detail-link and pagination selectors.
$ scrapy shell "http://app.internal.example:8000/jobs/"
##### snipped #####
>>> response.css("article.job a::attr(href)").getall()
['/jobs/site-reliability-engineer.html', '/jobs/platform-engineer.html']
>>> response.css("a.next::attr(href)").get()
'/jobs/page2.html'
Related: How to use Scrapy shell
- Probe one job detail page in scrapy shell to confirm the title and description selectors.
$ scrapy shell "http://app.internal.example:8000/jobs/site-reliability-engineer.html"
##### snipped #####
>>> response.css("h1::text").get()
'Site Reliability Engineer'
>>> response.css("div.job-description ::text").getall()
['Own service reliability.', 'Improve deployment safety.']
Prefer selectors that target stable headings, labels, or container structure instead of short-lived CSS class names.
- Replace the generated spider with a two-stage crawl that follows each job card into its detail page.
import scrapy


def join_clean_text(texts):
    # Strip each text node, drop whitespace-only nodes, and join with single spaces.
    return " ".join(text.strip() for text in texts if text.strip())


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/jobs/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Stage one: follow each job card into its detail page.
        for card in response.css("article.job"):
            detail_href = card.css("a::attr(href)").get()
            if detail_href:
                yield response.follow(detail_href, callback=self.parse_job)
        # Keep paging through the jobs index until no next-page link remains.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response):
        # Stage two: yield one structured item per detail page.
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "team": response.css(".team::text").get(default="").strip(),
            "location": response.css(".location::text").get(default="").strip(),
            "description": join_clean_text(
                response.css("div.job-description ::text").getall()
            ),
            "url": response.url,
        }
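As a quick sanity check, the join_clean_text helper used by the spider drops whitespace-only text nodes and joins the remaining fragments with single spaces:

```python
def join_clean_text(texts):
    # Strip each fragment, drop empties, and join with single spaces.
    return " ".join(text.strip() for text in texts if text.strip())

print(join_clean_text(["  Own service reliability.", "\n", " Improve deployment safety. "]))
# Own service reliability. Improve deployment safety.
```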
response.follow() accepts relative links directly, so there is no need to call response.urljoin() first.
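The relative hrefs resolve against the current page URL in the same way urllib.parse.urljoin resolves them; this stdlib sketch shows the resolution that response.follow() performs for you:

```python
from urllib.parse import urljoin

base = "http://app.internal.example:8000/jobs/"

# Root-relative link, as returned by the pagination selector.
print(urljoin(base, "/jobs/page2.html"))
# http://app.internal.example:8000/jobs/page2.html

# Page-relative link resolves against the /jobs/ directory.
print(urljoin(base, "platform-engineer.html"))
# http://app.internal.example:8000/jobs/platform-engineer.html
```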
- Update allowed_domains, start_urls, and the CSS selectors to match the actual job-board host and markup.
If the detail pages live on a different parent domain, OffsiteMiddleware drops those requests until allowed_domains includes that domain.
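The offsite check boils down to a hostname match against allowed_domains. This hypothetical, simplified re-implementation (not Scrapy's actual code) illustrates why requests to an unlisted parent domain get dropped:

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    # Simplified sketch of the offsite test: a request passes if its host
    # equals an allowed domain or is a subdomain of one.
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in allowed_domains)

allowed = ["app.internal.example"]
print(is_allowed("http://app.internal.example:8000/jobs/page2.html", allowed))  # True
print(is_allowed("https://careers.partner.example/role/42", allowed))           # False: dropped
```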
- Run the spider with JSON feed export enabled.
$ scrapy crawl jobs -O jobs.json
##### snipped #####
2026-04-16 05:45:54 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: jobs.json
Aggressive crawl rates can trigger rate limits, CAPTCHAs, or account blocks on job boards.
- Print the first exported record to confirm the detail fields were written to the feed.
$ python3 -c "import json; print(json.load(open('jobs.json', encoding='utf-8'))[0])"
{'title': 'Platform Engineer', 'team': 'Platform', 'location': 'Kuala Lumpur', 'description': 'Build internal tooling. Support CI pipelines.', 'url': 'http://app.internal.example:8000/jobs/platform-engineer.html'}
Blank description values or repeated listing URLs usually mean the detail-page selectors or callback target still need adjustment.
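A small follow-up check can flag records where the detail-page parse likely failed, assuming the field names used by the spider above:

```python
import json

def incomplete_records(records):
    # Flag items whose detail fields came back empty, which usually means
    # the detail-page selectors or callback target still need adjustment.
    return [r["url"] for r in records if not r.get("title") or not r.get("description")]

records = json.loads(
    '[{"title": "Platform Engineer", "description": "Build internal tooling.",'
    ' "url": "http://app.internal.example:8000/jobs/platform-engineer.html"},'
    ' {"title": "", "description": "",'
    ' "url": "http://app.internal.example:8000/jobs/broken.html"}]'
)
print(incomplete_records(records))
# ['http://app.internal.example:8000/jobs/broken.html']
```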
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
