Job boards commonly surface only a title, company, and a few tags on the listings index, leaving the full role description and requirements on a separate detail page. Crawling both pages produces complete records that remain useful for search, alerts, and analysis.
Scrapy fits this pattern by extracting each detail URL from the listing response and scheduling follow-up requests via response.follow(). Each detail response is parsed by a dedicated callback that yields one structured item, while the listing callback continues walking pagination until the next-page link disappears.
Markup and URL structures change frequently on job boards, and many sites use relative links or multiple hostnames (for example www and jobs). Keep the crawl constrained to the intended domain, normalize extracted text to remove noisy whitespace, throttle requests to avoid rate limiting, and expect JavaScript-rendered descriptions to need fetching from the site's underlying API endpoint rather than parsing the page HTML.
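When descriptions are rendered client-side, the detail stage can request the JSON document the page's JavaScript would otherwise fetch. The sketch below assumes a hypothetical /api/jobs/<slug> endpoint and a data-slug attribute on each card; the real request URL and payload shape have to be taken from the browser's network tab.

import scrapy


class ApiJobsSpider(scrapy.Spider):
    """Sketch for boards whose descriptions are rendered client-side.

    The /api/jobs/<slug> endpoint and the data-slug attribute are
    hypothetical; find the real request in the browser's network tab.
    """

    name = "jobs_api"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/jobs/"]

    def parse(self, response):
        for card in response.css("article.job"):
            slug = card.attrib.get("data-slug")  # hypothetical attribute
            if slug:
                # Request the JSON the page would load to render the description.
                yield response.follow(f"/api/jobs/{slug}", callback=self.parse_job_api)

    def parse_job_api(self, response):
        data = response.json()  # parses the JSON body (Scrapy >= 2.2)
        yield {
            "title": data.get("title", ""),
            "description": data.get("description", ""),
            "url": response.url,
        }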
Related: How to scrape paginated pages with Scrapy
Related: How to use CSS selectors in Scrapy
Steps to scrape job listings with detail pages in Scrapy:
- Generate a spider for the job board domain.
$ scrapy genspider jobs app.internal.example
Created spider 'jobs' using template 'basic' in module: job_board.spiders.jobs
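The basic template produces a minimal skeleton roughly like the following (exact quoting and URL scheme vary by Scrapy version); the next steps replace it entirely.

import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    def parse(self, response):
        pass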
- Open the spider file for editing.
$ vi job_board/spiders/jobs.py
- Replace the spider code with a two-stage crawl that follows each job card link.
import scrapy


def join_clean_text(texts):
    # Strip surrounding whitespace and drop empty fragments before joining.
    parts = []
    for text in texts:
        cleaned = text.strip()
        if cleaned:
            parts.append(cleaned)
    return " ".join(parts)


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/jobs/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Stage 1: follow each job card to its detail page.
        for card in response.css("article.job"):
            detail_href = card.css("a::attr(href)").get()
            if not detail_href:
                continue
            yield response.follow(detail_href, callback=self.parse_job)

        # Keep walking pagination until the next-page link disappears.
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_job(self, response):
        # Stage 2: build one structured record from the detail page.
        description = join_clean_text(
            response.css("div.job-description ::text").getall()
        )
        yield {
            "title": response.css("h1::text").get(default="").strip(),
            "team": response.css(".team::text").get(default="").strip(),
            "location": response.css(".location::text").get(default="").strip(),
            "description": description,
            "url": response.url,
        }
response.follow() resolves relative href values against the current page URL.
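The resolution follows standard URL joining against the listing page URL, which the standard library can illustrate on its own:

from urllib.parse import urljoin

# The same joining response.follow() applies to a relative href,
# shown here against the listing page URL used in this example.
listing_url = "http://app.internal.example:8000/jobs/"
print(urljoin(listing_url, "site-reliability-engineer.html"))
# http://app.internal.example:8000/jobs/site-reliability-engineer.html
print(urljoin(listing_url, "/jobs/2/"))
# http://app.internal.example:8000/jobs/2/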
- Update allowed_domains to match the target job board hostname.
Include the exact host used in start_urls, including subdomains such as jobs.example.com.
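For example, assuming the board serves listings from jobs.example.com, either of the following keeps the crawl on-site; Scrapy's offsite filtering also accepts subdomains of any listed domain, so the bare registered domain covers www and jobs at once.

# Exact host used in start_urls:
allowed_domains = ["jobs.example.com"]

# Or the registered domain, which also permits www.example.com,
# jobs.example.com, and any other subdomain:
allowed_domains = ["example.com"]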
- Update start_urls to the job listing index page.
- Update the CSS selectors for listing cards, pagination, and detail fields.
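The scrapy shell is a convenient way to confirm the new selectors against a live page before running the full crawl. The session below uses the default selectors from the spider above; each expression should return the expected value for the target site.

$ scrapy shell "http://app.internal.example:8000/jobs/"
>>> response.css("article.job a::attr(href)").getall()
>>> response.css("a.next::attr(href)").get()
>>> fetch("http://app.internal.example:8000/jobs/site-reliability-engineer.html")
>>> response.css("h1::text").get()
>>> response.css("div.job-description ::text").getall()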
- Run the spider with JSON feed export enabled.
$ scrapy crawl jobs -O jobs.json
2026-01-01 09:46:44 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: jobs.json
Aggressive crawl rates can trigger temporary blocks, CAPTCHAs, or IP bans on job boards.
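Beyond the per-spider DOWNLOAD_DELAY already set in custom_settings, Scrapy's AutoThrottle extension adapts the request rate to observed server latency. A conservative sketch for the project's settings file (job_board/settings.py in this project layout):

# job_board/settings.py -- conservative politeness settings, tune per site.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

ROBOTSTXT_OBEY = True
RETRY_TIMES = 2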
- Print one exported record to confirm the detail fields were scraped.
$ python3 - <<'PY'
import json

with open("jobs.json", "r", encoding="utf-8") as f:
    items = json.load(f)

item = items[0]
print(item.get("title", ""))
print(item.get("team", ""))
print(item.get("location", ""))
print(item.get("url", ""))
print(len(item.get("description", "")))
PY
Site Reliability Engineer
Infrastructure
Remote
http://app.internal.example:8000/jobs/site-reliability-engineer.html
57
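To check the whole feed rather than a single record, a short script can flag entries whose detail fields came back empty, which usually signals a selector that no longer matches the detail page markup.

import json

with open("jobs.json", "r", encoding="utf-8") as f:
    items = json.load(f)

required = ("title", "location", "description", "url")
for index, item in enumerate(items):
    # Report any record with a missing or empty required field.
    missing = [field for field in required if not item.get(field)]
    if missing:
        print(f"record {index} missing: {', '.join(missing)}")

print(f"checked {len(items)} records")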
