Job boards commonly surface only a title, company, and a few tags on the listings index, leaving the full role description and requirements on a separate detail page. Crawling both pages produces complete records that remain useful for search, alerts, and analysis.

Scrapy fits this pattern by extracting each detail URL from the listing response and scheduling follow-up requests via response.follow(). Each detail response is parsed by a dedicated callback that yields one structured item, while the listing callback continues walking pagination until the next-page link disappears.

Markup and URL structures change frequently on job boards, and many sites use relative links or multiple hostnames (for example www and jobs). Keep the crawl constrained to the intended domain, normalize extracted text to strip noisy whitespace, and throttle requests to avoid rate limiting. If descriptions are rendered client-side with JavaScript, plan on fetching the underlying API endpoint rather than parsing the HTML.
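When descriptions arrive as JSON from an XHR endpoint rather than server-rendered HTML, the payload can be mapped onto the same item shape the HTML spider produces. The sketch below assumes a hypothetical API; the endpoint path and field names are illustrative, not part of any real job board:

```python
import json

def item_from_api(payload: str, url: str) -> dict:
    """Map a hypothetical job-API JSON payload onto the item shape
    used by the HTML spider. Field names here are assumptions."""
    data = json.loads(payload)
    return {
        "title": data.get("title", ""),
        "team": data.get("team", ""),
        "location": data.get("location", ""),
        # collapse runs of whitespace, mirroring the HTML normalization
        "description": " ".join(data.get("description", "").split()),
        "url": url,
    }

sample = '{"title": "SRE", "team": "Infra", "location": "Remote", "description": "  On-call\\n rotation "}'
print(item_from_api(sample, "http://app.internal.example:8000/api/jobs/sre"))
```

The callback for such an endpoint would call response.json() and feed the resulting dict through the same mapping.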

Steps to scrape job listings with detail pages in Scrapy:

  1. Generate a spider for the job board domain.
    $ scrapy genspider jobs app.internal.example
    Created spider 'jobs' using template 'basic' in module:
      job_board.spiders.jobs
  2. Open the spider file for editing.
    $ vi job_board/spiders/jobs.py
  3. Replace the spider code with a two-stage crawl that follows each job card link.
    import scrapy
     
     
    def join_clean_text(texts):
        parts = []
        for text in texts:
            cleaned = text.strip()
            if cleaned:
                parts.append(cleaned)
        return " ".join(parts)
     
     
    class JobsSpider(scrapy.Spider):
        name = "jobs"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/jobs/"]
     
        custom_settings = {
            "DOWNLOAD_DELAY": 1.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        }
     
        def parse(self, response):
            for card in response.css("article.job"):
                detail_href = card.css("a::attr(href)").get()
                if not detail_href:
                    continue
                yield response.follow(detail_href, callback=self.parse_job)
     
            next_href = response.css("a.next::attr(href)").get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)
     
        def parse_job(self, response):
            description = join_clean_text(
                response.css("div.job-description ::text").getall()
            )
            yield {
                "title": response.css("h1::text").get(default="").strip(),
                "team": response.css(".team::text").get(default="").strip(),
                "location": response.css(".location::text").get(default="").strip(),
                "description": description,
                "url": response.url,
            }

    response.follow() resolves relative href values against the current page URL.
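    The resolution follows standard URL-joining rules, which can be checked directly with urllib.parse.urljoin (the URLs below reuse this example's hosts):

```python
from urllib.parse import urljoin

# a relative href on the listing page resolves against the page URL,
# which is what response.follow() does internally
page = "http://app.internal.example:8000/jobs/"
print(urljoin(page, "site-reliability-engineer.html"))
# http://app.internal.example:8000/jobs/site-reliability-engineer.html
print(urljoin(page, "/jobs/?page=2"))
# http://app.internal.example:8000/jobs/?page=2
```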

  4. Update allowed_domains to match the target job board hostname.

    Include the exact host used in start_urls. A bare domain such as example.com also matches its subdomains, so jobs.example.com would be allowed as well.
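    A sketch of the matching rule (this mimics how Scrapy's offsite filtering treats allowed_domains; it is not the actual implementation):

```python
from urllib.parse import urlparse

def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    # a host passes if it equals an allowed domain or is a
    # subdomain of one
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_allowed("https://jobs.example.com/listing/1", ["example.com"]))  # True
print(is_allowed("https://example.org/jobs", ["example.com"]))            # False
```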

  5. Update start_urls to the job listing index page.
  6. Update the CSS selectors for listing cards, pagination, and detail fields to match the target site's markup.
  7. Run the spider with JSON feed export enabled.
    $ scrapy crawl jobs -O jobs.json
    2026-01-01 09:46:44 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: jobs.json

    Aggressive crawl rates can trigger temporary blocks, CAPTCHAs, or IP bans on job boards.
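    Instead of a fixed DOWNLOAD_DELAY, Scrapy's AutoThrottle extension can adapt the delay to observed server latency. The values below are a conservative starting point, not a tuned configuration:

```python
# settings fragment for the spider's custom_settings (or settings.py):
# AutoThrottle adjusts the delay per response time; retries cover 429s
custom_settings = {
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
}
```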

  8. Print one exported record to confirm the detail fields were scraped.
    $ python3 - <<'PY'
    import json
    
    with open("jobs.json", "r", encoding="utf-8") as f:
        items = json.load(f)
    
    item = items[0]
    print(item.get("title", ""))
    print(item.get("team", ""))
    print(item.get("location", ""))
    print(item.get("url", ""))
    print(len(item.get("description", "")))
    PY
    Site Reliability Engineer
    Infrastructure
    Remote
    http://app.internal.example:8000/jobs/site-reliability-engineer.html
    57
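As a follow-up check across the whole export, a small helper can flag records with empty fields. The helper and sample data below are illustrative, not part of the crawl output:

```python
def find_incomplete(items, required=("title", "location", "description")):
    """Return URLs of records missing any required field."""
    return [it.get("url", "") for it in items
            if any(not it.get(field) for field in required)]

sample = [
    {"title": "SRE", "location": "Remote", "description": "x", "url": "u1"},
    {"title": "SWE", "location": "", "description": "y", "url": "u2"},
]
print(find_incomplete(sample))  # ['u2']
```

Running it over the parsed contents of jobs.json highlights detail pages whose selectors need another look.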