Job boards often split visible listing cards from the full role description, requirements, and metadata on a separate page, so scraping only the index leaves the exported data incomplete. Following each listing into its detail page produces records that are useful for search, alerts, and later analysis.

In Scrapy, the listing callback extracts job-detail URLs and pagination URLs from each Response, then schedules new requests with response.follow(). A second callback parses the detail page and yields one structured item, while the listing callback keeps moving through the jobs index until no next-page link remains.

Job boards change markup often, mix relative and absolute links, and sometimes place sponsored or off-site cards in the same grid as real openings. Confirm selectors in scrapy shell before hard-coding them, keep the crawl limited to the intended domain, and throttle requests so the run does not hit rate limits or get served block pages, either of which would leave the export incomplete.

Steps to scrape job listings with detail pages in Scrapy:

  1. Create a new Scrapy project for the job-board crawler.
    $ scrapy startproject job_board
    New Scrapy project 'job_board', using template directory '##### snipped #####', created in:
        /root/sg-work/job_board
    
    You can start your first spider with:
        cd job_board
        scrapy genspider example example.com
  2. Change to the new project directory.
    $ cd job_board
  3. Generate a spider for the target job-board host.
    $ scrapy genspider jobs app.internal.example
    Created spider 'jobs' using template 'basic' in module:
      job_board.spiders.jobs
  4. Probe the listings page in scrapy shell to confirm the detail-link and pagination selectors.
    $ scrapy shell "http://app.internal.example:8000/jobs/"
    ##### snipped #####
    >>> response.css("article.job a::attr(href)").getall()
    ['/jobs/site-reliability-engineer.html', '/jobs/platform-engineer.html']
    >>> response.css("a.next::attr(href)").get()
    '/jobs/page2.html'
  5. Probe one job detail page in scrapy shell to confirm the title and description selectors.
    $ scrapy shell "http://app.internal.example:8000/jobs/site-reliability-engineer.html"
    ##### snipped #####
    >>> response.css("h1::text").get()
    'Site Reliability Engineer'
    >>> response.css("div.job-description ::text").getall()
    ['Own service reliability.', 'Improve deployment safety.']

    Prefer selectors that target stable headings, labels, or container structure instead of short-lived CSS class names.

  6. Replace the generated spider with a two-stage crawl that follows each job card into its detail page.
    import scrapy
     
     
    def join_clean_text(texts):
        return " ".join(text.strip() for text in texts if text.strip())
     
     
    class JobsSpider(scrapy.Spider):
        name = "jobs"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/jobs/"]
     
        custom_settings = {
            "DOWNLOAD_DELAY": 1.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        }
     
        def parse(self, response):
            for card in response.css("article.job"):
                detail_href = card.css("a::attr(href)").get()
                if detail_href:
                    yield response.follow(detail_href, callback=self.parse_job)
     
            next_href = response.css("a.next::attr(href)").get()
            if next_href:
                yield response.follow(next_href, callback=self.parse)
     
        def parse_job(self, response):
            yield {
                "title": response.css("h1::text").get(default="").strip(),
                "team": response.css(".team::text").get(default="").strip(),
                "location": response.css(".location::text").get(default="").strip(),
                "description": join_clean_text(
                    response.css("div.job-description ::text").getall()
                ),
                "url": response.url,
            }

    response.follow() accepts relative links directly, so there is no need to call response.urljoin() first.
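    Under the hood the resolution follows standard URL-joining rules, which the stdlib exposes as urllib.parse.urljoin; this sketch shows how the hrefs seen in steps 4 and 5 resolve against the listings URL.

```python
# How relative hrefs resolve against the page URL, using the stdlib's
# urljoin. response.follow() applies the same standard joining rules.
from urllib.parse import urljoin

base = "http://app.internal.example:8000/jobs/"

absolute_path = urljoin(base, "/jobs/page2.html")       # root-relative href
relative = urljoin(base, "page2.html")                  # page-relative href
full_url = urljoin(base, "http://other.example/x")      # absolute URLs pass through

print(absolute_path)  # http://app.internal.example:8000/jobs/page2.html
print(relative)       # http://app.internal.example:8000/jobs/page2.html
print(full_url)       # http://other.example/x
```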

  7. Update allowed_domains, start_urls, and the CSS selectors to match the actual job-board host and markup.

    If the detail pages live on a different parent domain, OffsiteMiddleware drops those requests until allowed_domains includes that domain.
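    The rule OffsiteMiddleware applies is roughly "host equals an allowed domain, or is a subdomain of one"; this simplified sketch (not Scrapy's actual implementation, which compiles a regex) shows why listing the parent domain also permits detail pages on sibling subdomains.

```python
# Simplified sketch of the host check OffsiteMiddleware performs:
# a request host passes if it equals an allowed domain or is a
# subdomain of one. Not Scrapy's exact code, just the matching rule.
from urllib.parse import urlparse

def host_allowed(url, allowed_domains):
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

print(host_allowed("http://app.internal.example:8000/jobs/", ["internal.example"]))  # True
print(host_allowed("http://careers.internal.example/x", ["internal.example"]))       # True
print(host_allowed("http://other.example/x", ["internal.example"]))                  # False
```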

  8. Run the spider with JSON feed export enabled.
    $ scrapy crawl jobs -O jobs.json
    ##### snipped #####
    2026-04-16 05:45:54 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: jobs.json

    Aggressive crawl rates can trigger rate limits, CAPTCHAs, or account blocks on job boards.
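    One way to back off automatically is Scrapy's AutoThrottle extension, enabled alongside the fixed delay in custom_settings or settings.py; the numbers below are illustrative starting points, not values taken from any particular job board.

```python
# Illustrative throttling settings combining the fixed delay from the
# spider above with Scrapy's AutoThrottle extension, which adapts the
# delay to observed server latency. Values here are starting points only.
custom_settings = {
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}

print(custom_settings["AUTOTHROTTLE_ENABLED"])  # True
```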

  9. Print the first exported record to confirm the detail fields were written to the feed.
    $ python3 -c "import json; print(json.load(open('jobs.json', encoding='utf-8'))[0])"
    {'title': 'Platform Engineer', 'team': 'Platform', 'location': 'Kuala Lumpur', 'description': 'Build internal tooling. Support CI pipelines.', 'url': 'http://app.internal.example:8000/jobs/platform-engineer.html'}

    Blank description values or repeated listing URLs usually mean the detail-page selectors or callback target still need adjustment.
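    Those two failure signs can be checked mechanically; this sketch scans a list of records (inline stand-ins here for json.load on jobs.json, with one deliberately broken entry) for blank descriptions and repeated URLs.

```python
# Sanity check for the two failure signs above: blank descriptions and
# duplicate URLs. The records are inline stand-ins for
# json.load(open("jobs.json")), with one deliberately broken entry.
records = [
    {"title": "Platform Engineer",
     "description": "Build internal tooling. Support CI pipelines.",
     "url": "http://app.internal.example:8000/jobs/platform-engineer.html"},
    {"title": "Site Reliability Engineer",
     "description": "",
     "url": "http://app.internal.example:8000/jobs/platform-engineer.html"},
]

blank_titles = [r["title"] for r in records if not r["description"].strip()]
urls = [r["url"] for r in records]
duplicate_urls = {u for u in urls if urls.count(u) > 1}

print(blank_titles)    # titles whose description came back empty
print(duplicate_urls)  # URLs exported more than once
```

    Either list being non-empty points back at the detail-page selectors or the callback wiring in parse().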