News sites hide most historical coverage behind archive pages, so a spider that only parses the homepage or latest listing misses older articles that still matter for backfills, analytics, or search indexes.

A Scrapy archive spider starts from one archive URL, extracts article links from each listing page, follows the pagination link to older pages, and yields normalized article records from the detail pages. That keeps the crawl logic focused on two page types: archive pages that discover work and article pages that extract the final fields.

Archive templates often mix real stories with promos, category links, and repeating calendar navigation, so selectors and page limits need to stay tight while the crawl is being tuned. Scrapy passes spider arguments such as max_pages to the spider as strings, so numeric limits should be converted before they are compared rather than used raw, and recent scrapy startproject output already enables polite defaults such as ROBOTSTXT_OBEY = True.
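
A quick plain-Python illustration of the string-argument point (nothing here is Scrapy-specific):

    max_pages = "5"   # -a max_pages=5 reaches the spider as the string "5"
    pages_seen = 2
    # pages_seen < max_pages would raise TypeError in Python 3,
    # so convert the limit once before comparing it:
    print(pages_seen < int(max_pages))  # True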

Steps to scrape news article archives with Scrapy:

  1. Create a new Scrapy project for the archive crawl.
    $ scrapy startproject archive_demo
    New Scrapy project 'archive_demo', using template directory '##### snipped #####', created in:
        /srv/archive_demo
    
    You can start your first spider with:
        cd /srv/archive_demo
        scrapy genspider example example.com
  2. Change into the project directory.
    $ cd /srv/archive_demo
  3. Generate a basic spider for the archive host.
    $ scrapy genspider news news.example.net
    Created spider 'news' using template 'basic' in module:
      archive_demo.spiders.news
  4. Test the archive selectors in scrapy shell before editing the spider.
    $ scrapy shell 'https://news.example.net/archive/' --nolog
    [s] Available Scrapy objects:
    [s]   response   <200 https://news.example.net/archive/>
    ##### snipped #####
    >>> response.css("article.archive-entry h2 a::attr(href)").getall()
    ['articles/quarterly-results.html', 'articles/platform-upgrade.html']
    >>> response.css("nav.pagination a.next::attr(href)").get()
    'page2.html'

    The archive selector should return only article detail links, and the next-page selector should return one pagination URL or None when the archive ends.
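
    It can also help to load the next archive page in the same shell session and confirm the selectors still match there; fetch() combined with response.urljoin() mirrors what response.follow() will do later, and the real output depends on the target site:

    >>> fetch(response.urljoin("page2.html"))
    ##### snipped #####
    >>> response.css("article.archive-entry h2 a::attr(href)").getall()
    ##### snipped #####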

  5. Replace the generated spider with archive discovery and article parsing logic.
    archive_demo/spiders/news.py
    from urllib.parse import urlparse
     
    import scrapy
     
     
    class NewsSpider(scrapy.Spider):
        name = "news"
     
        async def start(self):
            archive_url = getattr(
                self,
                "archive_url",
                "https://news.example.net/archive/",
            )
            self.max_pages = int(getattr(self, "max_pages", "2"))
            self.pages_seen = 0
     
            host = urlparse(archive_url).hostname
            if host:
                self.allowed_domains = [host]
     
            yield scrapy.Request(archive_url, callback=self.parse)
     
        def parse(self, response):
            self.pages_seen += 1
     
            yield from response.follow_all(
                response.css("article.archive-entry h2 a::attr(href)"),
                callback=self.parse_article,
            )
     
            if self.pages_seen < self.max_pages:
                next_page = response.css("nav.pagination a.next::attr(href)").get()
                if next_page:
                    yield response.follow(next_page, callback=self.parse)
     
        def parse_article(self, response):
            yield {
                "headline": response.css("h1::text").get(default="").strip(),
                "published": response.css("time::attr(datetime)").get(default="").strip(),
                "summary": response.css("div.article-summary p::text").get(default="").strip(),
                "url": response.url,
            }

    Replace the archive, pagination, and article selectors with the target site's actual HTML, and keep archive_url as an absolute URL so allowed_domains is derived from the correct host.
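
    The article-page selectors can be checked the same way as the archive selectors in step 4; with one of the example articles that appears in the export later, a shell session would look roughly like this:

    $ scrapy shell 'https://news.example.net/articles/quarterly-results.html' --nolog
    ##### snipped #####
    >>> response.css("h1::text").get()
    'Quarterly results'
    >>> response.css("time::attr(datetime)").get()
    '2026-04-01T08:00:00Z'
    >>> response.css("div.article-summary p::text").get()
    'Revenue and subscription retention both increased.'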

  6. Enable AutoThrottle in settings.py while keeping the generated pacing defaults.
    archive_demo/settings.py
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0

    Current scrapy startproject output already writes ROBOTSTXT_OBEY = True, CONCURRENT_REQUESTS_PER_DOMAIN = 1, DOWNLOAD_DELAY = 1, and FEED_EXPORT_ENCODING = "utf-8" into settings.py, so this step only adds adaptive slowdown for when the archive server responds more slowly.

    Archive pages can fan out quickly into hundreds or thousands of article URLs, so keep max_pages low until the selectors stop following promos, tags, and off-topic sections.
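
    As an extra guard while tuning, the built-in close-spider extension can cap the whole crawl regardless of how the selectors behave; the threshold below is an arbitrary example and can also be passed per run with -s CLOSESPIDER_PAGECOUNT=50.

    archive_demo/settings.py
    CLOSESPIDER_PAGECOUNT = 50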

  7. Run the spider with an explicit archive URL, a small page limit, and JSON export.
    $ scrapy crawl news -a archive_url=https://news.example.net/archive/ -a max_pages=2 -O articles.json
    ##### snipped #####
    2026-04-16 06:09:52 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: articles.json
    2026-04-16 06:09:52 [scrapy.core.engine] INFO: Spider closed (finished)

    -O overwrites any existing articles.json and keeps the result as one valid JSON array when the crawl finishes cleanly.
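
    If the crawl may be interrupted or re-run in batches, a JSON Lines feed is a common alternative; Scrapy infers the format from the .jsonl extension, and lowercase -o appends to the file instead of overwriting it:

    $ scrapy crawl news -a archive_url=https://news.example.net/archive/ -a max_pages=2 -o articles.jsonl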

  8. Review the exported records before increasing the archive depth.
    $ cat articles.json
    [
    {"headline": "Platform upgrade", "published": "2026-03-18T09:30:00Z", "summary": "Backend services moved to the new deployment baseline.", "url": "https://news.example.net/articles/platform-upgrade.html"},
    {"headline": "Quarterly results", "published": "2026-04-01T08:00:00Z", "summary": "Revenue and subscription retention both increased.", "url": "https://news.example.net/articles/quarterly-results.html"},
    {"headline": "Customer story", "published": "2026-01-22T11:45:00Z", "summary": "A cross-region migration finished without customer downtime.", "url": "https://news.example.net/articles/customer-story.html"},
    {"headline": "Security advisory", "published": "2026-02-10T14:00:00Z", "summary": "Operators should rotate credentials after the maintenance window.", "url": "https://news.example.net/articles/security-advisory.html"}
    ]

    The published timestamps, URLs, and record count should match the archive pages that were allowed by max_pages.
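
    A quick count check with only the standard library confirms the feed matches the item count reported in the crawl log:

    $ python3 -c "import json; print(len(json.load(open('articles.json'))))"
    4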

Notes

  • async def start() is the current Scrapy pattern for spiders that need to normalize spider arguments or build the first request dynamically.
  • response.follow_all() resolves relative article links from the current archive page, while response.follow() handles the next archive page without manual urljoin() code.
  • Keep the archive link selector narrower than a broad site-wide a::attr(href) match so section headers, promo cards, or repeated footer links do not flood the crawl queue.
  • If the archive exposes dates on the listing page, add a cutoff check in parse() so the spider can stop before it reaches older irrelevant pages.
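
A minimal sketch of that cutoff check, assuming the listing page exposes a time::attr(datetime) value inside each archive entry and that self.cutoff was set in start() (for example self.cutoff = getattr(self, "cutoff", "2026-01-01")):

    def parse(self, response):
        self.pages_seen += 1

        yield from response.follow_all(
            response.css("article.archive-entry h2 a::attr(href)"),
            callback=self.parse_article,
        )

        # Newest-first archives let the crawl stop at the first listing date
        # older than the cutoff; ISO 8601 timestamps sort chronologically,
        # so a plain string comparison is enough here.
        dates = response.css("article.archive-entry time::attr(datetime)").getall()
        reached_cutoff = any(date < self.cutoff for date in dates)

        if not reached_cutoff and self.pages_seen < self.max_pages:
            next_page = response.css("nav.pagination a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)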