News sites often publish archive pages that list articles by date, section, or pagination. Scraping those listings builds a complete set of article URLs for downstream parsing, de-duplication, and historical backfills.

A Scrapy spider can start from an archive URL, extract article links, request each article page, and yield structured fields such as the headline, publication timestamp, summary, and canonical URL. Following the archive's pagination keeps the spider moving across older pages until a crawl limit stops the run.

Long-running sites can have very deep archives, and crawling them exhaustively can trigger rate limits or blocks. Apply a page limit (or a date cutoff when archive pages expose dates), keep concurrency conservative, and avoid following non-article links that inflate the crawl scope.
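
The spider built in the steps below enforces a page limit; when archive entries expose machine-readable dates, a cutoff check along these lines can bound the crawl instead. This is a minimal sketch: the article time::attr(datetime) selector, the ISO-8601 date format, and the cutoff value are placeholder assumptions to adapt to the real archive markup.

    from datetime import date, datetime

    CUTOFF = date(2025, 1, 1)  # placeholder backfill boundary

    def page_is_older_than_cutoff(response) -> bool:
        """Return True when every dated entry on an archive page predates the cutoff."""
        stamps = response.css("article time::attr(datetime)").getall()
        dates = [datetime.fromisoformat(s).date() for s in stamps if s]
        # Stop paginating only when dates were found and all of them fall before the cutoff.
        return bool(dates) and max(dates) < CUTOFF

A parse callback can skip following the next-page link once this check returns True.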

Steps to scrape news article archives with Scrapy:

  1. Generate a spider for the archive host in the Scrapy project.
    $ scrapy genspider news app.internal.example
    Created spider 'news' using template 'basic' in module:
      simplifiedguide.spiders.news

    The generated spider is typically stored under <project>/<project>/spiders/news.py.
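
    The stub's exact contents vary by Scrapy version, but the basic template produces roughly the following skeleton, which step 2 replaces:

    import scrapy


    class NewsSpider(scrapy.Spider):
        name = "news"
        allowed_domains = ["app.internal.example"]
        start_urls = ["https://app.internal.example"]

        def parse(self, response):
            pass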

  2. Configure the spider's archive parser for article links, pagination, and crawl limits.
    from __future__ import annotations
     
    from typing import Any, Dict, Iterator, Optional
    from urllib.parse import urlparse
     
    import scrapy
    from w3lib.url import canonicalize_url
     
     
    class NewsSpider(scrapy.Spider):
        """Scrape a news archive, following article links plus archive pagination.
     
        Use spider arguments to control scope:
          - start_url: absolute archive URL to start from
          - max_pages: maximum number of archive pages to parse
        """
     
        name = "news"
     
        custom_settings = {
            "AUTOTHROTTLE_ENABLED": True,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
            "DOWNLOAD_DELAY": 1.0,
            "ROBOTSTXT_OBEY": True,
        }
     
        archive_article_css = "article h2 a::attr(href)"
        archive_next_css = "a[rel='next']::attr(href), a.next::attr(href)"
     
        headline_css = "h1::text"
        published_css = "time::attr(datetime)"
        summary_css = "p::text"
     
        def __init__(
            self,
            start_url: Optional[str] = None,
            max_pages: Optional[str] = None,
            *args: Any,
            **kwargs: Any,
        ) -> None:
            super().__init__(*args, **kwargs)
     
            archive_url = start_url or "http://app.internal.example:8000/news/"
            parsed = urlparse(archive_url)
     
            host = (parsed.hostname or "").strip().lower()
            if not parsed.scheme or not host:
                raise ValueError(
                    "start_url must be an absolute URL, e.g. http://app.internal.example:8000/news/"
                )
     
            allowed = {host}
            if host.startswith("www."):
                allowed.add(host[4:])
            else:
                allowed.add(f"www.{host}")
     
            self.allowed_domains = sorted(allowed)
            self.start_urls = [archive_url]
     
            self.max_pages = int(max_pages) if max_pages else None
            self.pages_seen = 0
     
        def parse(self, response: scrapy.http.Response) -> Iterator[scrapy.Request]:
            """Parse an archive page, enqueueing article requests plus the next archive page."""
            self.pages_seen += 1
            if self.max_pages is not None and self.pages_seen > self.max_pages:
                return
     
            for href in response.css(self.archive_article_css).getall():
                yield response.follow(href, callback=self.parse_article)
     
            next_page = response.css(self.archive_next_css).get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
     
        def parse_article(
            self, response: scrapy.http.Response
        ) -> Iterator[Dict[str, Optional[str]]]:
            """Extract a minimal article record from a news detail page."""
            headline = response.css(self.headline_css).get()
            published = response.css(self.published_css).get()
            summary = response.css(self.summary_css).get()
     
            item: Dict[str, Optional[str]] = {
                "headline": headline.strip() if headline else None,
                "published": published.strip() if published else None,
                "summary": summary.strip() if summary else None,
                "url": canonicalize_url(response.url),
            }
            yield item

    Replace archive_article_css, archive_next_css, and the field selectors with site-specific CSS selectors taken from the archive and article HTML.
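
    Selectors can also be verified interactively before a full run. The shell session below assumes the same archive URL used throughout this guide and simply prints whatever the placeholder selectors match on the live page.

    $ scrapy shell http://app.internal.example:8000/news/
    >>> response.css("article h2 a::attr(href)").getall()
    >>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()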

  3. Run the spider with a page limit, exporting items to JSON.
    $ scrapy crawl news -a start_url=http://app.internal.example:8000/news/ -a max_pages=1 -O news.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
    2026-01-01 09:24:51 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: news.json

    High request rates can trigger bans or CAPTCHAs; keep scope small during selector development.
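
    Scrapy infers the feed format from the output extension, so the same crawl can write JSON Lines (one object per line) with the long-supported .jl extension, which is often easier to stream into downstream de-duplication:

    $ scrapy crawl news -a start_url=http://app.internal.example:8000/news/ -a max_pages=1 -O news.jl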

  4. Inspect the first extracted items to confirm the expected fields and URL shapes.
    $ head -n 12 news.json
    [
    {"headline": "Usage Tips", "published": "2025-12-15", "summary": "Best practices for keeping crawls stable.", "url": "http://app.internal.example:8000/news/usage-tips.html"},
    {"headline": "Launch Update", "published": "2026-01-01", "summary": "New features are now available to all accounts.", "url": "http://app.internal.example:8000/news/launch-update.html"}
    ]
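
    To spot-check URL shapes on their own, list just the canonical URLs from the export:

    $ python -c 'import json; [print(i["url"]) for i in json.load(open("news.json", encoding="utf-8"))]'
    http://app.internal.example:8000/news/usage-tips.html
    http://app.internal.example:8000/news/launch-update.html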

  5. Count extracted items to validate archive coverage against the configured scope.
    $ python -c 'import json; print(len(json.load(open("news.json", encoding="utf-8"))))'
    2
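
    For the downstream de-duplication mentioned at the start, a short post-processing pass can drop repeated canonical URLs and flag records missing a headline. This is a sketch against the news.json layout shown above:

    import json

    # Load the feed exported in step 3.
    with open("news.json", encoding="utf-8") as fh:
        items = json.load(fh)

    seen = set()
    unique = []
    missing_headline = []
    for item in items:
        url = item.get("url")
        if url in seen:
            continue  # skip duplicate canonical URLs
        seen.add(url)
        unique.append(item)
        if not item.get("headline"):
            missing_headline.append(url)

    print(f"{len(unique)} unique articles, {len(missing_headline)} missing a headline")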