News sites often publish archive pages that list articles by date or section, usually split across paginated listings. Scraping those listings builds a complete set of article URLs for downstream parsing, de-duplication, and historical backfills.
A Scrapy spider can start from an archive URL, extract article links, request each article page, and yield structured fields such as the headline, publish timestamp, summary, and canonical URL. Following pagination links keeps the spider moving across older archive pages until a crawl limit stops the run.
Archives on long-running sites can be very deep, and crawling them end to end can trigger rate limits or blocks. Apply a page limit (or a date cutoff when archive pages expose dates), keep concurrency conservative, and avoid following non-article links that inflate the crawl scope.
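For example, a page-count cap and conservative throttling can be set project-wide in settings.py; the values below are illustrative starting points rather than tuned recommendations:
# settings.py -- example crawl limits (values are illustrative, adjust per site)
ROBOTSTXT_OBEY = True                # honor robots.txt
AUTOTHROTTLE_ENABLED = True          # adapt request pacing to server response times
DOWNLOAD_DELAY = 1.0                 # baseline delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # keep per-host concurrency low
CLOSESPIDER_PAGECOUNT = 50           # stop the crawl after this many responses
The spider below applies similar limits through custom_settings, so per-spider overrides are also an option.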
Steps to scrape news article archives with Scrapy:
- Generate a spider for the archive host in the Scrapy project.
$ scrapy genspider news app.internal.example
Created spider 'news' using template 'basic' in module:
  simplifiedguide.spiders.news
The generated spider is typically stored under <project>/<project>/spiders/news.py.
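If needed, confirm the new spider is registered before editing it; scrapy list prints the names of all spiders the project can find:
$ scrapy list
news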
- Configure the spider to parse the archive for article links, follow pagination, and enforce crawl limits.
from __future__ import annotations

from typing import Any, Dict, Iterator, Optional
from urllib.parse import urlparse

import scrapy
from w3lib.url import canonicalize_url


class NewsSpider(scrapy.Spider):
    """Scrape a news archive, following article links plus archive pagination.

    Use spider arguments to control scope:
    - start_url: absolute archive URL to start from
    - max_pages: maximum number of archive pages to parse
    """

    name = "news"

    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "DOWNLOAD_DELAY": 1.0,
        "ROBOTSTXT_OBEY": True,
    }

    archive_article_css = "article h2 a::attr(href)"
    archive_next_css = "a[rel='next']::attr(href), a.next::attr(href)"
    headline_css = "h1::text"
    published_css = "time::attr(datetime)"
    summary_css = "p::text"

    def __init__(
        self,
        start_url: Optional[str] = None,
        max_pages: Optional[str] = None,
        *args: Any,
        **kwargs: Any,
    ) -> None:
        super().__init__(*args, **kwargs)

        archive_url = start_url or "http://app.internal.example:8000/news/"
        parsed = urlparse(archive_url)
        host = (parsed.hostname or "").strip().lower()
        if not parsed.scheme or not host:
            raise ValueError(
                "start_url must be an absolute URL, e.g. http://app.internal.example:8000/news/"
            )

        allowed = {host}
        if host.startswith("www."):
            allowed.add(host[4:])
        else:
            allowed.add(f"www.{host}")

        self.allowed_domains = sorted(allowed)
        self.start_urls = [archive_url]
        self.max_pages = int(max_pages) if max_pages else None
        self.pages_seen = 0

    def parse(self, response: scrapy.http.Response) -> Iterator[scrapy.Request]:
        """Parse an archive page, enqueueing article requests plus the next archive page."""
        self.pages_seen += 1
        if self.max_pages is not None and self.pages_seen > self.max_pages:
            return

        for href in response.css(self.archive_article_css).getall():
            yield response.follow(href, callback=self.parse_article)

        next_page = response.css(self.archive_next_css).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(
        self, response: scrapy.http.Response
    ) -> Iterator[Dict[str, Optional[str]]]:
        """Extract a minimal article record from a news detail page."""
        headline = response.css(self.headline_css).get()
        published = response.css(self.published_css).get()
        summary = response.css(self.summary_css).get()

        item: Dict[str, Optional[str]] = {
            "headline": headline.strip() if headline else None,
            "published": published.strip() if published else None,
            "summary": summary.strip() if summary else None,
            "url": canonicalize_url(response.url),
        }
        yield item
Replace archive_article_css, archive_next_css, and the field selectors with site-specific CSS selectors taken from the archive and article HTML.
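One way to find and verify those selectors is an interactive scrapy shell session against a live archive page; the expressions below reuse the placeholder selectors from the spider and are not guaranteed to match any particular site:
$ scrapy shell http://app.internal.example:8000/news/
>>> response.css("article h2 a::attr(href)").getall()
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()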
- Run the spider with a page limit, exporting items to JSON.
$ scrapy crawl news -a start_url=http://app.internal.example:8000/news/ -a max_pages=1 -O news.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
2026-01-01 09:24:51 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: news.json
High request rates can trigger bans or CAPTCHAs; keep scope small during selector development.
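When archive listings expose per-article dates, a date cutoff can bound scope further. The sketch below shows one way to extend the NewsSpider class above; archive_row_css, archive_date_css, and the cutoff value are assumptions about the listing markup, not known selectors:
# Sketch only: additions to the NewsSpider class body for a date cutoff.
from datetime import date

# Assumed listing markup: one <article> row per entry with an ISO date inside it.
archive_row_css = "article"
archive_date_css = "time::attr(datetime)"
cutoff = date(2025, 1, 1)  # example backfill boundary

def parse(self, response):
    reached_cutoff = False
    for row in response.css(self.archive_row_css):
        href = row.css("h2 a::attr(href)").get()
        raw = row.css(self.archive_date_css).get()
        # Skip rows older than the cutoff; assumes ISO 8601 dates in the listing.
        if raw and date.fromisoformat(raw[:10]) < self.cutoff:
            reached_cutoff = True
            continue
        if href:
            yield response.follow(href, callback=self.parse_article)
    # Stop following pagination once the listing crosses the cutoff.
    if not reached_cutoff:
        next_page = response.css(self.archive_next_css).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)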
- Inspect the first extracted items to confirm the expected fields and URL shapes.
$ head -n 12 news.json
[
{"headline": "Usage Tips", "published": "2025-12-15", "summary": "Best practices for keeping crawls stable.", "url": "http://app.internal.example:8000/news/usage-tips.html"},
{"headline": "Launch Update", "published": "2026-01-01", "summary": "New features are now available to all accounts.", "url": "http://app.internal.example:8000/news/launch-update.html"}
]
- Count extracted items to validate archive coverage against the configured scope.
$ python -c 'import json; print(len(json.load(open("news.json", encoding="utf-8"))))'
2
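A short script can go beyond a raw count and flag records with missing fields or duplicate URLs before the feed moves downstream; the field names match the spider output above, and the file name assumes the news.json export from the earlier run:
# check_news.py -- sanity-check the exported feed
import json

with open("news.json", encoding="utf-8") as fh:
    items = json.load(fh)

required = ("headline", "published", "summary", "url")
incomplete = [it.get("url") for it in items if any(not it.get(field) for field in required)]
duplicates = len(items) - len({it.get("url") for it in items})

print(f"{len(items)} items, {len(incomplete)} with missing fields, {duplicates} duplicate URLs")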
