News sites often publish archive pages that list articles by date or section, usually split across paginated listings. Scraping those listings builds a complete set of article URLs for downstream parsing, de-duplication, and historical backfills.
A Scrapy spider can start from an archive URL, extract article links, request each article page, and yield structured fields such as the headline, publish timestamp, summary, and canonical URL. Following pagination links keeps the spider moving across older archive pages until a crawl limit stops the run.
Archives on long-running sites can be very deep, and crawling them end to end can trigger rate limits or blocks. Apply a page limit (or a date cutoff when archive pages expose dates), keep concurrency conservative, and avoid following non-article links that inflate the crawl scope.
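For example, a page-count cap and conservative throttling can be set project-wide in settings.py; the values below are illustrative starting points rather than tuned recommendations:
# settings.py -- example crawl limits (values are illustrative, adjust per site)
ROBOTSTXT_OBEY = True                # honor robots.txt
AUTOTHROTTLE_ENABLED = True          # adapt request pacing to server response times
DOWNLOAD_DELAY = 1.0                 # baseline delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # keep per-host concurrency low
CLOSESPIDER_PAGECOUNT = 50           # stop the crawl after this many responses
The spider below applies similar limits through custom_settings, so per-spider overrides are also an option.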
Steps to scrape news article archives with Scrapy:
- Generate a spider for the archive host in the Scrapy project.
$ scrapy genspider news app.internal.example
Created spider 'news' using template 'basic' in module:
  simplifiedguide.spiders.news
The generated spider is typically stored under <project>/<project>/spiders/news.py.
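If needed, confirm the new spider is registered before editing it; scrapy list prints the names of all spiders the project can find:
$ scrapy list
news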
- Configure the spider to parse the archive for article links, follow pagination, and enforce crawl limits.
from __future__ import annotations

from typing import Any, Dict, Iterator, Optional
from urllib.parse import urlparse

import scrapy
from w3lib.url import canonicalize_url


class NewsSpider(scrapy.Spider):
    """Scrape a news archive, following article links plus archive pagination.

    Use spider arguments to control scope:
    - start_url: absolute archive URL to start from
    - max_pages: maximum number of archive pages to parse
    """

    name = "news"

    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "DOWNLOAD_DELAY": 1.0,
        "ROBOTSTXT_OBEY": True,
    }

    archive_article_css = "article h2 a::attr(href)"
    archive_next_css = "a[rel='next']::attr(href), a.next::attr(href)"
    headline_css = "h1::text"
    published_css = "time::attr(datetime)"
    summary_css = "p::text"

    def __init__(
        self,
        start_url: Optional[str] = None,
        max_pages: Optional[str] = None,
        *args: Any,
        **kwargs: Any,
    ) -> None:
        super().__init__(*args, **kwargs)

        archive_url = start_url or "http://app.internal.example:8000/news/"
        parsed = urlparse(archive_url)
        host = (parsed.hostname or "").strip().lower()
        if not parsed.scheme or not host:
            raise ValueError(
                "start_url must be an absolute URL, e.g. http://app.internal.example:8000/news/"
            )

        allowed = {host}
        if host.startswith("www."):
            allowed.add(host[4:])
        else:
            allowed.add(f"www.{host}")

        self.allowed_domains = sorted(allowed)
        self.start_urls = [archive_url]
        self.max_pages = int(max_pages) if max_pages else None
        self.pages_seen = 0

    def parse(self, response: scrapy.http.Response) -> Iterator[scrapy.Request]:
        """Parse an archive page, enqueueing article requests plus the next archive page."""
        self.pages_seen += 1
        if self.max_pages is not None and self.pages_seen > self.max_pages:
            return

        for href in response.css(self.archive_article_css).getall():
            yield response.follow(href, callback=self.parse_article)

        next_page = response.css(self.archive_next_css).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(
        self, response: scrapy.http.Response
    ) -> Iterator[Dict[str, Optional[str]]]:
        """Extract a minimal article record from a news detail page."""
        headline = response.css(self.headline_css).get()
        published = response.css(self.published_css).get()
        summary = response.css(self.summary_css).get()

        item: Dict[str, Optional[str]] = {
            "headline": headline.strip() if headline else None,
            "published": published.strip() if published else None,
            "summary": summary.strip() if summary else None,
            "url": canonicalize_url(response.url),
        }
        yield item
Replace archive_article_css, archive_next_css, and the field selectors with site-specific CSS selectors taken from the archive and article HTML.
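One way to find and verify those selectors is an interactive scrapy shell session against a live archive page; the expressions below reuse the placeholder selectors from the spider and are not guaranteed to match any particular site:
$ scrapy shell http://app.internal.example:8000/news/
>>> response.css("article h2 a::attr(href)").getall()
>>> response.css("a[rel='next']::attr(href), a.next::attr(href)").get()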
- Run the spider with a page limit, exporting items to JSON.
$ scrapy crawl news -a start_url=http://app.internal.example:8000/news/ -a max_pages=1 -O news.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
2026-01-01 09:24:51 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: news.json
High request rates can trigger bans or CAPTCHAs; keep scope small during selector development.
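When archive listings expose per-article dates, a date cutoff can bound scope further. The sketch below shows one way to extend the NewsSpider class above; archive_row_css, archive_date_css, and the cutoff value are assumptions about the listing markup, not known selectors:
# Sketch only: additions to the NewsSpider class body for a date cutoff.
from datetime import date

# Assumed listing markup: one <article> row per entry with an ISO date inside it.
archive_row_css = "article"
archive_date_css = "time::attr(datetime)"
cutoff = date(2025, 1, 1)  # example backfill boundary

def parse(self, response):
    reached_cutoff = False
    for row in response.css(self.archive_row_css):
        href = row.css("h2 a::attr(href)").get()
        raw = row.css(self.archive_date_css).get()
        # Skip rows older than the cutoff; assumes ISO 8601 dates in the listing.
        if raw and date.fromisoformat(raw[:10]) < self.cutoff:
            reached_cutoff = True
            continue
        if href:
            yield response.follow(href, callback=self.parse_article)
    # Stop following pagination once the listing crosses the cutoff.
    if not reached_cutoff:
        next_page = response.css(self.archive_next_css).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)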
- Inspect the first extracted items to confirm the expected fields and URL shapes.
$ head -n 12 news.json
[
{"headline": "Usage Tips", "published": "2025-12-15", "summary": "Best practices for keeping crawls stable.", "url": "http://app.internal.example:8000/news/usage-tips.html"},
{"headline": "Launch Update", "published": "2026-01-01", "summary": "New features are now available to all accounts.", "url": "http://app.internal.example:8000/news/launch-update.html"}
]
- Count extracted items to validate archive coverage against the configured scope.
$ python -c 'import json; print(len(json.load(open("news.json", encoding="utf-8"))))'
2
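A short script can go beyond a raw count and flag records with missing fields or duplicate URLs before the feed moves downstream; the field names match the spider output above, and the file name assumes the news.json export from the earlier run:
# check_news.py -- sanity-check the exported feed
import json

with open("news.json", encoding="utf-8") as fh:
    items = json.load(fh)

required = ("headline", "published", "summary", "url")
incomplete = [it.get("url") for it in items if any(not it.get(field) for field in required)]
duplicates = len(items) - len({it.get("url") for it in items})

print(f"{len(items)} items, {len(incomplete)} with missing fields, {duplicates} duplicate URLs")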
