Many modern websites publish key page metadata as JSON-LD for search engines and rich previews. Pulling those fields directly avoids brittle HTML scraping and yields consistent values for titles, publication dates, authors, and canonical URLs across different templates.
Most JSON-LD appears inside one or more <script type="application/ld+json"> elements embedded in the HTML. Scrapy can select those script blocks with a CSS selector, parse the JSON payload, and extract the target object by its schema.org @type, such as NewsArticle or Product.
Pages often include multiple application/ld+json blocks, arrays of objects, or an @graph section that nests several objects in one payload. Some sites also emit malformed JSON due to templating, so parsing must tolerate json.JSONDecodeError and skip bad blocks instead of failing the crawl.
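The sketch below shows that tolerance in plain Python before it is wired into a spider; the two sample payloads are invented for illustration only:

import json

raw_blocks = [
    '{"@context": "https://schema.org", "@graph": [{"@type": "Product", "name": "Starter Plan"}]}',
    '{"@type": "Organization", "name": "Example Co",',  # malformed: truncated by a template
]

objects = []
for raw in raw_blocks:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        continue  # skip the bad block instead of failing the crawl
    # Unwrap @graph payloads and top-level arrays into individual objects.
    if isinstance(data, dict) and isinstance(data.get("@graph"), list):
        objects.extend(data["@graph"])
    elif isinstance(data, list):
        objects.extend(data)
    else:
        objects.append(data)

print(objects)  # [{'@type': 'Product', 'name': 'Starter Plan'}]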
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
Steps to extract JSON-LD data with Scrapy:
- Start the Scrapy shell for the target URL.
$ scrapy shell http://app.internal.example:8000/jsonld/
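If a shell session is already open, the built-in fetch() helper loads the same page without restarting the shell:

>>> fetch("http://app.internal.example:8000/jsonld/")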
- Extract all application/ld+json blocks from the response.
>>> blocks = response.css('script[type="application/ld+json"]::text').getall()
>>> len(blocks)
1
When a page emits multiple blocks, they commonly include BreadcrumbList, Organization, and the primary content type.
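The same text nodes can also be selected with XPath if you prefer it over CSS; the equivalent expression is:

>>> blocks = response.xpath('//script[@type="application/ld+json"]/text()').getall()
>>> len(blocks)
1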
- Preview a JSON-LD block to confirm the expected @type is present.
>>> blocks[0].strip().replace("\n", "")[:160]
'{ "@context": "https://schema.org", "@type": "Product", "name": "Starter Plan", "sku": "starter-plan", "offers": { "@type": "Offer", "price": "29",'
- Add JSON-LD extraction logic to a spider.
import json
from json import JSONDecodeError
from typing import Any, Dict, Iterable, Optional, Set

import scrapy


class JsonldProductSpider(scrapy.Spider):
    name = "jsonld_product"
    start_urls = ["http://app.internal.example:8000/jsonld/"]
    target_types: Set[str] = {"Product"}

    def parse(self, response: scrapy.http.Response) -> Iterable[Dict[str, Any]]:
        # Parse every JSON-LD block on the page and yield items for matching types.
        blocks = response.css('script[type="application/ld+json"]::text').getall()
        for raw in blocks:
            for obj in self._iter_jsonld_objects(raw):
                if not self._is_target_type(obj):
                    continue
                offers = obj.get("offers") or {}
                yield {
                    "name": obj.get("name"),
                    "sku": obj.get("sku"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                    "url": response.url,
                    "jsonld_type": obj.get("@type"),
                }

    def _iter_jsonld_objects(self, raw: str) -> Iterable[Dict[str, Any]]:
        # Tolerate malformed blocks; unwrap @graph payloads and top-level arrays.
        try:
            data: Any = json.loads(raw.strip())
        except JSONDecodeError:
            return
        if isinstance(data, dict) and isinstance(data.get("@graph"), list):
            for node in data["@graph"]:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, list):
            for node in data:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, dict):
            yield data

    def _is_target_type(self, obj: Dict[str, Any]) -> bool:
        # @type may be a single string or a list of strings.
        value: Any = obj.get("@type")
        types: Set[str] = set()
        if isinstance(value, str):
            types.add(value)
        if isinstance(value, list):
            for v in value:
                if isinstance(v, str):
                    types.add(v)
        return bool(types & self.target_types)

    def _author_name(self, author: Any) -> Optional[str]:
        # Helper for article-type objects, where author may be a string, object, or list.
        candidate: Any = author
        if isinstance(candidate, list) and candidate:
            candidate = candidate[0]
        if isinstance(candidate, dict):
            name = candidate.get("name")
            if isinstance(name, str):
                return name
        if isinstance(candidate, str):
            return candidate
        return None
An unhandled json.JSONDecodeError aborts the parse callback for that response, so items from the remaining blocks on the page are lost; catching the exception and skipping the bad block keeps extraction going.
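The spider's unused _author_name helper shows how the same pattern adapts to article pages. Below is a sketch that subclasses the spider for NewsArticle objects; the subclass name and article URL are hypothetical, while the field names follow schema.org:

class JsonldArticleSpider(JsonldProductSpider):
    name = "jsonld_article"
    start_urls = ["http://app.internal.example:8000/article/"]  # hypothetical URL
    target_types = {"NewsArticle", "Article"}

    def parse(self, response):
        blocks = response.css('script[type="application/ld+json"]::text').getall()
        for raw in blocks:
            for obj in self._iter_jsonld_objects(raw):
                if not self._is_target_type(obj):
                    continue
                yield {
                    "headline": obj.get("headline"),
                    "date_published": obj.get("datePublished"),
                    "author": self._author_name(obj.get("author")),
                    "url": response.url,
                    "jsonld_type": obj.get("@type"),
                }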
- Run the spider with JSON feed export enabled.
$ scrapy crawl jsonld_product -O product.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
2026-01-01 09:12:27 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
The -O option overwrites the output file if it already exists.
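As an alternative to the command-line flag, the same export can be configured per project with the FEEDS setting; a sketch for settings.py, where the file name is only an example and the overwrite flag needs Scrapy 2.4 or later:

# settings.py — equivalent of passing -O product.json on the command line
FEEDS = {
    "product.json": {
        "format": "json",
        "overwrite": True,  # mirrors the -O behaviour
    },
}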
- Inspect the exported item to confirm the extracted fields.
$ cat product.json
[
  {
    "name": "Starter Plan",
    "sku": "starter-plan",
    "price": "29",
    "currency": "USD",
    "url": "http://app.internal.example:8000/jsonld/",
    "jsonld_type": "Product"
  }
]
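An optional scripted check confirms the exported fields load back as expected:

$ python3 -c "import json; item = json.load(open('product.json'))[0]; print(item['sku'], item['price'])"
starter-plan 29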
