Many modern sites publish their cleanest product, article, or organization metadata as JSON-LD, so extracting that payload in Scrapy is often more reliable than rebuilding the same fields from nested HTML selectors.

Scrapy does not need special JSON-LD support to read those blocks because the response body still contains the <script type="application/ld+json"> elements. A normal response.css() selector can pull the raw script text, and Python's json module can decode each block into dictionaries, lists, or an @graph structure before the spider filters for the target schema type.
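The decoding step can be illustrated without a live Scrapy response at all. This sketch uses only the standard library's html.parser in place of response.css(), and the HTML string is a made-up example:

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True

    def handle_data(self, data):
        # Script contents arrive here as raw text, exactly like ::text in Scrapy.
        if self._in_ldjson:
            self.blocks.append(data)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False


html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Product", "name": "Starter Plan"}'
        '</script></head></html>')
parser = LdJsonExtractor()
parser.feed(html)
data = json.loads(parser.blocks[0])
print(data["name"])  # Starter Plan
```

In a spider the extraction side is replaced by response.css('script[type="application/ld+json"]::text').getall(); only the json.loads() step carries over unchanged.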

Pages commonly include several JSON-LD blocks on the same response, and one malformed block should not abort the crawl. Test the selector in scrapy shell first, then keep the spider focused on the expected @type and skip blocks that raise JSONDecodeError so a broken breadcrumb or marketing snippet does not stop the export.
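That skip-on-error behaviour can be sketched as a small helper; the block contents below are hypothetical stand-ins for what getall() would return:

```python
import json


def decode_blocks(raw_blocks):
    """Decode each raw JSON-LD string, silently skipping malformed blocks."""
    decoded = []
    for raw in raw_blocks:
        try:
            decoded.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # a broken breadcrumb or marketing snippet should not stop the crawl
    return decoded


blocks = [
    '{"@type": "BreadcrumbList"}',
    '{"@type": "Product", "name": "Starter Plan"',  # malformed: missing closing brace
    '{"@type": "Product", "name": "Starter Plan"}',
]
print([b["@type"] for b in decode_blocks(blocks)])  # ['BreadcrumbList', 'Product']
```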

Steps to extract JSON-LD data with Scrapy:

  1. Open a terminal in the Scrapy project directory.
    $ cd /srv/jsonld_lab

    Run the command from the directory that contains scrapy.cfg so Scrapy loads the correct project settings and spider names.

  2. Start scrapy shell with the page that contains the JSON-LD block.
    $ scrapy shell http://app.internal.example:8000/jsonld/ --nolog
    [s] Available Scrapy objects:
    [s]   response   <200 http://app.internal.example:8000/jsonld/>
    ##### snipped #####
    >>>
  3. Extract every application/ld+json block from the response before deciding which one to parse.
    >>> blocks = response.css('script[type="application/ld+json"]::text').getall()
    >>> len(blocks)
    3

    Each list entry is the raw text of one script block; the multiple results here typically cover breadcrumbs, organization data, and the primary content object.

  4. Preview a candidate block and confirm the target schema type before writing the spider logic.
    >>> blocks[1].strip().replace("\n", "")[:190]
    '{        "@context": "https://schema.org",        "@graph": [          {            "@type": "Product",            "name": "Starter Plan",            "sku": "starter-plan",            "offer'
    >>> import json
    >>> json.loads(blocks[1])["@graph"][0]["@type"]
    'Product'

    An @graph payload can contain several objects in one script block, so checking the type first prevents extracting the wrong node.
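The type-first check can be sketched on its own with a hypothetical @graph payload shaped like the block previewed above:

```python
import json

# Made-up payload: one script block wrapping several nodes in an @graph array.
raw = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Product", "name": "Starter Plan", "sku": "starter-plan"},
    {"@type": "BreadcrumbList", "name": "Breadcrumbs"}
  ]
}
"""

data = json.loads(raw)
nodes = data.get("@graph", [data])  # fall back to the object itself when there is no @graph
products = [n for n in nodes if n.get("@type") == "Product"]
print(products[0]["sku"])  # starter-plan
```

Without the @type filter, indexing @graph blindly could just as easily return the breadcrumb node.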

  5. Replace the spider with JSON-LD extraction logic that filters for the target schema type and skips malformed blocks.
    jsonld_lab/spiders/jsonld_product.py
    import json
    from json import JSONDecodeError
     
    import scrapy
     
     
    class JsonldProductSpider(scrapy.Spider):
        name = "jsonld_product"
        start_urls = ["http://app.internal.example:8000/jsonld/"]
     
        def parse(self, response):
            for raw in response.css('script[type="application/ld+json"]::text').getall():
                for obj in self.iter_jsonld_objects(raw):
                    if not self.is_target_type(obj, "Product"):
                        continue
     
                    offers = obj.get("offers") or {}
                    yield {
                        "name": obj.get("name"),
                        "sku": obj.get("sku"),
                        "price": offers.get("price"),
                        "currency": offers.get("priceCurrency"),
                        "url": response.url,
                        "jsonld_type": obj.get("@type"),
                    }
     
        def iter_jsonld_objects(self, raw):
            try:
                data = json.loads(raw.strip())
            except JSONDecodeError:
                return
     
            if isinstance(data, dict) and isinstance(data.get("@graph"), list):
                for node in data["@graph"]:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, list):
                for node in data:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, dict):
                yield data
     
        def is_target_type(self, obj, target):
            value = obj.get("@type")
     
            if isinstance(value, str):
                return value == target
     
            if isinstance(value, list):
                return target in value
     
            return False

    Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.

  6. Run the spider and overwrite the JSON export for the current crawl.
    $ scrapy crawl jsonld_product --overwrite-output product.json
    ##### snipped #####
    2026-04-16 05:57:55 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
    2026-04-16 05:57:55 [scrapy.core.engine] INFO: Spider closed (finished)

    --overwrite-output replaces any existing product.json at that path.

  7. Parse the exported file once to confirm the crawl wrote valid JSON with the expected fields.
    $ python3 -m json.tool product.json
    [
        {
            "name": "Starter Plan",
            "sku": "starter-plan",
            "price": "29",
            "currency": "USD",
            "url": "http://app.internal.example:8000/jsonld/",
            "jsonld_type": "Product"
        }
    ]

    A successful parse confirms the export stayed valid even though the page also contained a breadcrumb block and one malformed JSON-LD script.

Notes

  • Keep the shell check focused on the raw script blocks first, because selector mistakes are easier to fix there than inside a full crawl run.
  • Handle @graph and top-level lists before filtering on @type, because many sites wrap the target object instead of publishing it as a single flat dictionary.
  • Adjust the target schema type and exported fields for the object you actually need, such as NewsArticle, Organization, or Recipe.
  • If the HTML response contains no matching JSON-LD blocks but the browser does, the site is probably adding them client-side and the workflow needs a rendered response instead of a plain request.
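When adjusting the target schema type, keep in mind that JSON-LD allows @type to be either a single string or a list of types. A matcher in the style of the spider's is_target_type handles both shapes; the nodes below are made-up examples:

```python
def is_target_type(obj, target):
    """Match @type whether it is a single string or a list of types."""
    value = obj.get("@type")
    if isinstance(value, str):
        return value == target
    if isinstance(value, list):
        return target in value
    return False  # missing or unexpected @type never matches


print(is_target_type({"@type": "Product"}, "Product"))                         # True
print(is_target_type({"@type": ["Product", "IndividualProduct"]}, "Product"))  # True
print(is_target_type({"@type": "BreadcrumbList"}, "Product"))                  # False
```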