Many modern websites publish key page metadata as JSON-LD for search engines and rich previews. Pulling those fields directly avoids brittle HTML scraping and yields consistent values for titles, publication dates, authors, and canonical URLs across different templates.

Most JSON-LD appears inside one or more <script type="application/ld+json"> elements embedded in the HTML. Scrapy can select those script blocks with a CSS selector, parse the JSON payload, and extract the target object by its schema.org type such as @type=NewsArticle or @type=Product.
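At its core the extraction is just json.loads plus an @type check; a minimal sketch outside Scrapy, using an inline sample payload invented for illustration:

```python
import json

# A sample application/ld+json payload (hypothetical, for illustration).
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Starter Plan",
  "sku": "starter-plan"
}
"""

data = json.loads(raw)
if data.get("@type") == "Product":
    item = {"name": data.get("name"), "sku": data.get("sku")}
    print(item)  # {'name': 'Starter Plan', 'sku': 'starter-plan'}
```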

Pages often include multiple application/ld+json blocks, arrays of objects, or an @graph section that nests several objects in one payload. Some sites also emit malformed JSON due to templating, so parsing must tolerate json.JSONDecodeError and skip bad blocks instead of failing the crawl.
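All three payload shapes, plus malformed JSON, can be funneled through one normalizing helper; a sketch with invented sample blocks:

```python
import json
from typing import Any, Dict, Iterable


def iter_jsonld_objects(raw: str) -> Iterable[Dict[str, Any]]:
    """Yield every JSON-LD object in a raw block, tolerating bad JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return  # skip malformed blocks instead of failing the crawl
    if isinstance(data, dict) and isinstance(data.get("@graph"), list):
        data = data["@graph"]  # unwrap @graph payloads
    if isinstance(data, dict):
        data = [data]  # single object -> one-element list
    for node in data if isinstance(data, list) else []:
        if isinstance(node, dict):
            yield node


blocks = [
    '{"@type": "Organization", "name": "Acme"}',            # single object
    '[{"@type": "BreadcrumbList"}, {"@type": "Product"}]',  # array of objects
    '{"@graph": [{"@type": "WebSite"}]}',                   # @graph wrapper
    '{"@type": "Product",,}',                               # malformed: skipped
]
objects = [obj for raw in blocks for obj in iter_jsonld_objects(raw)]
print([obj.get("@type") for obj in objects])
# ['Organization', 'BreadcrumbList', 'Product', 'WebSite']
```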

Steps to extract JSON-LD data with Scrapy:

  1. Start the Scrapy shell for the target URL.
    $ scrapy shell http://app.internal.example:8000/jsonld/
  2. Extract all application/ld+json blocks from the response.
    >>> blocks = response.css('script[type="application/ld+json"]::text').getall()
    >>> len(blocks)
    1

    When a page carries several blocks, they commonly include BreadcrumbList, Organization, and the primary content type; this example page emits only one.

  3. Preview a JSON-LD block to confirm the expected @type is present.
    >>> blocks[0].strip().replace("\n", "")[:160]
    '{  "@context": "https://schema.org",  "@type": "Product",  "name": "Starter Plan",  "sku": "starter-plan",  "offers": {    "@type": "Offer",    "price": "29",  '
  4. Add JSON-LD extraction logic to a spider.
    import json
    from json import JSONDecodeError
    from typing import Any, Dict, Iterable, Optional, Set
     
    import scrapy
     
     
    class JsonldProductSpider(scrapy.Spider):
        name = "jsonld_product"
        start_urls = ["http://app.internal.example:8000/jsonld/"]
     
        target_types: Set[str] = {"Product"}
     
        def parse(self, response: scrapy.http.Response) -> Iterable[Dict[str, Any]]:
            blocks = response.css('script[type="application/ld+json"]::text').getall()
            for raw in blocks:
                for obj in self._iter_jsonld_objects(raw):
                    if not self._is_target_type(obj):
                        continue
     
                    offers = obj.get("offers") or {}
                    if isinstance(offers, list):
                        # "offers" may be a list of Offer objects; use the first
                        offers = offers[0] if offers and isinstance(offers[0], dict) else {}
                    if not isinstance(offers, dict):
                        offers = {}
                    yield {
                        "name": obj.get("name"),
                        "sku": obj.get("sku"),
                        "price": offers.get("price"),
                        "currency": offers.get("priceCurrency"),
                        "url": response.url,
                        "jsonld_type": obj.get("@type"),
                    }
                    # Stop after the first matching product on the page.
                    return
     
        def _iter_jsonld_objects(self, raw: str) -> Iterable[Dict[str, Any]]:
            try:
                data: Any = json.loads(raw.strip())
            except JSONDecodeError:
                self.logger.warning("Skipping malformed JSON-LD block")
                return
     
            if isinstance(data, dict) and isinstance(data.get("@graph"), list):
                for node in data["@graph"]:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, list):
                for node in data:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, dict):
                yield data
     
        def _is_target_type(self, obj: Dict[str, Any]) -> bool:
            value: Any = obj.get("@type")
            types: Set[str] = set()
     
            if isinstance(value, str):
                types.add(value)
     
            if isinstance(value, list):
                for v in value:
                    if isinstance(v, str):
                        types.add(v)
     
            return bool(types & self.target_types)
     
        def _author_name(self, author: Any) -> Optional[str]:
            """Normalize a JSON-LD "author" value (string, object, or list) to a name.

            Not used by the Product flow above; handy when targeting article types.
            """
            candidate: Any = author
     
            if isinstance(candidate, list) and candidate:
                candidate = candidate[0]
     
            if isinstance(candidate, dict):
                name = candidate.get("name")
                if isinstance(name, str):
                    return name
     
            if isinstance(candidate, str):
                return candidate
     
            return None

    An unhandled json.JSONDecodeError aborts the parse callback for that response, losing every item it would have yielded; catching the error and skipping the bad block keeps extraction going.

  5. Run the spider with JSON feed export enabled.
    $ scrapy crawl jsonld_product -O product.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
    2026-01-01 09:12:27 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json

    The -O option overwrites the output file if it already exists.

  6. Inspect the exported item to confirm the extracted fields.
    $ cat product.json
    [
      {
        "name": "Starter Plan",
        "sku": "starter-plan",
        "price": "29",
        "currency": "USD",
        "url": "http://app.internal.example:8000/jsonld/",
        "jsonld_type": "Product"
      }
    ]
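A quick post-run check can also be scripted; this sketch validates a JSON feed against the field names the spider yields (the required-field set is an assumption based on the item dict above):

```python
import json

# Fields the spider above is expected to emit per item (assumed).
REQUIRED_FIELDS = {"name", "sku", "price", "currency", "url", "jsonld_type"}


def validate_feed(path: str) -> int:
    """Return the item count of a JSON feed, raising if any fields are missing."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)
    for item in items:
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"item missing fields: {sorted(missing)}")
    return len(items)
```

For the run above, validate_feed("product.json") would return 1.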