Many product, article, and organization pages publish their cleanest structured metadata as JSON-LD inside <script type="application/ld+json"> blocks. Pulling that data in Scrapy is often more reliable than rebuilding the same fields from repeated HTML selectors when the site already exposes the schema object directly.
Scrapy can read those blocks with a normal CSS selector because the JSON-LD script text is part of the fetched HTML response. The usual pattern is to collect the raw script bodies with response.css('script[type="application/ld+json"]::text').getall(), decode each block with Python's json module, then yield only the object whose @type matches the target schema.
One response can contain several JSON-LD blocks, a top-level list, or an @graph array, and one malformed block should not stop the crawl. This workflow only sees the HTML returned to Scrapy, so pages that inject JSON-LD in the browser after JavaScript runs need a rendered-response workflow instead of a plain request.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
$ cd /home/user/jsonld_lab
Run the command from the directory that contains scrapy.cfg so scrapy shell and scrapy crawl use the correct project settings and spider names.
$ scrapy shell 'https://shop.example.com/products/starter-plan' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://shop.example.com/products/starter-plan>
##### snipped #####
>>>
If the browser shows JSON-LD but this response does not, the page is probably adding that schema client-side. Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
>>> blocks = response.css('script[type="application/ld+json"]::text').getall()
>>> len(blocks)
3
Each list entry is the raw script text, so one response can include product data, breadcrumbs, organization metadata, or other schema objects.
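A quick way to see which block is which before writing the spider is to decode each entry and inspect its @type. The sample script bodies below are hypothetical stand-ins for what getall() returns; a real page's blocks will differ.

```python
import json

# Hypothetical raw script bodies, shaped like the strings getall() returns.
blocks = [
    '{"@type": "BreadcrumbList", "itemListElement": []}',
    '{"@graph": [{"@type": "Product", "name": "Starter Plan"}]}',
    '{"@type": "Organization", "name": "Example Shop"}',
]

labels = []
for raw in blocks:
    data = json.loads(raw)
    # An @graph wrapper carries no top-level @type; report its child types.
    labels.append(data.get("@type") or [n.get("@type") for n in data.get("@graph", [])])

print(labels)  # ['BreadcrumbList', ['Product'], 'Organization']
```

Running this against the real blocks list in scrapy shell shows at a glance which index holds the Product data.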
>>> import json
>>> product = json.loads(blocks[1])["@graph"][0]
>>> product["@type"]
'Product'
>>> product["name"]
'Starter Plan'
An @graph payload can contain several schema objects inside one script block, so checking @type first prevents extracting the wrong node.
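As a minimal sketch of that @type check, the snippet below selects the Product node out of an @graph array; the sample payload is an assumption, not the real page's markup.

```python
import json

# Hypothetical @graph block bundling several schema objects in one script tag.
raw = """{"@context": "https://schema.org", "@graph": [
    {"@type": "BreadcrumbList", "itemListElement": []},
    {"@type": "Product", "name": "Starter Plan", "sku": "starter-plan"}
]}"""

data = json.loads(raw)
# Keep only the node whose @type matches the target schema.
product = next(
    node for node in data.get("@graph", [])
    if node.get("@type") == "Product"
)
print(product["name"])  # Starter Plan
```

Without the @type filter, the same index-based access could just as easily return the BreadcrumbList node.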
import json
from json import JSONDecodeError

import scrapy


class JsonldProductSpider(scrapy.Spider):
    name = "jsonld_product"
    start_urls = ["https://shop.example.com/products/starter-plan"]

    def parse(self, response):
        for raw in response.css('script[type="application/ld+json"]::text').getall():
            for obj in self.iter_jsonld_objects(raw):
                if not self.is_target_type(obj, "Product"):
                    continue
                # "offers" may be absent; fall back to an empty dict.
                offers = obj.get("offers") or {}
                yield {
                    "name": obj.get("name"),
                    "sku": obj.get("sku"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                    "url": response.url,
                    "jsonld_type": obj.get("@type"),
                }

    def iter_jsonld_objects(self, raw):
        """Yield every dict-shaped node from one JSON-LD script body."""
        try:
            data = json.loads(raw.strip())
        except JSONDecodeError:
            return
        if isinstance(data, dict) and isinstance(data.get("@graph"), list):
            for node in data["@graph"]:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, list):
            for node in data:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, dict):
            yield data

    def is_target_type(self, obj, target):
        # @type can be a single string or a list of type names.
        value = obj.get("@type")
        if isinstance(value, str):
            return value == target
        if isinstance(value, list):
            return target in value
        return False
Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.
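The snippet below isolates that skip behavior with one malformed block (a trailing comma) next to a valid one; both sample strings are hypothetical.

```python
import json
from json import JSONDecodeError

# One malformed block mixed with a valid Product block.
blocks = [
    '{"@type": "Product", "name": "Starter Plan",}',  # trailing comma: invalid JSON
    '{"@type": "Product", "name": "Starter Plan"}',
]

parsed = []
for raw in blocks:
    try:
        parsed.append(json.loads(raw.strip()))
    except JSONDecodeError:
        continue  # skip the broken block, keep the rest of the page

print(len(parsed))  # 1
```

Only the valid block survives, which is exactly why the spider yields an item even when the page ships a broken script tag alongside the real one.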
Related: How to create a Scrapy spider
$ scrapy crawl jsonld_product --overwrite-output product.json
##### snipped #####
2026-04-22 07:22:41 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
2026-04-22 07:22:41 [scrapy.core.engine] INFO: Spider closed (finished)
--overwrite-output replaces any existing product.json at that path.
Related: How to export Scrapy items to JSON
$ python3 -m json.tool product.json
[
{
"name": "Starter Plan",
"sku": "starter-plan",
"price": "29",
"currency": "USD",
"url": "https://shop.example.com/products/starter-plan",
"jsonld_type": "Product"
}
]
A successful parse confirms that the crawl kept valid JSON output even when the page also contained unrelated schema blocks and one malformed JSON-LD script.