Many product, article, and organization pages publish their cleanest structured metadata as JSON-LD inside <script type="application/ld+json"> blocks. When a site already exposes the schema object directly, extracting that data with Scrapy is usually more reliable than rebuilding the same fields from a set of fragile HTML selectors.

Scrapy can read those blocks with a normal CSS selector because the JSON-LD script text is part of the fetched HTML response. The usual pattern is to collect the raw script bodies with response.css('script[type="application/ld+json"]::text').getall(), decode each block with Python's json module, then yield only the object whose @type matches the target schema.
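Stripped of the Scrapy plumbing, the pattern reduces to decode-then-filter. The script bodies below are invented stand-ins for what getall() would return from a product page:

```python
import json

# Hypothetical stand-ins for what getall() would return from a product page
blocks = [
    '{"@context": "https://schema.org", "@type": "BreadcrumbList"}',
    '{"@context": "https://schema.org", "@type": "Product", "name": "Starter Plan"}',
]

# Decode every block, then keep only objects of the target schema type
products = [obj for obj in map(json.loads, blocks) if obj.get("@type") == "Product"]
print(products[0]["name"])  # Starter Plan
```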

One response can contain several JSON-LD blocks, a top-level list, or an @graph array, and one malformed block should not stop the crawl. This workflow only sees the HTML returned to Scrapy, so pages that inject JSON-LD in the browser after JavaScript runs need a rendered-response workflow instead of a plain request.

Steps to extract JSON-LD data with Scrapy:

  1. Open a terminal in the Scrapy project directory.
    $ cd /home/user/jsonld_lab

    Run the command from the directory that contains scrapy.cfg so scrapy shell and scrapy crawl use the correct project settings and spider names.

  2. Start scrapy shell with the page that contains the JSON-LD block.
    $ scrapy shell 'https://shop.example.com/products/starter-plan' --nolog
    [s] Available Scrapy objects:
    [s]   response   <200 https://shop.example.com/products/starter-plan>
    ##### snipped #####
    >>>

    If the browser shows JSON-LD but this response does not, the page is probably adding that schema client-side. Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright

  3. Collect each JSON-LD script block before deciding which one to parse.
    >>> blocks = response.css('script[type="application/ld+json"]::text').getall()
    >>> len(blocks)
    3

    Each list entry is the raw script text, so one response can include product data, breadcrumbs, organization metadata, or other schema objects.

  4. Decode the candidate block and confirm the target schema type before writing the spider logic.
    >>> import json
    >>> product = json.loads(blocks[1])["@graph"][0]
    >>> product["@type"]
    'Product'
    >>> product["name"]
    'Starter Plan'

    An @graph payload can contain several schema objects inside one script block, so checking @type first prevents extracting the wrong node.
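Because @type may be either a plain string or a list of types, a filter helper should accept both shapes. A minimal sketch against an invented @graph payload:

```python
import json

# An invented @graph payload; @type can be a plain string or a list of types
raw = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "name": "Example Shop"},
        {"@type": ["Product", "IndividualProduct"], "name": "Starter Plan"},
    ],
})

def is_target_type(obj, target):
    """Accept both shapes JSON-LD allows for @type."""
    value = obj.get("@type")
    if isinstance(value, str):
        return value == target
    return isinstance(value, list) and target in value

nodes = json.loads(raw)["@graph"]
matches = [node["name"] for node in nodes if is_target_type(node, "Product")]
print(matches)  # ['Starter Plan']
```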

  5. Replace the spider with JSON-LD extraction logic that filters for Product objects and skips malformed blocks.
    jsonld_lab/spiders/jsonld_product.py
    import json
    from json import JSONDecodeError
     
    import scrapy
     
     
    class JsonldProductSpider(scrapy.Spider):
        name = "jsonld_product"
        start_urls = ["https://shop.example.com/products/starter-plan"]
     
        def parse(self, response):
            for raw in response.css('script[type="application/ld+json"]::text').getall():
                for obj in self.iter_jsonld_objects(raw):
                    if not self.is_target_type(obj, "Product"):
                        continue
     
                    # schema.org allows offers to be a single Offer object
                    # or a list of them; normalize to one dict before reading
                    offers = obj.get("offers") or {}
                    if isinstance(offers, list):
                        offers = offers[0] if offers else {}
                    if not isinstance(offers, dict):
                        offers = {}
                    yield {
                        "name": obj.get("name"),
                        "sku": obj.get("sku"),
                        "price": offers.get("price"),
                        "currency": offers.get("priceCurrency"),
                        "url": response.url,
                        "jsonld_type": obj.get("@type"),
                    }
     
        def iter_jsonld_objects(self, raw):
            try:
                data = json.loads(raw.strip())
            except JSONDecodeError:
                return
     
            if isinstance(data, dict) and isinstance(data.get("@graph"), list):
                for node in data["@graph"]:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, list):
                for node in data:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, dict):
                yield data
     
        def is_target_type(self, obj, target):
            value = obj.get("@type")
     
            if isinstance(value, str):
                return value == target
     
            if isinstance(value, list):
                return target in value
     
            return False

    Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.
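The skip-on-error behavior can be exercised outside Scrapy as well. The three script bodies below are invented, with the middle one deliberately truncated:

```python
import json
from json import JSONDecodeError

# Three invented script bodies: breadcrumbs, a truncated block, a valid product
blocks = [
    '{"@type": "BreadcrumbList"}',
    '{"@type": "Product", "name": "Broken"',   # malformed: missing closing brace
    '{"@type": "Product", "name": "Starter Plan"}',
]

names = []
for raw in blocks:
    try:
        data = json.loads(raw.strip())
    except JSONDecodeError:
        continue  # one broken block must not abort the rest of the page
    if data.get("@type") == "Product":
        names.append(data["name"])

print(names)  # ['Starter Plan']
```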

  6. Run the spider and overwrite the JSON export for the current crawl.
    $ scrapy crawl jsonld_product --overwrite-output product.json
    ##### snipped #####
    2026-04-22 07:22:41 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
    2026-04-22 07:22:41 [scrapy.core.engine] INFO: Spider closed (finished)

    --overwrite-output replaces any existing product.json at that path.

  7. Parse the exported file once to confirm the crawl wrote valid JSON with the expected fields.
    $ python3 -m json.tool product.json
    [
        {
            "name": "Starter Plan",
            "sku": "starter-plan",
            "price": "29",
            "currency": "USD",
            "url": "https://shop.example.com/products/starter-plan",
            "jsonld_type": "Product"
        }
    ]

    A successful parse confirms that the crawl kept valid JSON output even when the page also contained unrelated schema blocks and one malformed JSON-LD script.
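Beyond confirming the file parses, the required fields can also be checked in Python. The feed content is inlined here, matching the export above, so the sketch runs without the file; with a real export, read product.json instead:

```python
import json

# The feed from step 6, inlined so the check runs without the exported file;
# with a real export use: items = json.load(open("product.json"))
exported = """[
    {
        "name": "Starter Plan",
        "sku": "starter-plan",
        "price": "29",
        "currency": "USD",
        "url": "https://shop.example.com/products/starter-plan",
        "jsonld_type": "Product"
    }
]"""

items = json.loads(exported)
required = {"name", "sku", "price", "currency", "url", "jsonld_type"}
# Collect any items that are missing one of the expected fields
missing = [sorted(required - item.keys()) for item in items if required - item.keys()]
print(len(items), missing)  # 1 []
```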