Many modern sites publish their cleanest product, article, or organization metadata as JSON-LD, so extracting that payload in Scrapy is often more reliable than rebuilding the same fields from nested HTML selectors.

Scrapy does not need special JSON-LD support to read those blocks because the response body still contains the <script type="application/ld+json"> elements. A normal response.css() selector can pull the raw script text, and Python's json module can decode each block into dictionaries, lists, or an @graph structure before the spider filters for the target schema type.
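The decoding step can be illustrated without a live Scrapy response at all. This sketch uses only the standard library's html.parser in place of response.css(), and the HTML string is a made-up example:

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True

    def handle_data(self, data):
        # Script contents arrive here as raw text, exactly like ::text in Scrapy.
        if self._in_ldjson:
            self.blocks.append(data)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False


html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Product", "name": "Starter Plan"}'
        '</script></head></html>')
parser = LdJsonExtractor()
parser.feed(html)
data = json.loads(parser.blocks[0])
print(data["name"])  # Starter Plan
```

In a spider the extraction side is replaced by response.css('script[type="application/ld+json"]::text').getall(); only the json.loads() step carries over unchanged.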

Pages commonly include several JSON-LD blocks on the same response, and one malformed block should not abort the crawl. Test the selector in scrapy shell first, then keep the spider focused on the expected @type and skip blocks that raise JSONDecodeError so a broken breadcrumb or marketing snippet does not stop the export.
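That skip-on-error behaviour can be sketched as a small helper; the block contents below are hypothetical stand-ins for what getall() would return:

```python
import json


def decode_blocks(raw_blocks):
    """Decode each raw JSON-LD string, silently skipping malformed blocks."""
    decoded = []
    for raw in raw_blocks:
        try:
            decoded.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # a broken breadcrumb or marketing snippet should not stop the crawl
    return decoded


blocks = [
    '{"@type": "BreadcrumbList"}',
    '{"@type": "Product", "name": "Starter Plan"',  # malformed: missing closing brace
    '{"@type": "Product", "name": "Starter Plan"}',
]
print([b["@type"] for b in decode_blocks(blocks)])  # ['BreadcrumbList', 'Product']
```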

Steps to extract JSON-LD data with Scrapy:

  1. Open a terminal in the Scrapy project directory.
    $ cd /srv/jsonld_lab

    Run the command from the directory that contains scrapy.cfg so Scrapy loads the correct project settings and spider names.

  2. Start scrapy shell with the page that contains the JSON-LD block.
    $ scrapy shell http://app.internal.example:8000/jsonld/ --nolog
    [s] Available Scrapy objects:
    [s]   response   <200 http://app.internal.example:8000/jsonld/>
    ##### snipped #####
    >>>
  3. Extract every application/ld+json block from the response before deciding which one to parse.
    >>> blocks = response.css('script[type="application/ld+json"]::text').getall()
    >>> len(blocks)
    3

    Each list entry is the raw text of one script block; the multiple results here typically cover breadcrumbs, organization data, and the primary content object.

  4. Preview a candidate block and confirm the target schema type before writing the spider logic.
    >>> blocks[1].strip().replace("\n", "")[:190]
    '{        "@context": "https://schema.org",        "@graph": [          {            "@type": "Product",            "name": "Starter Plan",            "sku": "starter-plan",            "offer'
    >>> import json
    >>> json.loads(blocks[1])["@graph"][0]["@type"]
    'Product'

    An @graph payload can contain several objects in one script block, so checking the type first prevents extracting the wrong node.
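The type-first check can be sketched on its own with a hypothetical @graph payload shaped like the block previewed above:

```python
import json

# Made-up payload: one script block wrapping several nodes in an @graph array.
raw = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Product", "name": "Starter Plan", "sku": "starter-plan"},
    {"@type": "BreadcrumbList", "name": "Breadcrumbs"}
  ]
}
"""

data = json.loads(raw)
nodes = data.get("@graph", [data])  # fall back to the object itself when there is no @graph
products = [n for n in nodes if n.get("@type") == "Product"]
print(products[0]["sku"])  # starter-plan
```

Without the @type filter, indexing @graph blindly could just as easily return the breadcrumb node.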

  5. Replace the spider with JSON-LD extraction logic that filters for the target schema type and skips malformed blocks.
    jsonld_lab/spiders/jsonld_product.py
    import json
    from json import JSONDecodeError
     
    import scrapy
     
     
    class JsonldProductSpider(scrapy.Spider):
        name = "jsonld_product"
        start_urls = ["http://app.internal.example:8000/jsonld/"]
     
        def parse(self, response):
            for raw in response.css('script[type="application/ld+json"]::text').getall():
                for obj in self.iter_jsonld_objects(raw):
                    if not self.is_target_type(obj, "Product"):
                        continue
     
                    offers = obj.get("offers") or {}
                    yield {
                        "name": obj.get("name"),
                        "sku": obj.get("sku"),
                        "price": offers.get("price"),
                        "currency": offers.get("priceCurrency"),
                        "url": response.url,
                        "jsonld_type": obj.get("@type"),
                    }
     
        def iter_jsonld_objects(self, raw):
            try:
                data = json.loads(raw.strip())
            except JSONDecodeError:
                return
     
            if isinstance(data, dict) and isinstance(data.get("@graph"), list):
                for node in data["@graph"]:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, list):
                for node in data:
                    if isinstance(node, dict):
                        yield node
                return
     
            if isinstance(data, dict):
                yield data
     
        def is_target_type(self, obj, target):
            value = obj.get("@type")
     
            if isinstance(value, str):
                return value == target
     
            if isinstance(value, list):
                return target in value
     
            return False

    Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.

  6. Run the spider and overwrite the JSON export for the current crawl.
    $ scrapy crawl jsonld_product --overwrite-output product.json
    ##### snipped #####
    2026-04-16 05:57:55 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
    2026-04-16 05:57:55 [scrapy.core.engine] INFO: Spider closed (finished)

    --overwrite-output replaces any existing product.json at that path.

  7. Parse the exported file once to confirm the crawl wrote valid JSON with the expected fields.
    $ python3 -m json.tool product.json
    [
        {
            "name": "Starter Plan",
            "sku": "starter-plan",
            "price": "29",
            "currency": "USD",
            "url": "http://app.internal.example:8000/jsonld/",
            "jsonld_type": "Product"
        }
    ]

    A successful parse confirms the export stayed valid even though the page also contained a breadcrumb block and one malformed JSON-LD script.

Notes

  • Keep the shell check focused on the raw script blocks first, because selector mistakes are easier to fix there than inside a full crawl run.
  • Handle @graph and top-level lists before filtering on @type, because many sites wrap the target object instead of publishing it as a single flat dictionary.
  • Adjust the target schema type and exported fields for the object you actually need, such as NewsArticle, Organization, or Recipe.
  • If the HTML response contains no matching JSON-LD blocks but the browser does, the site is probably adding them client-side and the workflow needs a rendered response instead of a plain request.
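When adjusting the target schema type, keep in mind that JSON-LD allows @type to be either a single string or a list of types. A matcher in the style of the spider's is_target_type handles both shapes; the nodes below are made-up examples:

```python
def is_target_type(obj, target):
    """Match @type whether it is a single string or a list of types."""
    value = obj.get("@type")
    if isinstance(value, str):
        return value == target
    if isinstance(value, list):
        return target in value
    return False  # missing or unexpected @type never matches


print(is_target_type({"@type": "Product"}, "Product"))                         # True
print(is_target_type({"@type": ["Product", "IndividualProduct"]}, "Product"))  # True
print(is_target_type({"@type": "BreadcrumbList"}, "Product"))                  # False
```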