Many product, article, and organization pages publish their cleanest structured metadata as JSON-LD inside <script type="application/ld+json"> blocks. Pulling that data in Scrapy is often more reliable than rebuilding the same fields from repeated HTML selectors when the site already exposes the schema object directly.
Scrapy can read those blocks with a normal CSS selector because the JSON-LD script text is part of the fetched HTML response. The usual pattern is to collect the raw script bodies with response.css('script[type="application/ld+json"]::text').getall(), decode each block with Python's json module, then yield only the object whose @type matches the target schema.
One response can contain several JSON-LD blocks, a top-level list, or an @graph array, and one malformed block should not stop the crawl. This workflow only sees the HTML returned to Scrapy, so pages that inject JSON-LD in the browser after JavaScript runs need a rendered-response workflow instead of a plain request.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
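The collect-decode-filter pattern can be previewed without Scrapy at all. The sketch below uses only the standard library: an html.parser stand-in plays the role of response.css('script[type="application/ld+json"]::text'), and the sample HTML with its Starter Plan product is invented for illustration:

import json
from html.parser import HTMLParser

# Minimal stand-in for the Scrapy selector: collect the bodies of all
# <script type="application/ld+json"> blocks in a page.
class JsonLdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.blocks.append(data)

# Sample page: three JSON-LD blocks, one of them deliberately malformed.
html = """
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">{"@type": "Product", "name": "Starter Plan"}</script>
<script type="application/ld+json">{not valid json}</script>
"""

collector = JsonLdCollector()
collector.feed(html)

products = []
for raw in collector.blocks:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        continue  # one malformed block must not stop the crawl
    if data.get("@type") == "Product":
        products.append(data)

print(products[0]["name"])  # Starter Plan

In a real spider the same loop runs over the list returned by getall(); only the source of the raw script bodies changes.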
Steps to extract JSON-LD data with Scrapy:
- Open a terminal in the Scrapy project directory.
$ cd /home/user/jsonld_lab
Run the command from the directory that contains scrapy.cfg so scrapy shell and scrapy crawl use the correct project settings and spider names.
- Start scrapy shell with the page that contains the JSON-LD block.
$ scrapy shell 'https://shop.example.com/products/starter-plan' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://shop.example.com/products/starter-plan>
##### snipped #####
>>>
If the browser shows JSON-LD but this response does not, the page is probably adding that schema client-side. Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
- Collect each JSON-LD script block before deciding which one to parse.
>>> blocks = response.css('script[type="application/ld+json"]::text').getall()
>>> len(blocks)
3
Each list entry is the raw script text, so one response can include product data, breadcrumbs, organization metadata, or other schema objects.
- Decode the candidate block and confirm the target schema type before writing the spider logic.
>>> import json
>>> product = json.loads(blocks[1])["@graph"][0]
>>> product["@type"]
'Product'
>>> product["name"]
'Starter Plan'
An @graph payload can contain several schema objects inside one script block, so checking @type first prevents extracting the wrong node.
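Checking @type also has to allow for list values, since a node may declare several schema types at once. A small sketch, with an invented @graph payload for illustration:

import json

# Sample @graph payload; the node names and types are illustrative.
raw = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "BreadcrumbList", "name": "crumbs"},
        {"@type": ["Product", "IndividualProduct"], "name": "Starter Plan"},
    ],
})

def is_target_type(obj, target):
    # @type may be a single string or a list of type names.
    value = obj.get("@type")
    if isinstance(value, str):
        return value == target
    if isinstance(value, list):
        return target in value
    return False

nodes = json.loads(raw)["@graph"]
matches = [node for node in nodes if is_target_type(node, "Product")]
print(matches[0]["name"])  # Starter Plan

A plain equality test against "Product" would miss the second node here, because its @type is a list rather than a string.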
- Replace the spider with JSON-LD extraction logic that filters for Product objects and skips malformed blocks.
- jsonld_lab/spiders/jsonld_product.py
import json
from json import JSONDecodeError

import scrapy


class JsonldProductSpider(scrapy.Spider):
    name = "jsonld_product"
    start_urls = ["https://shop.example.com/products/starter-plan"]

    def parse(self, response):
        for raw in response.css('script[type="application/ld+json"]::text').getall():
            for obj in self.iter_jsonld_objects(raw):
                if not self.is_target_type(obj, "Product"):
                    continue
                offers = obj.get("offers") or {}
                yield {
                    "name": obj.get("name"),
                    "sku": obj.get("sku"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                    "url": response.url,
                    "jsonld_type": obj.get("@type"),
                }

    def iter_jsonld_objects(self, raw):
        try:
            data = json.loads(raw.strip())
        except JSONDecodeError:
            return
        if isinstance(data, dict) and isinstance(data.get("@graph"), list):
            for node in data["@graph"]:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, list):
            for node in data:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, dict):
            yield data

    def is_target_type(self, obj, target):
        value = obj.get("@type")
        if isinstance(value, str):
            return value == target
        if isinstance(value, list):
            return target in value
        return False
Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.
Related: How to create a Scrapy spider
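The decoding helper can be exercised in isolation before running the crawl. This sketch lifts the same logic into a module-level function and feeds it one truncated and one valid block, both invented for the test:

import json
from json import JSONDecodeError

def iter_jsonld_objects(raw):
    # Same decoding logic as the spider method, as a standalone function.
    try:
        data = json.loads(raw.strip())
    except JSONDecodeError:
        return  # malformed block: yield nothing, keep crawling
    if isinstance(data, dict) and isinstance(data.get("@graph"), list):
        yield from (n for n in data["@graph"] if isinstance(n, dict))
    elif isinstance(data, list):
        yield from (n for n in data if isinstance(n, dict))
    elif isinstance(data, dict):
        yield data

broken = '{"@type": "Product", "name": '   # truncated mid-object
valid = '{"@type": "Product", "name": "Starter Plan"}'

print(list(iter_jsonld_objects(broken)))  # []
print(list(iter_jsonld_objects(valid)))   # [{'@type': 'Product', 'name': 'Starter Plan'}]

The broken block produces an empty iteration instead of an exception, which is exactly the behavior the spider relies on.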
- Run the spider and overwrite the JSON export for the current crawl.
$ scrapy crawl jsonld_product --overwrite-output product.json
##### snipped #####
2026-04-22 07:22:41 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
2026-04-22 07:22:41 [scrapy.core.engine] INFO: Spider closed (finished)
--overwrite-output replaces any existing product.json at that path.
Related: How to export Scrapy items to JSON
- Parse the exported file once to confirm the crawl wrote valid JSON with the expected fields.
$ python3 -m json.tool product.json
[
    {
        "name": "Starter Plan",
        "sku": "starter-plan",
        "price": "29",
        "currency": "USD",
        "url": "https://shop.example.com/products/starter-plan",
        "jsonld_type": "Product"
    }
]
A successful parse confirms that the crawl kept valid JSON output even when the page also contained unrelated schema blocks and one malformed JSON-LD script.
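Beyond confirming the file parses, the export can be checked for the expected item fields. A short sketch; the field names follow the spider above, and the sample feed text stands in for reading product.json:

import json

EXPECTED_FIELDS = {"name", "sku", "price", "currency", "url", "jsonld_type"}

def validate_feed(text):
    """Return the items that are missing any expected field."""
    items = json.loads(text)
    return [item for item in items if EXPECTED_FIELDS - set(item)]

# Sample feed matching the export above; in practice, pass the contents
# of product.json instead.
sample = (
    '[{"name": "Starter Plan", "sku": "starter-plan", "price": "29", '
    '"currency": "USD", "url": "https://shop.example.com/products/starter-plan", '
    '"jsonld_type": "Product"}]'
)
print(validate_feed(sample))  # []

An empty list means every exported item carried all six fields; any incomplete item would be returned for inspection.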
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
