Many product, article, and organization pages publish their cleanest structured metadata as JSON-LD inside <script type="application/ld+json"> blocks. Pulling that data in Scrapy is often more reliable than rebuilding the same fields from repeated HTML selectors when the site already exposes the schema object directly.
Scrapy can read those blocks with a normal CSS selector because the JSON-LD script text is part of the fetched HTML response. The usual pattern is to collect the raw script bodies with response.css('script[type="application/ld+json"]::text').getall(), decode each block with Python's json module, then yield only the object whose @type matches the target schema.
One response can contain several JSON-LD blocks, a top-level list, or an @graph array, and one malformed block should not stop the crawl. This workflow only sees the HTML returned to Scrapy, so pages that inject JSON-LD in the browser after JavaScript runs need a rendered-response workflow instead of a plain request.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
$ cd /home/user/jsonld_lab
Run the command from the directory that contains scrapy.cfg so scrapy shell and scrapy crawl use the correct project settings and spider names.
$ scrapy shell 'https://shop.example.com/products/starter-plan' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://shop.example.com/products/starter-plan>
##### snipped #####
>>>
If the browser shows JSON-LD but this response does not, the page is probably adding that schema client-side. Related: How to scrape a JavaScript-rendered page with Scrapy using Playwright
>>> blocks = response.css('script[type="application/ld+json"]::text').getall()
>>> len(blocks)
3
Each list entry is the raw script text, so one response can include product data, breadcrumbs, organization metadata, or other schema objects.
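A quick way to see which block is which before writing the spider is to decode each entry and inspect its @type. The sample script bodies below are hypothetical stand-ins for what getall() returns; a real page's blocks will differ.

```python
import json

# Hypothetical raw script bodies, shaped like the strings getall() returns.
blocks = [
    '{"@type": "BreadcrumbList", "itemListElement": []}',
    '{"@graph": [{"@type": "Product", "name": "Starter Plan"}]}',
    '{"@type": "Organization", "name": "Example Shop"}',
]

labels = []
for raw in blocks:
    data = json.loads(raw)
    # An @graph wrapper carries no top-level @type; report its child types.
    labels.append(data.get("@type") or [n.get("@type") for n in data.get("@graph", [])])

print(labels)  # ['BreadcrumbList', ['Product'], 'Organization']
```

Running this against the real blocks list in scrapy shell shows at a glance which index holds the Product data.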
>>> import json
>>> product = json.loads(blocks[1])["@graph"][0]
>>> product["@type"]
'Product'
>>> product["name"]
'Starter Plan'
An @graph payload can contain several schema objects inside one script block, so checking @type first prevents extracting the wrong node.
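As a minimal sketch of that @type check, the snippet below selects the Product node out of an @graph array; the sample payload is an assumption, not the real page's markup.

```python
import json

# Hypothetical @graph block bundling several schema objects in one script tag.
raw = """{"@context": "https://schema.org", "@graph": [
    {"@type": "BreadcrumbList", "itemListElement": []},
    {"@type": "Product", "name": "Starter Plan", "sku": "starter-plan"}
]}"""

data = json.loads(raw)
# Keep only the node whose @type matches the target schema.
product = next(
    node for node in data.get("@graph", [])
    if node.get("@type") == "Product"
)
print(product["name"])  # Starter Plan
```

Without the @type filter, the same index-based access could just as easily return the BreadcrumbList node.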
import json
from json import JSONDecodeError

import scrapy


class JsonldProductSpider(scrapy.Spider):
    name = "jsonld_product"
    start_urls = ["https://shop.example.com/products/starter-plan"]

    def parse(self, response):
        for raw in response.css('script[type="application/ld+json"]::text').getall():
            for obj in self.iter_jsonld_objects(raw):
                if not self.is_target_type(obj, "Product"):
                    continue
                # "offers" may be absent; fall back to an empty dict.
                offers = obj.get("offers") or {}
                yield {
                    "name": obj.get("name"),
                    "sku": obj.get("sku"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                    "url": response.url,
                    "jsonld_type": obj.get("@type"),
                }

    def iter_jsonld_objects(self, raw):
        """Yield every dict-shaped node from one JSON-LD script body."""
        try:
            data = json.loads(raw.strip())
        except JSONDecodeError:
            return
        if isinstance(data, dict) and isinstance(data.get("@graph"), list):
            for node in data["@graph"]:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, list):
            for node in data:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, dict):
            yield data

    def is_target_type(self, obj, target):
        # @type can be a single string or a list of type names.
        value = obj.get("@type")
        if isinstance(value, str):
            return value == target
        if isinstance(value, list):
            return target in value
        return False
Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.
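The snippet below isolates that skip behavior with one malformed block (a trailing comma) next to a valid one; both sample strings are hypothetical.

```python
import json
from json import JSONDecodeError

# One malformed block mixed with a valid Product block.
blocks = [
    '{"@type": "Product", "name": "Starter Plan",}',  # trailing comma: invalid JSON
    '{"@type": "Product", "name": "Starter Plan"}',
]

parsed = []
for raw in blocks:
    try:
        parsed.append(json.loads(raw.strip()))
    except JSONDecodeError:
        continue  # skip the broken block, keep the rest of the page

print(len(parsed))  # 1
```

Only the valid block survives, which is exactly why the spider yields an item even when the page ships a broken script tag alongside the real one.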
Related: How to create a Scrapy spider
$ scrapy crawl jsonld_product --overwrite-output product.json
##### snipped #####
2026-04-22 07:22:41 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
2026-04-22 07:22:41 [scrapy.core.engine] INFO: Spider closed (finished)
--overwrite-output replaces any existing product.json at that path.
Related: How to export Scrapy items to JSON
$ python3 -m json.tool product.json
[
{
"name": "Starter Plan",
"sku": "starter-plan",
"price": "29",
"currency": "USD",
"url": "https://shop.example.com/products/starter-plan",
"jsonld_type": "Product"
}
]
A successful parse confirms that the crawl kept valid JSON output even when the page also contained unrelated schema blocks and one malformed JSON-LD script.