Many modern sites publish their cleanest product, article, or organization metadata as JSON-LD, so extracting that payload in Scrapy is often more reliable than rebuilding the same fields from nested HTML selectors.
Scrapy does not need special JSON-LD support to read those blocks because the response body still contains the <script type="application/ld+json"> elements. A normal response.css() selector can pull the raw script text, and Python's json module can decode each block into dictionaries, lists, or an @graph structure before the spider filters for the target schema type.
Pages commonly include several JSON-LD blocks in a single response, and one malformed block should not abort the crawl. Test the selector in scrapy shell first, then keep the spider focused on the expected @type and skip blocks that raise JSONDecodeError so a broken breadcrumb or marketing snippet does not stop the export.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
Steps to extract JSON-LD data with Scrapy:
- Open a terminal in the Scrapy project directory.
$ cd /srv/jsonld_lab
Run the command from the directory that contains scrapy.cfg so Scrapy loads the correct project settings and spider names.
- Start scrapy shell with the page that contains the JSON-LD block.
$ scrapy shell http://app.internal.example:8000/jsonld/ --nolog
[s] Available Scrapy objects:
[s]   response   <200 http://app.internal.example:8000/jsonld/>
##### snipped #####
>>>
- Extract every application/ld+json block from the response before deciding which one to parse.
>>> blocks = response.css('script[type="application/ld+json"]::text').getall()
>>> len(blocks)
3
Each list entry is the raw script text, so multiple results commonly include breadcrumbs, organization data, and the primary content object.
- Preview a candidate block and confirm the target schema type before writing the spider logic.
>>> blocks[1].strip().replace("\n", "")[:190]
'{ "@context": "https://schema.org", "@graph": [ { "@type": "Product", "name": "Starter Plan", "sku": "starter-plan", "offer'
>>> import json
>>> json.loads(blocks[1])["@graph"][0]["@type"]
'Product'
An @graph payload can contain several objects in one script block, so checking the type first prevents extracting the wrong node.
- Replace the spider with JSON-LD extraction logic that filters for the target schema type and skips malformed blocks.
- jsonld_lab/spiders/jsonld_product.py
import json
from json import JSONDecodeError

import scrapy


class JsonldProductSpider(scrapy.Spider):
    name = "jsonld_product"
    start_urls = ["http://app.internal.example:8000/jsonld/"]

    def parse(self, response):
        for raw in response.css('script[type="application/ld+json"]::text').getall():
            for obj in self.iter_jsonld_objects(raw):
                if not self.is_target_type(obj, "Product"):
                    continue
                offers = obj.get("offers") or {}
                yield {
                    "name": obj.get("name"),
                    "sku": obj.get("sku"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                    "url": response.url,
                    "jsonld_type": obj.get("@type"),
                }

    def iter_jsonld_objects(self, raw):
        try:
            data = json.loads(raw.strip())
        except JSONDecodeError:
            return
        if isinstance(data, dict) and isinstance(data.get("@graph"), list):
            for node in data["@graph"]:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, list):
            for node in data:
                if isinstance(node, dict):
                    yield node
            return
        if isinstance(data, dict):
            yield data

    def is_target_type(self, obj, target):
        value = obj.get("@type")
        if isinstance(value, str):
            return value == target
        if isinstance(value, list):
            return target in value
        return False
Skipping JSONDecodeError keeps one broken JSON-LD block from aborting a page that still contains a valid target object elsewhere in the response.
Related: How to create a Scrapy spider
- Run the spider and overwrite the JSON export for the current crawl.
$ scrapy crawl jsonld_product --overwrite-output product.json
##### snipped #####
2026-04-16 05:57:55 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: product.json
2026-04-16 05:57:55 [scrapy.core.engine] INFO: Spider closed (finished)
--overwrite-output replaces any existing product.json at that path.
Related: How to export Scrapy items to JSON
- Parse the exported file once to confirm the crawl wrote valid JSON with the expected fields.
$ python3 -m json.tool product.json
[
    {
        "name": "Starter Plan",
        "sku": "starter-plan",
        "price": "29",
        "currency": "USD",
        "url": "http://app.internal.example:8000/jsonld/",
        "jsonld_type": "Product"
    }
]
A successful parse confirms the export stayed valid even though the page also contained a breadcrumb block and one malformed JSON-LD script.
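Beyond a json.tool pretty-print, a small sketch can check that every exported item carries the fields the spider is expected to yield; the field list and sample data here simply mirror the run above and should be adjusted for other schema types:

```python
import json

# Field names match the spider's yielded item; change for other @type targets.
REQUIRED = {"name", "sku", "price", "currency"}

def missing_fields(items):
    """Return (index, missing-field-set) pairs for incomplete items."""
    return [(i, REQUIRED - item.keys())
            for i, item in enumerate(items)
            if not REQUIRED.issubset(item)]

# Hypothetical export content mirroring product.json from the run above.
exported = json.loads(
    '[{"name": "Starter Plan", "sku": "starter-plan",'
    ' "price": "29", "currency": "USD"}]'
)
print(missing_fields(exported))  # [] means every item has all required fields
```

In practice the `exported` list would come from `json.load(open("product.json"))` after the crawl finishes.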
Notes
- Keep the shell check focused on the raw script blocks first, because selector mistakes are easier to fix there than inside a full crawl run.
- Handle @graph and top-level lists before filtering on @type, because many sites wrap the target object instead of publishing it as a single flat dictionary.
- Adjust the target schema type and exported fields for the object you actually need, such as NewsArticle, Organization, or Recipe.
- If the HTML response contains no matching JSON-LD blocks but the browser does, the site is probably adding them client-side and the workflow needs a rendered response instead of a plain request.
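The @graph note above can be sketched as a standalone normalizer, using the same shape-handling idea as the spider's iter_jsonld_objects method; the sample payloads are illustrative:

```python
import json

def iter_jsonld_objects(raw):
    """Yield every dict from a JSON-LD payload, whether it is a single
    object, a top-level list, or a set of nodes wrapped in @graph."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return  # skip malformed blocks instead of aborting
    if isinstance(data, dict) and isinstance(data.get("@graph"), list):
        data = data["@graph"]
    if isinstance(data, dict):
        data = [data]
    if isinstance(data, list):
        for node in data:
            if isinstance(node, dict):
                yield node

# The three common shapes all normalize to a flat sequence of dicts.
flat = '{"@type": "Organization"}'
graph = '{"@graph": [{"@type": "Product"}, {"@type": "BreadcrumbList"}]}'
broken = '{not json'
print([o["@type"] for o in iter_jsonld_objects(flat)])   # ['Organization']
print([o["@type"] for o in iter_jsonld_objects(graph)])  # ['Product', 'BreadcrumbList']
print(list(iter_jsonld_objects(broken)))                 # []
```

Normalizing the shape first means the @type filter only ever sees plain dictionaries, whatever wrapping the site chose.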
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
