Scraping a GraphQL endpoint extracts structured data from modern web apps without brittle HTML parsing. Pulling products, listings, or search results directly from the API reduces breakage from front-end redesigns and keeps the crawl focused on the data that matters.
Most GraphQL implementations expose a single HTTP endpoint that accepts a JSON POST body containing a query document plus optional variables. The response is JSON and typically includes a data object for the requested fields, plus an errors array when validation fails or a resolver raises an error.
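As a sketch of that request/response shape (the field names and sample reply below are hypothetical, not from a real API), the POST body is a query document plus variables, and the reply carries the result under data:

```python
import json

# The POST body: a query document plus optional variables.
payload = {
    "query": "query Products($limit: Int!) { products(limit: $limit) { id name } }",
    "variables": {"limit": 2},
}
body = json.dumps(payload)  # serialized and sent as the JSON request body

# A typical JSON reply: "data" on success, "errors" on failure.
reply = json.loads('{"data": {"products": [{"id": "p-1", "name": "A"}]}}')
print(reply["data"]["products"][0]["name"])  # -> A
```

The same two keys, query and variables, appear in the spider payload later in this guide.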
GraphQL requests frequently require browser-matching headers, session cookies, CSRF tokens, or bearer tokens to succeed. Keep selection sets minimal (request only the fields you need) and prefer pagination via variables (limit, cursor) to avoid large payloads and rate limiting. Always check for a populated errors array even when the HTTP status is 200.
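The pagination-and-error-checking pattern can be sketched without a network call; the canned page shapes below (items, nextCursor) are hypothetical stand-ins for whatever connection shape the target API uses:

```python
import json

# Canned responses simulating two pages of a cursor-paginated field.
pages = [
    '{"data": {"products": {"items": [{"id": "p-1"}], "nextCursor": "c1"}}}',
    '{"data": {"products": {"items": [{"id": "p-2"}], "nextCursor": null}}}',
]

cursor = None
items = []
for raw in pages:
    data = json.loads(raw)
    # GraphQL often returns HTTP 200 even on failure; check "errors" explicitly.
    if data.get("errors"):
        break
    page = data["data"]["products"]
    items.extend(page["items"])
    cursor = page["nextCursor"]  # would be passed as the next request's variable
    if cursor is None:           # null cursor signals the last page
        break

print([item["id"] for item in items])  # -> ['p-1', 'p-2']
```

In a real spider each iteration would be a new scrapy.Request with the updated cursor in variables.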
Related: How to scrape a JSON API with Scrapy
Related: How to set request headers in Scrapy
Locate the POST request to /graphql (or similar), copy the query and variables payload, and note required headers such as Authorization, Cookie, or X-CSRF-Token.
Captured session cookies and auth tokens grant access to the same account context as the browser session and must be treated like passwords.
$ mkdir -p simplifiedguide/spiders
$ vi simplifiedguide/spiders/catalog_api.py
import json
import os

import scrapy


class CatalogApiSpider(scrapy.Spider):
    name = "catalog_api"
    allowed_domains = ["api.example.net"]
    api_url = "http://api.example.net:8000/graphql"

    def start_requests(self):
        # Query document plus variables, sent as a JSON POST body.
        payload = {
            "query": """
                query Products($limit: Int!) {
                    products(limit: $limit) {
                        id
                        name
                        price
                    }
                }
            """.strip(),
            "variables": {"limit": 3},
        }
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json",
        }
        # Attach a bearer token only when one is configured.
        token = os.environ.get("EXAMPLE_API_TOKEN")
        if token:
            headers["Authorization"] = f"Bearer {token}"
        yield scrapy.Request(
            url=self.api_url,
            method="POST",
            body=json.dumps(payload),
            headers=headers,
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        # GraphQL can return errors alongside an HTTP 200 response.
        errors = data.get("errors") or []
        for error in errors:
            self.logger.warning("GraphQL error: %s", error)
        products = data.get("data", {}).get("products") or []
        for product in products:
            yield {
                "id": product.get("id"),
                "name": product.get("name"),
                "price": product.get("price"),
            }
Add "operationName" to the payload when the endpoint requires it, and mirror any required headers from the browser request when authentication or CSRF checks are enforced.
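When the copied query document contains more than one named operation, "operationName" tells the server which one to execute; a minimal sketch (the operation names here are hypothetical):

```python
import json

# A document with two named operations; "operationName" selects which runs.
payload = {
    "query": "query Products { products { id } } query Me { me { id } }",
    "operationName": "Products",
    "variables": {},
}
body = json.dumps(payload)  # POST this exactly as in the spider above
```

Omitting operationName from a multi-operation document typically produces a validation error in the errors array rather than an HTTP failure.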
$ scrapy crawl catalog_api -O products.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
2026-01-01 09:10:57 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: products.json
$ head -n 5 products.json
[
{"id": "p-0001", "name": "Starter Plan", "price": 29},
{"id": "p-0002", "name": "Team Plan", "price": 79},
{"id": "p-0003", "name": "Enterprise Plan", "price": 199}
]