Scraping a GraphQL API with Scrapy pulls structured records from the same application data layer that powers many product listings, search results, and account dashboards. Querying that layer directly avoids brittle HTML selectors, because the useful data is already exposed as typed fields behind a single API endpoint.
Most GraphQL applications accept a JSON POST request that contains a query document, an optional operationName, and a variables object. A Scrapy spider can send that payload directly with JsonRequest, read the JSON response, and yield item dictionaries from the returned data tree without adding manual JSON serialization boilerplate.
Authentication, CSRF checks, and pagination rules still depend on the target application, so the spider must mirror the required headers, cookies, and variables from a real browser request. Current Scrapy releases also use async def start() for initial spider requests, and GraphQL APIs can return an HTTP 200 response that still contains an errors array, so success checks should inspect the body instead of trusting status code alone.
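For orientation, the body of that POST is just a small JSON object. A minimal sketch of the payload shape follows; the Products operation name and the first variable are placeholders, not part of any real schema:

# Typical GraphQL POST body: one query document, the operation to run,
# and its variables, serialized together as JSON.
payload = {
    "operationName": "Products",
    "query": "query Products($first: Int!) { products(first: $first) { nodes { id name } } }",
    "variables": {"first": 4},
}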
Related: How to scrape a JSON API with Scrapy
Related: How to set request headers in Scrapy
Steps to scrape a GraphQL API with Scrapy:
- Create a new Scrapy project for the API crawl.
$ scrapy startproject graphql_scrape
New Scrapy project 'graphql_scrape', using template directory '##### snipped #####', created in:
    /srv/graphql_scrape

You can start your first spider with:
    cd graphql_scrape
    scrapy genspider example example.com
- Change into the project directory.
$ cd graphql_scrape
- Generate a spider file for the API host.
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  graphql_scrape.spiders.products
- Replace the generated spider code with a version that sends a JsonRequest carrying the GraphQL query, variables, and an optional auth header.
- graphql_scrape/spiders/products.py
from os import environ

import scrapy
from scrapy.http import JsonRequest


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_url = "https://api.example.net/graphql"

    async def start(self):
        token = environ.get("EXAMPLE_API_TOKEN")
        query = """
            query Products($first: Int!) {
              products(first: $first) {
                nodes {
                  id
                  name
                  price
                  currency
                  url
                }
              }
            }
        """.strip()
        headers = {}
        if token:
            headers["Authorization"] = f"Bearer {token}"
        yield JsonRequest(
            url=self.api_url,
            data={
                "operationName": "Products",
                "query": query,
                "variables": {"first": 4},
            },
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        payload = response.json()
        for error in payload.get("errors", []):
            self.logger.warning("GraphQL error: %s", error)
        products = payload.get("data", {}).get("products", {}).get("nodes", [])
        for product in products:
            yield {
                "id": product.get("id"),
                "name": product.get("name"),
                "price": product.get("price"),
                "currency": product.get("currency"),
                "url": product.get("url"),
            }
JsonRequest serializes the data= value into the JSON request body, sets the Content-Type header to application/json, and current Scrapy releases switch the request method to POST automatically when data= is provided.
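The same request built without JsonRequest shows what that saves. A sketch using plain scrapy.Request, with the same placeholder endpoint and a shortened query:

import json

import scrapy

# Equivalent request built by hand: serialize the payload and set the
# JSON content type explicitly, which JsonRequest otherwise handles.
request = scrapy.Request(
    url="https://api.example.net/graphql",
    method="POST",
    body=json.dumps({
        "operationName": "Products",
        "query": "query Products($first: Int!) { products(first: $first) { nodes { id name } } }",
        "variables": {"first": 4},
    }),
    headers={"Content-Type": "application/json"},
)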
Load bearer tokens, CSRF headers, and session cookies from environment variables, settings, or a secret store instead of hardcoding live credentials into the spider file.
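One pattern for keeping credentials out of the spider file is to read them from Scrapy settings first and the process environment second. A minimal sketch; the EXAMPLE_API_TOKEN name is this project's convention, not a Scrapy built-in:

from os import environ

import scrapy


class TokenAwareSpider(scrapy.Spider):
    name = "token_aware"

    def auth_headers(self):
        # A -s EXAMPLE_API_TOKEN=... command-line override or an entry in
        # settings.py takes priority; the process environment is the
        # fallback. Both names are project-specific assumptions.
        token = self.settings.get("EXAMPLE_API_TOKEN") or environ.get("EXAMPLE_API_TOKEN")
        return {"Authorization": f"Bearer {token}"} if token else {}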
- Run the spider and overwrite the JSON export file for the current crawl.
$ scrapy crawl products -O products.json
2026-04-16 06:12:07 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: graphql_scrape)
##### snipped #####
2026-04-16 06:12:10 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: products.json
2026-04-16 06:12:10 [scrapy.core.engine] INFO: Spider closed (finished)
The -O option writes one complete JSON array and replaces any existing file at the same path.
- Count the exported items once to confirm that the JSON file closed cleanly.
$ python3 -c "import json; print(len(json.load(open('products.json', encoding='utf-8'))))"
4
A numeric count confirms that downstream tools can parse the export as valid JSON instead of a truncated partial file.
- Review the exported records before wiring the spider into a larger pipeline.
$ cat products.json
[
{"id": "p-0001", "name": "Starter Plan", "price": 29, "currency": "USD", "url": "https://api.example.net/products/starter-plan"},
{"id": "p-0002", "name": "Team Plan", "price": 79, "currency": "USD", "url": "https://api.example.net/products/team-plan"},
{"id": "p-0003", "name": "Growth Plan", "price": 129, "currency": "USD", "url": "https://api.example.net/products/growth-plan"},
{"id": "p-0004", "name": "Scale Plan", "price": 249, "currency": "USD", "url": "https://api.example.net/products/scale-plan"}
]
Notes
- Add the exact browser-captured operationName, variables, and field selection from the target application when the endpoint rejects simplified sample payloads.
- Pass login state through Request.cookies or the cookie middleware instead of writing a raw Cookie header when the API depends on session cookies; see the cookie sketch after this list.
- Keep the field selection narrow so each response stays small and the crawl only requests the fields that the export actually uses.
- Cursor-based pagination usually means sending a follow-up JsonRequest to the same endpoint with an updated variable such as after or cursor rather than following a new URL; see the pagination sketch after this list.
- Handle populated errors arrays explicitly even when the HTTP status is 200, because partial data and resolver failures can arrive in the same response body; see the error-handling sketch after this list.
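For the session-cookie note, a minimal sketch; the sessionid cookie name and its value are placeholders captured from a logged-in browser session, and the viewer query is an assumed example operation:

from scrapy.http import JsonRequest

# Cookies passed this way flow through Scrapy's cookie middleware, so
# later Set-Cookie headers from the API keep the session current.
request = JsonRequest(
    url="https://api.example.net/graphql",
    data={"query": "{ viewer { id } }"},
    cookies={"sessionid": "captured-session-value"},
)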
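The cursor pagination note maps onto a parse() callback like the following. This is a sketch against ProductsSpider above; it assumes start() also stores the query document on self.query and that the target schema exposes pageInfo { endCursor hasNextPage }:

def parse(self, response):
    payload = response.json()
    products = payload.get("data", {}).get("products", {})
    for product in products.get("nodes", []):
        yield product
    page_info = products.get("pageInfo", {})
    if page_info.get("hasNextPage"):
        # Re-post the same query document with the cursor moved forward
        # instead of following a new URL.
        yield JsonRequest(
            url=self.api_url,
            data={
                "operationName": "Products",
                "query": self.query,
                "variables": {"first": 4, "after": page_info.get("endCursor")},
            },
            callback=self.parse,
        )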
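One way to make the errors check fail loudly instead of silently is to close the crawl when no usable data arrives at all. A sketch, assuming the same payload shape as the spider above:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    payload = response.json()
    errors = payload.get("errors") or []
    if errors and payload.get("data") is None:
        # No usable data at all: stop the crawl rather than exporting
        # an empty file that looks like a successful run.
        raise CloseSpider(f"GraphQL request failed: {errors[0].get('message')}")
    for error in errors:
        # Partial data: log resolver failures but keep the usable nodes.
        self.logger.warning("GraphQL error: %s", error)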
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
