Scraping a GraphQL API with Scrapy pulls structured records from the same application data layer that powers many product listings, search results, and account dashboards. Querying that layer directly avoids brittle HTML selectors, because the useful data is already exposed as typed fields behind a single API endpoint.
Most GraphQL applications accept a JSON POST request that contains a query document, an optional operationName, and a variables object. A Scrapy spider can send that payload directly with JsonRequest, read the JSON response, and yield item dictionaries from the returned data tree without adding manual JSON serialization boilerplate.
Authentication, CSRF checks, and pagination rules still depend on the target application, so the spider must mirror the required headers, cookies, and variables from a real browser request. Current Scrapy releases also use async def start() for initial spider requests, and GraphQL APIs can return an HTTP 200 response that still contains an errors array, so success checks should inspect the body instead of trusting status code alone.
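For orientation, the body of that POST is just a small JSON object. A minimal sketch of the payload shape follows; the Products operation name and the first variable are placeholders, not part of any real schema:

# Typical GraphQL POST body: one query document, the operation to run,
# and its variables, serialized together as JSON.
payload = {
    "operationName": "Products",
    "query": "query Products($first: Int!) { products(first: $first) { nodes { id name } } }",
    "variables": {"first": 4},
}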
Related: How to scrape a JSON API with Scrapy
Related: How to set request headers in Scrapy
Steps to scrape a GraphQL API with Scrapy:
- Create a new Scrapy project for the API crawl.
$ scrapy startproject graphql_scrape
New Scrapy project 'graphql_scrape', using template directory '##### snipped #####', created in:
    /srv/graphql_scrape

You can start your first spider with:
    cd graphql_scrape
    scrapy genspider example example.com
- Change into the project directory.
$ cd graphql_scrape
- Generate a spider file for the API host.
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  graphql_scrape.spiders.products
- Replace the generated spider code with a version that sends a JsonRequest carrying the GraphQL query, variables, and an optional auth header.
- graphql_scrape/spiders/products.py
from os import environ

import scrapy
from scrapy.http import JsonRequest


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_url = "https://api.example.net/graphql"

    async def start(self):
        token = environ.get("EXAMPLE_API_TOKEN")
        query = """
            query Products($first: Int!) {
              products(first: $first) {
                nodes {
                  id
                  name
                  price
                  currency
                  url
                }
              }
            }
        """.strip()
        headers = {}
        if token:
            headers["Authorization"] = f"Bearer {token}"
        yield JsonRequest(
            url=self.api_url,
            data={
                "operationName": "Products",
                "query": query,
                "variables": {"first": 4},
            },
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        payload = response.json()
        for error in payload.get("errors", []):
            self.logger.warning("GraphQL error: %s", error)
        products = payload.get("data", {}).get("products", {}).get("nodes", [])
        for product in products:
            yield {
                "id": product.get("id"),
                "name": product.get("name"),
                "price": product.get("price"),
                "currency": product.get("currency"),
                "url": product.get("url"),
            }
JsonRequest serializes the data= value into the JSON request body, sets the Content-Type header to application/json, and current Scrapy releases switch the request method to POST automatically when data= is provided.
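The same request built without JsonRequest shows what that saves. A sketch using plain scrapy.Request, with the same placeholder endpoint and a shortened query:

import json

import scrapy

# Equivalent request built by hand: serialize the payload and set the
# JSON content type explicitly, which JsonRequest otherwise handles.
request = scrapy.Request(
    url="https://api.example.net/graphql",
    method="POST",
    body=json.dumps({
        "operationName": "Products",
        "query": "query Products($first: Int!) { products(first: $first) { nodes { id name } } }",
        "variables": {"first": 4},
    }),
    headers={"Content-Type": "application/json"},
)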
Load bearer tokens, CSRF headers, and session cookies from environment variables, settings, or a secret store instead of hardcoding live credentials into the spider file.
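One pattern for keeping credentials out of the spider file is to read them from Scrapy settings first and the process environment second. A minimal sketch; the EXAMPLE_API_TOKEN name is this project's convention, not a Scrapy built-in:

from os import environ

import scrapy


class TokenAwareSpider(scrapy.Spider):
    name = "token_aware"

    def auth_headers(self):
        # A -s EXAMPLE_API_TOKEN=... command-line override or an entry in
        # settings.py takes priority; the process environment is the
        # fallback. Both names are project-specific assumptions.
        token = self.settings.get("EXAMPLE_API_TOKEN") or environ.get("EXAMPLE_API_TOKEN")
        return {"Authorization": f"Bearer {token}"} if token else {}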
- Run the spider and overwrite the JSON export file for the current crawl.
$ scrapy crawl products -O products.json
2026-04-16 06:12:07 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: graphql_scrape)
##### snipped #####
2026-04-16 06:12:10 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: products.json
2026-04-16 06:12:10 [scrapy.core.engine] INFO: Spider closed (finished)
The -O option writes one complete JSON array and replaces any existing file at the same path.
- Count the exported items once to confirm that the JSON file closed cleanly.
$ python3 -c "import json; print(len(json.load(open('products.json', encoding='utf-8'))))"
4
A numeric count confirms that downstream tools can parse the export as valid JSON instead of a truncated partial file.
- Review the exported records before wiring the spider into a larger pipeline.
$ cat products.json
[
{"id": "p-0001", "name": "Starter Plan", "price": 29, "currency": "USD", "url": "https://api.example.net/products/starter-plan"},
{"id": "p-0002", "name": "Team Plan", "price": 79, "currency": "USD", "url": "https://api.example.net/products/team-plan"},
{"id": "p-0003", "name": "Growth Plan", "price": 129, "currency": "USD", "url": "https://api.example.net/products/growth-plan"},
{"id": "p-0004", "name": "Scale Plan", "price": 249, "currency": "USD", "url": "https://api.example.net/products/scale-plan"}
]
Notes
- Add the exact browser-captured operationName, variables, and field selection from the target application when the endpoint rejects simplified sample payloads.
- Pass login state through Request.cookies or the cookie middleware instead of writing a raw Cookie header when the API depends on session cookies; see the cookie sketch after this list.
- Keep the field selection narrow so each response stays small and the crawl only requests the fields that the export actually uses.
- Cursor-based pagination usually means sending a follow-up JsonRequest to the same endpoint with an updated variable such as after or cursor rather than following a new URL; see the pagination sketch after this list.
- Handle populated errors arrays explicitly even when the HTTP status is 200, because partial data and resolver failures can arrive in the same response body; see the error-handling sketch after this list.
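For the session-cookie note, a minimal sketch; the sessionid cookie name and its value are placeholders captured from a logged-in browser session, and the viewer query is an assumed example operation:

from scrapy.http import JsonRequest

# Cookies passed this way flow through Scrapy's cookie middleware, so
# later Set-Cookie headers from the API keep the session current.
request = JsonRequest(
    url="https://api.example.net/graphql",
    data={"query": "{ viewer { id } }"},
    cookies={"sessionid": "captured-session-value"},
)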
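The cursor pagination note maps onto a parse() callback like the following. This is a sketch against ProductsSpider above; it assumes start() also stores the query document on self.query and that the target schema exposes pageInfo { endCursor hasNextPage }:

def parse(self, response):
    payload = response.json()
    products = payload.get("data", {}).get("products", {})
    for product in products.get("nodes", []):
        yield product
    page_info = products.get("pageInfo", {})
    if page_info.get("hasNextPage"):
        # Re-post the same query document with the cursor moved forward
        # instead of following a new URL.
        yield JsonRequest(
            url=self.api_url,
            data={
                "operationName": "Products",
                "query": self.query,
                "variables": {"first": 4, "after": page_info.get("endCursor")},
            },
            callback=self.parse,
        )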
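One way to make the errors check fail loudly instead of silently is to close the crawl when no usable data arrives at all. A sketch, assuming the same payload shape as the spider above:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    payload = response.json()
    errors = payload.get("errors") or []
    if errors and payload.get("data") is None:
        # No usable data at all: stop the crawl rather than exporting
        # an empty file that looks like a successful run.
        raise CloseSpider(f"GraphQL request failed: {errors[0].get('message')}")
    for error in errors:
        # Partial data: log resolver failures but keep the usable nodes.
        self.logger.warning("GraphQL error: %s", error)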
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
