Scraping a JSON API with Scrapy pulls structured records directly from the same data layer many web apps already use internally. This avoids brittle HTML selectors when product listings, search results, or catalog data are already exposed as JSON objects and arrays.

A Scrapy spider requests the API endpoint, reads the response body as JSON, yields item dictionaries, and follows the next page until the API stops returning another URL or cursor. The same crawl can then export its items through Scrapy's feed exporter without adding a separate serialization step.

Authentication headers, cursor rules, and rate limits vary between APIs, so the request headers, pagination fields, and delay settings must match the provider's current behavior. POST-based APIs often expect a JSON request body instead of query parameters, and authenticated endpoints should load tokens from settings or environment variables instead of hardcoding secrets into the spider file.
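
Before writing the spider, a quick interactive look at the payload confirms which keys hold the records and the next-page URL. A minimal check with scrapy shell, assuming the example catalog endpoint used in the steps below:

    $ scrapy shell 'https://api.example.net/v1/catalog?page=1&per_page=100'
    >>> payload = response.json()
    >>> sorted(payload.keys())   # confirm where the records and the pagination field live
    >>> payload.get("next")      # typically a URL or cursor, or None on the last page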

Steps to scrape a JSON API with Scrapy:

  1. Create a new Scrapy project for the API crawl.
    $ scrapy startproject api_scrape
    New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
        /srv/api_scrape
    
    You can start your first spider with:
        cd api_scrape
        scrapy genspider example example.com
  2. Change into the project directory.
    $ cd api_scrape
  3. Generate a spider for the API host.
    $ scrapy genspider products api.example.net
    Created spider 'products' using template 'basic' in module:
      api_scrape.spiders.products
  4. Replace the generated spider with JSON parsing and pagination logic.
    api_scrape/spiders/products.py
    from urllib.parse import urlencode
     
    import scrapy
     
     
    class ProductsSpider(scrapy.Spider):
        name = "products"
        allowed_domains = ["api.example.net"]
        api_endpoint = "https://api.example.net/v1/catalog"
        per_page = 100
     
        async def start(self):
            # Request the first catalog page; the Accept header asks the API for JSON explicitly.
            params = {"page": 1, "per_page": self.per_page}
            yield scrapy.Request(
                url=f"{self.api_endpoint}?{urlencode(params)}",
                headers={"Accept": "application/json"},
                callback=self.parse,
            )
     
        def parse(self, response):
            payload = response.json()
     
            for row in payload.get("products", []):
                yield {
                    "id": row.get("id"),
                    "name": row.get("name"),
                    "price": row.get("price"),
                    "currency": row.get("currency"),
                    "url": row.get("url"),
                }
     
            # Keep following the API's next-page URL until pagination runs out.
            next_url = payload.get("next")
            if next_url:
                yield response.follow(
                    next_url,
                    headers={"Accept": "application/json"},
                    callback=self.parse,
                )

    The example expects item data under products and the next-page URL under next. Use scrapy.http.JsonRequest instead of Request when the API expects a JSON request body, and add headers such as Authorization when the endpoint is authenticated.
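
    A minimal sketch of the same first request sent as a POST with a JSON body and a bearer token, as a drop-in replacement for start() above; the /search path and the api_token attribute are illustrative assumptions, not part of the example API:

    from scrapy.http import JsonRequest

    async def start(self):
        # JsonRequest serializes `data` into the request body and sets the
        # Content-Type: application/json header automatically.
        yield JsonRequest(
            url="https://api.example.net/v1/catalog/search",  # hypothetical POST endpoint
            data={"page": 1, "per_page": self.per_page},
            headers={"Authorization": f"Bearer {self.api_token}"},  # token loaded from settings, not hardcoded
            callback=self.parse,
        )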

  5. Add cautious request pacing to the project settings.
    api_scrape/settings.py
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
     
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0
     
    FEED_EXPORT_ENCODING = "utf-8"

    Aggressive request rates often trigger HTTP 429 responses, temporary bans, or truncated datasets when an API enforces per-client quotas.
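
    Keeping the API token out of the spider file can be done by resolving it in the same settings.py; a sketch assuming a hypothetical PRODUCTS_API_TOKEN environment variable:

    import os

    # Hypothetical custom setting, read from the environment at startup.
    API_TOKEN = os.environ.get("PRODUCTS_API_TOKEN", "")

    The spider can then read it with self.settings.get("API_TOKEN") when building the Authorization header, so the secret never lands in version control.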

  6. Run the spider and write the collected items to a JSON file.
    $ scrapy crawl products -O products.json
    ##### snipped #####
    2026-04-16 05:51:23 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: products.json
    2026-04-16 05:51:23 [scrapy.core.engine] INFO: Spider closed (finished)

    -O overwrites an existing file and keeps the result as one valid JSON array when the crawl finishes cleanly.
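
    When repeated runs should append to the dataset instead of replacing it, an append-friendly target such as JSON Lines avoids corrupting the output; Scrapy infers the feed format from the file extension:

    $ scrapy crawl products -o products.jsonl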

  7. Parse the exported file once to confirm the crawl produced valid JSON.
    $ python3 -c "import json; print(len(json.load(open('products.json', encoding='utf-8'))))"
    4

    A numeric count confirms the exporter closed the JSON array correctly and that downstream tools can read the file.

  8. Review the exported records before wiring the crawl into a larger pipeline.
    $ cat products.json
    [
    {"id": "p-0001", "name": "Starter Plan", "price": 29, "currency": "USD", "url": "https://api.example.net/products/starter-plan"},
    {"id": "p-0002", "name": "Team Plan", "price": 79, "currency": "USD", "url": "https://api.example.net/products/team-plan"},
    {"id": "p-0003", "name": "Growth Plan", "price": 129, "currency": "USD", "url": "https://api.example.net/products/growth-plan"},
    {"id": "p-0004", "name": "Scale Plan", "price": 249, "currency": "USD", "url": "https://api.example.net/products/scale-plan"}
    ]
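
    Beyond eyeballing the records, a short check that every record carries the fields the spider yields can catch schema drift before the file feeds another system; check_export.py is a hypothetical helper using the field names from the example spider:

    check_export.py
    import json

    # Load the exported feed and report records that are missing expected fields.
    with open("products.json", encoding="utf-8") as fh:
        records = json.load(fh)

    expected = {"id", "name", "price", "currency", "url"}
    incomplete = [r for r in records if not expected.issubset(r)]
    print(f"{len(records)} records, {len(incomplete)} missing one or more fields")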

Notes

  • response.json() parses the response body as JSON directly and keeps the spider shorter than calling json.loads(response.text) by hand.
  • Recent Scrapy examples use start() as the spider entry point, while older project code may still use start_requests(); a sketch of the older form follows these notes.
  • Use JSON Lines or another append-friendly format when repeated runs need to append records instead of replacing the whole file.
  • Keep API tokens and session headers in settings, environment variables, or a secret store rather than committing them into the spider file.
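
For projects pinned to a Scrapy release that predates the async start() entry point, the same first request can be issued from the classic start_requests() method; a minimal sketch that reuses the imports and attributes of the spider above:

    def start_requests(self):
        # Synchronous generator equivalent of start() on older Scrapy releases.
        params = {"page": 1, "per_page": self.per_page}
        yield scrapy.Request(
            url=f"{self.api_endpoint}?{urlencode(params)}",
            headers={"Accept": "application/json"},
            callback=self.parse,
        )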