How to scrape a JSON API with Scrapy

Scraping a JSON API with Scrapy pulls structured records directly from the application's data layer instead of rebuilding the same objects from rendered HTML. That keeps the crawl resilient to markup changes, because the useful data already arrives as arrays, objects, and pagination fields.

Current Scrapy releases start API requests from an async def start() method, while older project code may still use start_requests(); either way, the callback parses the response body with response.json(). A spider can yield item dictionaries from the returned payload and follow the next-page URL or cursor until the endpoint stops advertising more results.

Authenticated APIs often require bearer tokens, CSRF headers, or session cookies in addition to URL parameters. When later requests depend on Scrapy's cookie middleware, current Scrapy docs say to pass cookies through the cookies= argument instead of only sending a raw Cookie header. POST-based APIs should switch to scrapy.http.JsonRequest, and conservative delays help avoid HTTP 429 responses or short-lived bans.

Steps to scrape a JSON API with Scrapy:

  1. Create a new Scrapy project for the API crawl.
    $ scrapy startproject api_scrape
    New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
        /srv/api_scrape
    
    You can start your first spider with:
        cd api_scrape
        scrapy genspider example example.com
  2. Change into the project directory.
    $ cd api_scrape
  3. Generate a spider file for the API host.
    $ scrapy genspider products api.example.net
    Created spider 'products' using template 'basic' in module:
      api_scrape.spiders.products
  4. Replace the generated spider with JSON parsing, auth-header reuse, and next-page handling.
    api_scrape/spiders/products.py
    from os import environ
    from urllib.parse import urlencode
     
    import scrapy
     
     
    class ProductsSpider(scrapy.Spider):
        name = "products"
        allowed_domains = ["api.example.net"]
        api_endpoint = "https://api.example.net/v1/products"
        per_page = 100
        custom_settings = {
            "DOWNLOAD_DELAY": 1.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
            "AUTOTHROTTLE_ENABLED": True,
            "AUTOTHROTTLE_START_DELAY": 1.0,
            "AUTOTHROTTLE_MAX_DELAY": 10.0,
            "FEED_EXPORT_ENCODING": "utf-8",
        }
     
        def api_headers(self):
            headers = {"Accept": "application/json"}
            token = environ.get("EXAMPLE_API_TOKEN")
            if token:
                headers["Authorization"] = f"Bearer {token}"
            return headers
     
        async def start(self):
            params = {"page": 1, "per_page": self.per_page}
            yield scrapy.Request(
                url=f"{self.api_endpoint}?{urlencode(params)}",
                headers=self.api_headers(),
                callback=self.parse,
            )
     
        def parse(self, response):
            payload = response.json()
     
            for row in payload.get("products", []):
                yield {
                    "id": row.get("id"),
                    "name": row.get("name"),
                    "price": row.get("price"),
                    "currency": row.get("currency"),
                    "url": row.get("url"),
                }
     
            next_url = payload.get("next")
            if next_url:
                yield response.follow(
                    next_url,
                    headers=self.api_headers(),
                    callback=self.parse,
                )

    The example expects product records under the "products" key and the next-page URL under "next". Some APIs return a cursor token instead of a full URL; in that case the next request should rebuild the URL or JSON body from the cursor.
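Cursor handling can be sketched with a small helper that the parse callback would call before yielding the follow-up request. The next_cursor key name is an assumption; real APIs vary:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.example.net/v1/products"  # hypothetical endpoint


def next_page_url(payload, per_page=100):
    """Rebuild the follow-up request URL from a cursor-style payload.

    Returns None when the API stops returning a cursor, which is the
    signal to end the crawl.
    """
    cursor = payload.get("next_cursor")  # assumed key name
    if not cursor:
        return None
    return f"{API_ENDPOINT}?{urlencode({'cursor': cursor, 'per_page': per_page})}"
```

Inside parse(), the spider would then yield response.follow(next_page_url(payload), ...) only when the helper returns a URL.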

  5. Run the spider and overwrite the JSON export file for the current crawl.
    $ scrapy crawl products -O products.json
    2026-04-22 07:18:11 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: api_scrape)
    ##### snipped #####
    2026-04-22 07:18:26 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: products.json
    2026-04-22 07:18:26 [scrapy.core.engine] INFO: Spider closed (finished)

    The -O flag overwrites any existing products.json and writes one complete JSON array when the crawl finishes cleanly. Use JSON Lines when repeated runs should append or stream items instead of replacing the whole file.
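Each line of a JSON Lines export is one standalone JSON object, which is what makes appending and streaming safe. A quick illustration of consuming such a file line by line; the two sample records are made up:

```python
import json

# Two records as they would appear in a .jsonl feed export, one per line.
sample = (
    '{"id": "p-0001", "name": "Starter Plan"}\n'
    '{"id": "p-0002", "name": "Team Plan"}\n'
)

# Decode each non-empty line independently; a truncated final line would
# fail on its own without invalidating the records before it.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
```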

  6. Parse the saved file once before handing it to another tool.
    $ python3 -c "import json; print(len(json.load(open('products.json', encoding='utf-8'))))"
    4

    A numeric count confirms that the exported file closed as valid JSON instead of a truncated partial array.

  7. Review the exported records before wiring the crawl into a larger pipeline.
    $ cat products.json
    [
    {"id": "p-0001", "name": "Starter Plan", "price": 29, "currency": "USD", "url": "https://api.example.net/products/starter-plan"},
    {"id": "p-0002", "name": "Team Plan", "price": 79, "currency": "USD", "url": "https://api.example.net/products/team-plan"},
    {"id": "p-0003", "name": "Growth Plan", "price": 129, "currency": "USD", "url": "https://api.example.net/products/growth-plan"},
    {"id": "p-0004", "name": "Scale Plan", "price": 249, "currency": "USD", "url": "https://api.example.net/products/scale-plan"}
    ]