Scraping data from a JSON API avoids brittle HTML parsing and delivers structured records that flow cleanly into analytics, monitoring, or content pipelines.
A Scrapy spider sends HTTP requests, parses each JSON response body, and turns every record into an item by yielding a dictionary. Pagination is handled by scheduling follow-up requests until the API indicates there are no more pages.
API endpoints frequently enforce authentication, strict rate limits, and pagination contracts that vary between providers, so spider logic and throttling settings must match the documented behavior to avoid HTTP 401/403 failures, HTTP 429 throttling, or temporary blocks.
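That request, parse, and paginate loop fits in a few lines. The snippet below is only a minimal illustration and assumes a hypothetical payload that exposes a products array and a next key holding the next-page URL; the full page-number-based spider is built step by step in this guide.

import scrapy

class MinimalApiSpider(scrapy.Spider):
    name = "minimal_api"
    start_urls = ["http://api.example.net:8000/api/catalog?page=1"]

    def parse(self, response):
        payload = response.json()  # parse the JSON response body (Scrapy 2.2+)
        for row in payload.get("products", []):
            yield row  # each JSON record becomes an item
        next_url = payload.get("next")  # hypothetical next-page URL key
        if next_url:
            yield response.follow(next_url, callback=self.parse)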
Related: How to scrape a GraphQL API with Scrapy
Related: How to export Scrapy items to JSON
Steps to scrape a JSON API with Scrapy:
- Fetch a sample response from the JSON API endpoint.
$ curl -s 'http://api.example.net:8000/api/catalog?page=1' | head -n 20
{
  "products": [
    {
      "id": "p-0001",
      "name": "Starter Plan",
      "price": 29,
      "url": "http://app.internal.example:8000/products/starter-plan.html"
    }
##### snipped #####
- Identify the JSON keys used for item data and pagination.
Common patterns include products or items for the item array and next / links.next for the next-page URL; the sketch below shows one way to list the keys from a saved copy of the response.
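Assuming the full response has been saved to a local file (sample.json is a hypothetical name, for example via curl -s '...' -o sample.json), the keys can be listed with a few lines of Python:

import json

# Load a saved copy of the full API response (hypothetical filename).
with open("sample.json", encoding="utf-8") as fh:
    payload = json.load(fh)

print(list(payload.keys()))                  # top-level keys, e.g. ['products', 'next']
first = (payload.get("products") or [{}])[0]
print(list(first.keys()))                    # fields available on each record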
- Create a new Scrapy project.
$ scrapy startproject api_scrape
New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
    /root/sg-work/api_scrape

You can start your first spider with:
    cd api_scrape
    scrapy genspider example example.com
- Change into the project directory.
$ cd api_scrape
- Generate a new spider for the API host.
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  api_scrape.spiders.products
- Edit the spider to scrape items from the API JSON response.
- api_scrape/spiders/products.py
import json
from urllib.parse import urlencode

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_endpoint = "http://api.example.net:8000/api/catalog"
    per_page = 3

    def start_requests(self):
        # Request the first page of the catalog as JSON.
        page = 1
        params = {"page": page, "per_page": self.per_page}
        url = f"{self.api_endpoint}?{urlencode(params)}"
        headers = {"Accept": "application/json"}
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse,
            meta={"page": page},
        )

    def parse(self, response):
        page = int(response.meta.get("page", 1))
        payload = json.loads(response.text)
        rows = payload.get("products", [])

        # Yield one item per record in the products array.
        for row in rows:
            yield {
                "id": row.get("id"),
                "name": row.get("name"),
                "price": row.get("price"),
                "url": row.get("url"),
            }

        # Stop paginating once the API returns an empty page.
        if not rows:
            return

        next_page = page + 1
        params = {"page": next_page, "per_page": self.per_page}
        next_url = f"{self.api_endpoint}?{urlencode(params)}"
        headers = {"Accept": "application/json"}
        yield scrapy.Request(
            url=next_url,
            headers=headers,
            callback=self.parse,
            meta={"page": next_page},
        )
Adjust api_endpoint, the products key, and the yielded fields to match the API response, then add any required headers such as Authorization when the endpoint is authenticated (see the sketch below).
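For example, a bearer-token header could be passed to both requests in the spider. This is a hypothetical sketch that assumes the token is supplied through an API_TOKEN environment variable; use whatever scheme the API documents.

import os

# Hypothetical bearer-token header; adjust to the API's documented auth scheme.
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {os.environ.get('API_TOKEN', '')}",
}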
- Set a conservative download delay in settings.py.
- api_scrape/settings.py
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
Aggressive request rates commonly trigger HTTP 429 throttling or temporary IP blocks on public APIs; the optional retry settings sketched below help the crawl recover from intermittent 429 responses.
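The retry settings below are an optional sketch for api_scrape/settings.py; the values are illustrative and should follow the API provider's documented limits.

RETRY_ENABLED = True
RETRY_TIMES = 3                                # retry each failed request up to 3 times
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]   # retry throttled (429) and transient server errors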
- Run the spider to export scraped items to a JSON file.
$ scrapy crawl products -O products.json
##### snipped #####
[scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: products.json
[scrapy.core.engine] INFO: Closing spider (finished)
{'downloader/request_count': 4, 'item_scraped_count': 5}
Use -O to overwrite an existing output file, or -o to append; the same export can also be configured in settings.py, as sketched below.
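Instead of passing -O on every run, the export can be configured once in api_scrape/settings.py with Scrapy's FEEDS setting (the overwrite option requires a reasonably recent Scrapy release); a sketch:

FEEDS = {
    "products.json": {
        "format": "json",       # JSON array output, matching -O products.json
        "encoding": "utf8",
        "overwrite": True,      # behave like -O rather than -o
    },
}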
- Verify the output file contains scraped items.
$ python3 -c 'import json; print(len(json.load(open("products.json", encoding="utf-8"))))'
5
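To spot-check the scraped fields themselves, the short snippet below prints the first record from the same products.json file:

import json

with open("products.json", encoding="utf-8") as fh:
    items = json.load(fh)

print(items[0])   # e.g. {'id': 'p-0001', 'name': 'Starter Plan', 'price': 29, ...}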
