Scraping data from a JSON API avoids brittle HTML parsing and delivers structured records that flow cleanly into analytics, monitoring, or content pipelines.

A Scrapy spider sends HTTP requests, parses each JSON response body, and yields the records it contains as dictionary items. Pagination is handled by scheduling follow-up requests until the API indicates that no more pages remain.

API endpoints frequently enforce authentication, strict rate limits, and pagination contracts that vary between providers, so spider logic and throttling settings must match the documented behavior to avoid HTTP 401/403 failures, HTTP 429 throttling, or temporary blocks.

Steps to scrape a JSON API with Scrapy:

  1. Fetch a sample response from the JSON API endpoint.
    $ curl -s 'http://api.example.net:8000/api/catalog?page=1' | head -n 20
    {
      "products": [
        {
          "id": "p-0001",
          "name": "Starter Plan",
          "price": 29,
          "url": "http://app.internal.example:8000/products/starter-plan.html"
        }
    ##### snipped #####
  2. Identify the JSON keys used for item data and pagination.

    Common patterns include products or items for arrays and next / links.next for the next-page URL.
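
    When the response exposes the next-page URL directly (for example under links.next), the parse callback can follow that URL instead of counting pages. A minimal sketch, assuming a links.next field that is null on the last page and Scrapy 2.2 or newer for response.json():

    def parse(self, response):
        payload = response.json()
        for row in payload.get("products", []):
            yield row
        # Follow the API-provided next-page URL; stop when it is missing or null.
        next_url = (payload.get("links") or {}).get("next")
        if next_url:
            yield response.follow(next_url, callback=self.parse)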

  3. Create a new Scrapy project.
    $ scrapy startproject api_scrape
    New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
        /root/sg-work/api_scrape
    
    You can start your first spider with:
        cd api_scrape
        scrapy genspider example example.com
  4. Change into the project directory.
    $ cd api_scrape
  5. Generate a new spider for the API host.
    $ scrapy genspider products api.example.net
    Created spider 'products' using template 'basic' in module:
      api_scrape.spiders.products
  6. Edit the spider to scrape items from the API JSON response.
    api_scrape/spiders/products.py
    import json
    from urllib.parse import urlencode
     
    import scrapy
     
     
    class ProductsSpider(scrapy.Spider):
        name = "products"
        allowed_domains = ["api.example.net"]
        api_endpoint = "http://api.example.net:8000/api/catalog"
        per_page = 3
     
        def start_requests(self):
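            # Build the first-page request with an explicit JSON Accept header.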
            page = 1
            params = {"page": page, "per_page": self.per_page}
            url = f"{self.api_endpoint}?{urlencode(params)}"
            headers = {"Accept": "application/json"}
            yield scrapy.Request(
                url=url,
                headers=headers,
                callback=self.parse,
                meta={"page": page},
            )
     
        def parse(self, response):
            page = int(response.meta.get("page", 1))
            payload = json.loads(response.text)
     
            rows = payload.get("products", [])
            for row in rows:
                yield {
                    "id": row.get("id"),
                    "name": row.get("name"),
                    "price": row.get("price"),
                    "url": row.get("url"),
                }
     
            # An empty page means the API has no more results, so stop paginating.
            if not rows:
                return
     
            next_page = page + 1
            params = {"page": next_page, "per_page": self.per_page}
            next_url = f"{self.api_endpoint}?{urlencode(params)}"
            headers = {"Accept": "application/json"}
            yield scrapy.Request(
                url=next_url,
                headers=headers,
                callback=self.parse,
                meta={"page": next_page},
            )

    Adjust api_endpoint, the products key, and the yielded fields to match the API response, then add any required headers such as Authorization when the endpoint is authenticated.
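
    For a bearer-token API, the header can be attached where the request headers are built; a minimal sketch, assuming the token is supplied via a hypothetical API_TOKEN environment variable:

    import os

    headers = {
        "Accept": "application/json",
        # Assumed: bearer token read from the API_TOKEN environment variable.
        "Authorization": f"Bearer {os.environ['API_TOKEN']}",
    }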

  7. Set a conservative download delay in settings.py.
    api_scrape/settings.py
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
     
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0

    Aggressive request rates commonly trigger HTTP 429 throttling or temporary IP blocks on public APIs.
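
    If occasional 429 responses still slip through, Scrapy's built-in retry middleware can be tuned in the same settings.py; the values below are a sketch, not required settings:

    RETRY_ENABLED = True
    RETRY_TIMES = 3
    # Retry throttled and transient server errors before giving up.
    RETRY_HTTP_CODES = [429, 500, 502, 503, 504]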

  8. Run the spider to export scraped items to a JSON file.
    $ scrapy crawl products -O products.json
    ##### snipped #####
    [scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: products.json
    [scrapy.core.engine] INFO: Closing spider (finished)
    {'downloader/request_count': 4, 'item_scraped_count': 5}

    Use -O to overwrite an existing output file, or -o to append.

  9. Verify the output file contains scraped items.
    $ python3 -c 'import json; print(len(json.load(open("products.json", encoding="utf-8"))))'
    5
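
    The exported feed can then be consumed by downstream tooling; a minimal sketch that loads products.json and summarizes the price field yielded above:

    import json

    with open("products.json", encoding="utf-8") as fh:
        products = json.load(fh)

    # Basic sanity check plus a simple aggregate for downstream reporting.
    prices = [p["price"] for p in products if p.get("price") is not None]
    avg = sum(prices) / len(prices) if prices else 0.0
    print(f"{len(products)} products, average price {avg:.2f}")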