Scraping data from a JSON API avoids brittle HTML parsing and delivers structured records that flow cleanly into analytics, monitoring, or content pipelines.
A Scrapy spider sends HTTP requests, parses each JSON response body, and turns every record into an item by yielding a dictionary. Pagination is handled by scheduling follow-up requests until the API indicates there are no more pages.
API endpoints frequently enforce authentication, strict rate limits, and pagination contracts that vary between providers, so spider logic and throttling settings must match the documented behavior to avoid HTTP 401/403 failures, HTTP 429 throttling, or temporary blocks.
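That request, parse, and paginate loop fits in a few lines. The snippet below is only a minimal illustration and assumes a hypothetical payload that exposes a products array and a next key holding the next-page URL; the full page-number-based spider is built step by step in this guide.

import scrapy

class MinimalApiSpider(scrapy.Spider):
    name = "minimal_api"
    start_urls = ["http://api.example.net:8000/api/catalog?page=1"]

    def parse(self, response):
        payload = response.json()  # parse the JSON response body (Scrapy 2.2+)
        for row in payload.get("products", []):
            yield row  # each JSON record becomes an item
        next_url = payload.get("next")  # hypothetical next-page URL key
        if next_url:
            yield response.follow(next_url, callback=self.parse)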
Related: How to scrape a GraphQL API with Scrapy
Related: How to export Scrapy items to JSON
Steps to scrape a JSON API with Scrapy:
- Fetch a sample response from the JSON API endpoint.
$ curl -s 'http://api.example.net:8000/api/catalog?page=1' | head -n 20
{
  "products": [
    {
      "id": "p-0001",
      "name": "Starter Plan",
      "price": 29,
      "url": "http://app.internal.example:8000/products/starter-plan.html"
    }
##### snipped #####
- Identify the JSON keys used for item data and pagination.
Common patterns include products or items for the item array and next / links.next for the next-page URL; the sketch below shows one way to list the keys from a saved copy of the response.
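Assuming the full response has been saved to a local file (sample.json is a hypothetical name, for example via curl -s '...' -o sample.json), the keys can be listed with a few lines of Python:

import json

# Load a saved copy of the full API response (hypothetical filename).
with open("sample.json", encoding="utf-8") as fh:
    payload = json.load(fh)

print(list(payload.keys()))                  # top-level keys, e.g. ['products', 'next']
first = (payload.get("products") or [{}])[0]
print(list(first.keys()))                    # fields available on each record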
- Create a new Scrapy project.
$ scrapy startproject api_scrape
New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
    /root/sg-work/api_scrape

You can start your first spider with:
    cd api_scrape
    scrapy genspider example example.com
- Change into the project directory.
$ cd api_scrape
- Generate a new spider for the API host.
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  api_scrape.spiders.products
- Edit the spider to scrape items from the API JSON response.
- api_scrape/spiders/products.py
import json
from urllib.parse import urlencode

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_endpoint = "http://api.example.net:8000/api/catalog"
    per_page = 3

    def start_requests(self):
        # Request the first page of the catalog as JSON.
        page = 1
        params = {"page": page, "per_page": self.per_page}
        url = f"{self.api_endpoint}?{urlencode(params)}"
        headers = {"Accept": "application/json"}
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse,
            meta={"page": page},
        )

    def parse(self, response):
        page = int(response.meta.get("page", 1))
        payload = json.loads(response.text)
        rows = payload.get("products", [])

        # Yield one item per record in the products array.
        for row in rows:
            yield {
                "id": row.get("id"),
                "name": row.get("name"),
                "price": row.get("price"),
                "url": row.get("url"),
            }

        # Stop paginating once the API returns an empty page.
        if not rows:
            return

        next_page = page + 1
        params = {"page": next_page, "per_page": self.per_page}
        next_url = f"{self.api_endpoint}?{urlencode(params)}"
        headers = {"Accept": "application/json"}
        yield scrapy.Request(
            url=next_url,
            headers=headers,
            callback=self.parse,
            meta={"page": next_page},
        )
Adjust api_endpoint, the products key, and the yielded fields to match the API response, then add any required headers such as Authorization when the endpoint is authenticated (see the sketch below).
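For example, a bearer-token header could be passed to both requests in the spider. This is a hypothetical sketch that assumes the token is supplied through an API_TOKEN environment variable; use whatever scheme the API documents.

import os

# Hypothetical bearer-token header; adjust to the API's documented auth scheme.
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {os.environ.get('API_TOKEN', '')}",
}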
- Set a conservative download delay in settings.py.
- api_scrape/settings.py
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
Aggressive request rates commonly trigger HTTP 429 throttling or temporary IP blocks on public APIs; the optional retry settings sketched below help the crawl recover from intermittent 429 responses.
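The retry settings below are an optional sketch for api_scrape/settings.py; the values are illustrative and should follow the API provider's documented limits.

RETRY_ENABLED = True
RETRY_TIMES = 3                                # retry each failed request up to 3 times
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]   # retry throttled (429) and transient server errors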
- Run the spider to export scraped items to a JSON file.
$ scrapy crawl products -O products.json
##### snipped #####
[scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: products.json
[scrapy.core.engine] INFO: Closing spider (finished)
{'downloader/request_count': 4, 'item_scraped_count': 5}
Use -O to overwrite an existing output file, or -o to append; the same export can also be configured in settings.py, as sketched below.
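Instead of passing -O on every run, the export can be configured once in api_scrape/settings.py with Scrapy's FEEDS setting (the overwrite option requires a reasonably recent Scrapy release); a sketch:

FEEDS = {
    "products.json": {
        "format": "json",       # JSON array output, matching -O products.json
        "encoding": "utf8",
        "overwrite": True,      # behave like -O rather than -o
    },
}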
- Verify the output file contains scraped items.
$ python3 -c 'import json; print(len(json.load(open("products.json", encoding="utf-8"))))'
5
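To spot-check the scraped fields themselves, the short snippet below prints the first record from the same products.json file:

import json

with open("products.json", encoding="utf-8") as fh:
    items = json.load(fh)

print(items[0])   # e.g. {'id': 'p-0001', 'name': 'Starter Plan', 'price': 29, ...}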
