Scraping data from a JSON API avoids brittle HTML parsing and delivers structured records that flow cleanly into analytics, monitoring, or content pipelines.
A Scrapy spider sends HTTP requests, parses each JSON response body, and yields one dictionary item per record. Pagination is handled by scheduling follow-up requests until the API signals completion.
API endpoints frequently enforce authentication, strict rate limits, and pagination contracts that vary between providers, so spider logic and throttling settings must match the documented behavior to avoid HTTP 401/403 failures, HTTP 429 throttling, or temporary blocks.
Related: How to scrape a GraphQL API with Scrapy
Related: How to export Scrapy items to JSON
$ curl -s 'http://api.example.net:8000/api/catalog?page=1' | head -n 20
{
  "products": [
    {
      "id": "p-0001",
      "name": "Starter Plan",
      "price": 29,
      "url": "http://app.internal.example:8000/products/starter-plan.html"
    }
##### snipped #####
Common payload patterns include a products or items key for the record array and a next or links.next field for the next-page URL.
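When the API advertises the next page itself, following that link is sturdier than computing page numbers. A minimal sketch of extracting both pieces (the items key and links.next shape here are assumptions; match them to the real payload):

```python
import json

# Hypothetical payload using the items / links.next convention
body = """
{
  "items": [{"id": "p-0001"}, {"id": "p-0002"}],
  "links": {"next": "http://api.example.net:8000/api/catalog?page=2"}
}
"""

payload = json.loads(body)
rows = payload.get("items", [])
# The crawl stops when "next" is missing or null
next_url = (payload.get("links") or {}).get("next")

print(len(rows), next_url)
```

In a spider, next_url would be yielded as a follow-up scrapy.Request instead of printed.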
$ scrapy startproject api_scrape
New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
/root/sg-work/api_scrape
You can start your first spider with:
cd api_scrape
scrapy genspider example example.com
$ cd api_scrape
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  api_scrape.spiders.products
import json
from urllib.parse import urlencode

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_endpoint = "http://api.example.net:8000/api/catalog"
    per_page = 3

    def start_requests(self):
        page = 1
        params = {"page": page, "per_page": self.per_page}
        url = f"{self.api_endpoint}?{urlencode(params)}"
        headers = {"Accept": "application/json"}
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse,
            meta={"page": page},
        )

    def parse(self, response):
        page = int(response.meta.get("page", 1))
        payload = json.loads(response.text)
        rows = payload.get("products", [])
        for row in rows:
            yield {
                "id": row.get("id"),
                "name": row.get("name"),
                "price": row.get("price"),
                "url": row.get("url"),
            }
        if not rows:
            return
        next_page = page + 1
        params = {"page": next_page, "per_page": self.per_page}
        next_url = f"{self.api_endpoint}?{urlencode(params)}"
        headers = {"Accept": "application/json"}
        yield scrapy.Request(
            url=next_url,
            headers=headers,
            callback=self.parse,
            meta={"page": next_page},
        )
Adjust api_endpoint, the products key, and the yielded fields to match the API response, then add required headers such as Authorization when the endpoint is authenticated.
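For a token-protected endpoint, one common scheme is a Bearer token in the Authorization header. A sketch of building such a headers dict (the API_TOKEN variable name is an assumption; the dict would be passed to each scrapy.Request via headers=):

```python
import os

# Hypothetical: the token arrives via an environment variable (name assumed)
token = os.environ.get("API_TOKEN", "example-token")

headers = {
    "Accept": "application/json",
    # Many JSON APIs expect Bearer auth; check the provider's docs
    "Authorization": f"Bearer {token}",
}
```

Keeping the token in the environment rather than in the spider source avoids committing credentials to version control.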
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
Aggressive request rates commonly trigger HTTP 429 throttling or temporary IP blocks on public APIs.
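If 429s still slip through despite throttling, Scrapy's built-in RetryMiddleware can re-issue those requests instead of failing the crawl. A sketch of additional settings.py entries (recent Scrapy versions already include 429 in the default RETRY_HTTP_CODES; listing it explicitly just documents the intent):

```python
# settings.py (sketch): retry throttled and transient server errors
RETRY_ENABLED = True
RETRY_TIMES = 3  # extra attempts per request, after the first
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```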
$ scrapy crawl products -O products.json
##### snipped #####
[scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: products.json
[scrapy.core.engine] INFO: Closing spider (finished)
{'downloader/request_count': 4, 'item_scraped_count': 5}
Use -O to overwrite an existing output file, or -o to append.
$ python3 -c 'import json; print(len(json.load(open("products.json", encoding="utf-8"))))'
5