Scraping a JSON API with Scrapy pulls structured records directly from the same data layer many web apps already use internally. That avoids brittle HTML selectors when product listings, search results, or catalog data are already exposed as objects and arrays.
A Scrapy spider requests the API endpoint, reads the response body as JSON, yields item dictionaries, and follows the next page until the API stops returning another URL or cursor. The same crawl can then export its items through Scrapy's feed exporter without adding a separate serialization step.
Authentication headers, cursor rules, and rate limits vary between APIs, so the request headers, pagination fields, and delay settings must match the provider's current behavior. POST-based APIs often expect a JSON request body instead of query parameters, and authenticated endpoints should load tokens from settings or environment variables instead of hardcoding secrets into the spider file.
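Loading the token from the environment can be sketched with a small helper; the variable name API_TOKEN and the Bearer scheme are assumptions that depend on the provider:

```python
import os


def auth_headers():
    """Build request headers, reading the bearer token from the environment
    instead of hardcoding it in the spider file.

    API_TOKEN is an assumed variable name; adjust to the provider's scheme.
    """
    token = os.environ.get("API_TOKEN", "")
    return {
        "Accept": "application/json",
        "Authorization": f"Bearer {token}",
    }
```

A spider can then pass `headers=auth_headers()` when constructing each request, keeping the secret out of version control.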
Related: How to scrape a GraphQL API with Scrapy
Related: How to export Scrapy items to JSON
Steps to scrape a JSON API with Scrapy:
- Create a new Scrapy project for the API crawl.
$ scrapy startproject api_scrape
New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
    /srv/api_scrape

You can start your first spider with:
    cd api_scrape
    scrapy genspider example example.com
- Change into the project directory.
$ cd api_scrape
- Generate a spider for the API host.
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  api_scrape.spiders.products
- Replace the generated spider with JSON parsing and pagination logic.
- api_scrape/spiders/products.py
from urllib.parse import urlencode

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_endpoint = "https://api.example.net/v1/catalog"
    per_page = 100

    async def start(self):
        params = {"page": 1, "per_page": self.per_page}
        yield scrapy.Request(
            url=f"{self.api_endpoint}?{urlencode(params)}",
            headers={"Accept": "application/json"},
            callback=self.parse,
        )

    def parse(self, response):
        payload = response.json()
        for row in payload.get("products", []):
            yield {
                "id": row.get("id"),
                "name": row.get("name"),
                "price": row.get("price"),
                "currency": row.get("currency"),
                "url": row.get("url"),
            }
        next_url = payload.get("next")
        if next_url:
            yield response.follow(
                next_url,
                headers={"Accept": "application/json"},
                callback=self.parse,
            )
The example expects item data under products and the next-page URL under next. Use scrapy.http.JsonRequest instead of Request when the API expects a JSON request body, and add headers such as Authorization when the endpoint is authenticated.
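For a cursor-based POST API, the pagination decision reduces to building the next request body from the previous response. A minimal sketch, assuming the cursor arrives under a next_cursor field (field names vary by provider); in the spider, the returned dict would be passed as data= to scrapy.http.JsonRequest, which serializes it and sets the JSON Content-Type header:

```python
def next_request_body(payload, per_page=100):
    """Given a decoded API response, return the JSON body for the next
    POST request, or None when the cursor is exhausted.

    "next_cursor", "cursor", and "per_page" are assumed field names;
    real APIs differ.
    """
    cursor = payload.get("next_cursor")
    if not cursor:
        return None
    return {"cursor": cursor, "per_page": per_page}
```

Returning None lets the spider's parse method simply stop yielding requests when the API signals the end of the result set.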
- Add cautious request pacing to the project settings.
- api_scrape/settings.py
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
FEED_EXPORT_ENCODING = "utf-8"
Aggressive request rates often trigger HTTP 429 responses, temporary bans, or truncated datasets when an API enforces per-client quotas.
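Retrying throttled responses complements the pacing settings. A sketch of illustrative additions to the project settings; recent Scrapy releases already include 429 in the default RETRY_HTTP_CODES, so listing it explicitly mainly documents the intent:

```python
# api_scrape/settings.py (illustrative retry additions)
RETRY_ENABLED = True
RETRY_TIMES = 3  # retries per request on top of the first attempt

# Retry rate-limit and transient server errors; 429 signals throttling.
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524, 408]
```

Combined with AUTOTHROTTLE, retries give a throttled crawl a chance to recover instead of silently dropping pages.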
- Run the spider and write the collected items to a JSON file.
$ scrapy crawl products -O products.json
##### snipped #####
2026-04-16 05:51:23 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: products.json
2026-04-16 05:51:23 [scrapy.core.engine] INFO: Spider closed (finished)
-O overwrites an existing file and keeps the result as one valid JSON array when the crawl finishes cleanly.
- Parse the exported file once to confirm the crawl produced valid JSON.
$ python3 -c "import json; print(len(json.load(open('products.json', encoding='utf-8'))))"
4
A numeric count confirms the exporter closed the JSON array correctly and that downstream tools can read the file.
- Review the exported records before wiring the crawl into a larger pipeline.
$ cat products.json
[
{"id": "p-0001", "name": "Starter Plan", "price": 29, "currency": "USD", "url": "https://api.example.net/products/starter-plan"},
{"id": "p-0002", "name": "Team Plan", "price": 79, "currency": "USD", "url": "https://api.example.net/products/team-plan"},
{"id": "p-0003", "name": "Growth Plan", "price": 129, "currency": "USD", "url": "https://api.example.net/products/growth-plan"},
{"id": "p-0004", "name": "Scale Plan", "price": 249, "currency": "USD", "url": "https://api.example.net/products/scale-plan"}
]
Notes
- response.json() reads the response body as JSON directly and keeps the spider code shorter than loading response.text manually.
- Scrapy 2.13 introduced the asynchronous start() method as the spider entry point; older project code may still define start_requests(), which continues to work.
- Use JSON Lines or another append-friendly format when repeated runs need to append records instead of replacing the whole file.
- Keep API tokens and session headers in settings, environment variables, or a secret store rather than committing them into the spider file.
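The JSON Lines format mentioned in the notes stays valid across appending runs (scrapy crawl products -o products.jsonl) because each line is an independent object. A small sketch of reading such a file back, with read_jsonl as a hypothetical helper name:

```python
import json


def read_jsonl(path):
    """Read one JSON object per line, the append-friendly layout that
    exporting to a .jsonl feed produces."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Unlike a single JSON array, a truncated or mid-append JSON Lines file loses at most its last line, which makes the format safer for repeated or interrupted crawls.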
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
