Scraping a JSON API with Scrapy pulls structured records directly from the application's data layer instead of rebuilding the same objects from rendered HTML. Because the useful data already arrives as arrays, objects, and pagination fields, the crawl stays resilient to markup and layout changes.
Current Scrapy releases start API requests from an async def start() method, while older project code may still use start_requests(); either way, the callback parses the response body with response.json(). A spider can yield item dictionaries from the returned payload and follow the next-page URL or cursor until the endpoint stops advertising more results.
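For a project pinned to an older release, a minimal start_requests() version of the same entry point might look like the sketch below; the spider name and endpoint URL are placeholders, not part of the project built in the steps that follow.
import scrapy

class LegacyProductsSpider(scrapy.Spider):
    name = "legacy_products"

    # Older Scrapy releases define start_requests() instead of async start().
    def start_requests(self):
        yield scrapy.Request(
            url="https://api.example.net/v1/products?page=1",  # placeholder endpoint
            callback=self.parse,
        )

    def parse(self, response):
        payload = response.json()  # decode the JSON body once per response
        yield from payload.get("products", [])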
Authenticated APIs often require bearer tokens, CSRF headers, or session cookies in addition to URL parameters. When later requests depend on Scrapy's cookie middleware, the Scrapy docs say to pass cookies through the cookies= argument rather than sending only a raw Cookie header, which the middleware does not process. POST-based APIs should switch to scrapy.http.JsonRequest, and conservative delays help avoid HTTP 429 responses or short-lived bans.
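As a sketch of that POST pattern, assuming a hypothetical /v1/csrf token endpoint and a hypothetical /v1/search endpoint, JsonRequest serializes its data= dict into the JSON request body, and any session cookies set by the first response stay under the cookie middleware's control for the follow-up request.
import scrapy
from scrapy.http import JsonRequest

class SearchSpider(scrapy.Spider):
    name = "search"

    async def start(self):
        # First request lets the server set session cookies and hand out a CSRF token.
        yield scrapy.Request("https://api.example.net/v1/csrf", callback=self.post_search)

    def post_search(self, response):
        token = response.json().get("token")  # assumed token field name
        yield JsonRequest(
            url="https://api.example.net/v1/search",  # assumed POST endpoint
            data={"query": "plan", "page": 1},        # serialized as the JSON body
            headers={"X-CSRF-Token": token},          # assumed header name
            callback=self.parse,
        )

    def parse(self, response):
        yield from response.json().get("results", [])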
Related: How to scrape a GraphQL API with Scrapy
Related: How to export Scrapy items to JSON
Steps to scrape a JSON API with Scrapy:
- Create a new Scrapy project for the API crawl.
$ scrapy startproject api_scrape
New Scrapy project 'api_scrape', using template directory '##### snipped #####', created in:
    /srv/api_scrape

You can start your first spider with:
    cd api_scrape
    scrapy genspider example example.com
- Change into the project directory.
$ cd api_scrape
- Generate a spider file for the API host.
$ scrapy genspider products api.example.net
Created spider 'products' using template 'basic' in module:
  api_scrape.spiders.products
- Replace the generated spider with JSON parsing, auth-header reuse, and next-page handling.
- api_scrape/spiders/products.py
from os import environ
from urllib.parse import urlencode

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["api.example.net"]
    api_endpoint = "https://api.example.net/v1/products"
    per_page = 100

    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1.0,
        "AUTOTHROTTLE_MAX_DELAY": 10.0,
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    def api_headers(self):
        # Attach a bearer token only when one is configured in the environment.
        headers = {"Accept": "application/json"}
        token = environ.get("EXAMPLE_API_TOKEN")
        if token:
            headers["Authorization"] = f"Bearer {token}"
        return headers

    async def start(self):
        # Request the first page of the paginated listing.
        params = {"page": 1, "per_page": self.per_page}
        yield scrapy.Request(
            url=f"{self.api_endpoint}?{urlencode(params)}",
            headers=self.api_headers(),
            callback=self.parse,
        )

    def parse(self, response):
        payload = response.json()
        for row in payload.get("products", []):
            yield {
                "id": row.get("id"),
                "name": row.get("name"),
                "price": row.get("price"),
                "currency": row.get("currency"),
                "url": row.get("url"),
            }
        # Follow the advertised next-page URL until the API stops returning one.
        next_url = payload.get("next")
        if next_url:
            yield response.follow(
                next_url,
                headers=self.api_headers(),
                callback=self.parse,
            )
The example expects product records under products and the next-page URL under next. Some APIs return a cursor token instead of a full URL, in which case the next request should rebuild the URL or JSON body from that cursor, as in the sketch below.
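Under that assumption, with next_cursor as a hypothetical response field and cursor as a hypothetical query parameter, a drop-in parse method for the spider above rebuilds the follow-up URL itself.
from urllib.parse import urlencode

import scrapy

def parse(self, response):
    payload = response.json()
    for row in payload.get("products", []):
        yield row
    # Assumed field name: stop paginating when the API omits or nulls the cursor.
    cursor = payload.get("next_cursor")
    if cursor:
        params = {"cursor": cursor, "per_page": self.per_page}
        yield scrapy.Request(
            url=f"{self.api_endpoint}?{urlencode(params)}",
            headers=self.api_headers(),
            callback=self.parse,
        )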
- Run the spider and overwrite the JSON export file for the current crawl.
$ scrapy crawl products -O products.json
2026-04-22 07:18:11 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: api_scrape)
##### snipped #####
2026-04-22 07:18:26 [scrapy.extensions.feedexport] INFO: Stored json feed (4 items) in: products.json
2026-04-22 07:18:26 [scrapy.core.engine] INFO: Spider closed (finished)
-O overwrites any existing products.json and writes one complete JSON array when the crawl finishes cleanly. Use JSON Lines when repeated runs should append or stream items instead of replacing the whole file.
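For example, exporting with the lowercase -o flag and a .jsonl extension appends one JSON object per line across runs instead of rewriting a single array:
$ scrapy crawl products -o products.jsonl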
- Parse the saved file once before handing it to another tool.
$ python3 -c "import json; print(len(json.load(open('products.json', encoding='utf-8'))))"
4
A numeric count confirms that the exported file closed as valid JSON instead of a truncated partial array.
- Review the exported records before wiring the crawl into a larger pipeline.
$ cat products.json
[
{"id": "p-0001", "name": "Starter Plan", "price": 29, "currency": "USD", "url": "https://api.example.net/products/starter-plan"},
{"id": "p-0002", "name": "Team Plan", "price": 79, "currency": "USD", "url": "https://api.example.net/products/team-plan"},
{"id": "p-0003", "name": "Growth Plan", "price": 129, "currency": "USD", "url": "https://api.example.net/products/growth-plan"},
{"id": "p-0004", "name": "Scale Plan", "price": 249, "currency": "USD", "url": "https://api.example.net/products/scale-plan"}
]
