Downloading images during a Scrapy crawl keeps each scraped item tied to the media that the page actually published. That matters for product catalogs, article archives, and training datasets because the crawl captures the image files alongside the extracted item data instead of leaving them on a remote site that may change later.
Scrapy handles this through ImagesPipeline. When a spider yields an item with an image_urls list, the pipeline schedules those image requests through the normal downloader, stores each file under IMAGES_STORE, and writes an images result list back into the item with the original URL, the stored path, an MD5 checksum of the image contents, and the download status. Current Scrapy releases still use image_urls and images as the default field names, skip re-downloading any image fetched within the last 90 days (the IMAGES_EXPIRES default), and name each stored file after the SHA-1 hash of its URL under full/.
Current ImagesPipeline still requires Pillow 8.3.2 or later. It also normalizes saved images to JPEG in RGB mode, so a source file such as .png or .webp is usually stored locally as a hashed *.jpg path. Relative src attributes must be converted with response.urljoin(), and redirected media URLs still need MEDIA_ALLOW_REDIRECTS = True before the crawl if the site sends the image through a CDN or signed redirect first.
Related: How to download files with Scrapy
Related: How to enable item pipelines in Scrapy
$ cd /srv/scrapy/images_demo
$ vi images_demo/items.py
import scrapy


class GalleryItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
ImagesPipeline reads URLs from image_urls and writes the per-image results to images unless those field names are customized.
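If the item already uses other field names, the pipeline can be pointed at them from settings.py through the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings. The fragment below is a minimal sketch; the names photo_urls and photos are hypothetical and would have to match fields declared on the item.

```python
# settings.py fragment (sketch): "photo_urls" and "photos" are hypothetical
# field names that the item class would need to declare.
IMAGES_URLS_FIELD = "photo_urls"    # field the pipeline reads image URLs from
IMAGES_RESULT_FIELD = "photos"      # field the pipeline writes results into
```

With the defaults left in place, neither setting is needed.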
$ vi images_demo/settings.py
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "images-store"
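If IMAGES_STORE is unset, ImagesPipeline stays disabled even though it is listed in ITEM_PIPELINES. If it points to the wrong location, the files are written somewhere other than where later steps look for them. In both cases the crawl can finish without the images that the item depends on.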
$ vi images_demo/spiders/gallery.py
import scrapy

from images_demo.items import GalleryItem


class GallerySpider(scrapy.Spider):
    name = "gallery"
    start_urls = ["http://media.example.net/gallery.html"]

    def parse(self, response):
        for card in response.css("figure"):
            image_src = card.css("img::attr(src)").get()
            if image_src:
                yield GalleryItem(
                    title=card.css("figcaption::text").get(),
                    image_urls=[response.urljoin(image_src)],
                )
image_urls must stay a list even when each item has only one image, and response.urljoin() converts a relative src value into a full request URL that the pipeline can download.
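The resolution step can be tried outside a crawl. The sketch below uses urllib.parse.urljoin from the standard library, which should give the same result as response.urljoin() as long as the page does not declare its own base URL. The helper name absolute_image_urls and the cdn.example.net host are illustrative assumptions, not part of the project above.

```python
from urllib.parse import urljoin

# URL of the page the spider parsed (taken from start_urls above).
PAGE_URL = "http://media.example.net/gallery.html"


def absolute_image_urls(page_url, src_values):
    """Resolve each non-empty src value against the page URL."""
    return [urljoin(page_url, src) for src in src_values if src]


urls = absolute_image_urls(
    PAGE_URL,
    [
        "images/gallery-1.png",          # relative to the page
        "/images/gallery-2.png",         # relative to the site root
        "//cdn.example.net/banner.png",  # protocol-relative (hypothetical host)
        None,                            # missing src is skipped
    ],
)
for url in urls:
    print(url)
```

Every value in the resulting list is a full URL, which is the form image_urls needs.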
$ scrapy crawl gallery -O gallery.jl
2026-04-22 05:58:50 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: images_demo)
##### snipped #####
2026-04-22 05:58:56 [scrapy.middleware] INFO: Enabled item pipelines: ['scrapy.pipelines.images.ImagesPipeline']
2026-04-22 05:58:57 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://media.example.net/images/gallery-1.png> referred in <None>
2026-04-22 05:58:59 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://media.example.net/images/gallery-2.png> referred in <None>
2026-04-22 05:58:59 [scrapy.extensions.feedexport] INFO: Stored jl feed (2 items) in: gallery.jl
If the image URL first returns 301 or 302, add MEDIA_ALLOW_REDIRECTS = True before this run or the media request is treated as failed.
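For a site that serves its images through redirects, the setting sits next to the pipeline configuration in settings.py. This is a minimal fragment and assumes the rest of the file stays as shown earlier.

```python
# settings.py fragment: let media requests follow 301/302 responses
# instead of counting the redirected download as failed.
MEDIA_ALLOW_REDIRECTS = True
```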
$ cat gallery.jl
{"title": "Gallery Image 1", "image_urls": ["http://media.example.net/images/gallery-1.png"], "images": [{"url": "http://media.example.net/images/gallery-1.png", "path": "full/190189632ca8f84ddd67245247213422e717d64b.jpg", "checksum": "3c101f2e17b8840bcab68cc377586430", "status": "downloaded"}]}
{"title": "Gallery Image 2", "image_urls": ["http://media.example.net/images/gallery-2.png"], "images": [{"url": "http://media.example.net/images/gallery-2.png", "path": "full/bc4de106168178ec5616391b97c904c442dd2796.jpg", "checksum": "fc1f6b7f319e089d0f1f04b239bf0396", "status": "downloaded"}]}
The path value is relative to IMAGES_STORE, not an absolute filesystem path.
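Turning those relative values into filesystem paths only requires joining them onto the store directory. The helper below is a sketch; the name stored_image_paths is hypothetical, and it assumes the feed is the JSON Lines file written by the crawl above and that the store directory is passed in by the caller.

```python
import json
from pathlib import Path


def stored_image_paths(jl_lines, images_store):
    """Yield (source URL, absolute file path) for each image in a JSON Lines feed."""
    store = Path(images_store).resolve()
    for line in jl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the feed
        item = json.loads(line)
        for image in item.get("images", []):
            # "path" is relative to IMAGES_STORE, so join it onto the store root.
            yield image["url"], store / image["path"]


# Possible usage from the project directory:
# with open("gallery.jl", encoding="utf-8") as feed:
#     for url, path in stored_image_paths(feed, "images-store"):
#         print(url, path, path.exists())
```

Checking path.exists() for each entry is a quick way to confirm that the feed and the store directory agree.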
$ ls images-store/full
190189632ca8f84ddd67245247213422e717d64b.jpg
bc4de106168178ec5616391b97c904c442dd2796.jpg
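The 40-character names in the listing have the shape of SHA-1 hex digests. The sketch below approximates how such a name could be derived from an image URL, assuming the default naming is the SHA-1 digest of the request URL with a .jpg suffix under full/. The function name default_image_path is hypothetical and is not a Scrapy API.

```python
import hashlib


def default_image_path(url):
    """Approximate the default stored path: full/<SHA-1 hex digest of the URL>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


path = default_image_path("http://media.example.net/images/gallery-1.png")
print(path)
```

Because the name depends only on the URL, the same image URL maps to the same stored file on every crawl.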
ImagesPipeline hashes the source URL into the stored filename by default. Override file_path() in a custom pipeline when original filenames or per-item folders matter more than the default hash-based layout.
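A file_path() override could look like the sketch below, which stores each image as a title-based folder plus the original file stem. The class name TitledImagesPipeline and the helpers slugify and titled_image_path are hypothetical, and the layout is only one possible choice. The .jpg suffix is kept because the pipeline converts stored images to JPEG.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

try:
    from scrapy.pipelines.images import ImagesPipeline
except ImportError:
    # Fallback so the path helpers can be tried without Scrapy installed.
    ImagesPipeline = object


def slugify(text):
    """Reduce a title to lowercase letters, digits, and hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", (text or "").lower()).strip("-")
    return slug or "untitled"


def titled_image_path(url, title):
    """Build <title-slug>/<original file stem>.jpg from an image URL and item title."""
    stem = PurePosixPath(urlparse(url).path).stem or "image"
    return f"{slugify(title)}/{stem}.jpg"


class TitledImagesPipeline(ImagesPipeline):
    """Hypothetical pipeline that replaces the hash-based layout."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # The item is passed as a keyword argument; fall back when it is absent.
        title = item.get("title") if item is not None else None
        return titled_image_path(request.url, title)
```

To use a class like this, ITEM_PIPELINES would reference its import path (for example a pipelines.py module inside the project) instead of scrapy.pipelines.images.ImagesPipeline. Unlike the hashed names, this layout lets two different URLs that share a title and file stem overwrite each other, so it suits sites where those combinations are known to be unique.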