Keeping images alongside scraped item data makes product catalogs, archives, and machine-learning datasets reproducible without relying on the source site staying online. Scrapy's ImagesPipeline automates image fetching and storage so spiders only need to extract image URLs.
ImagesPipeline reads image URLs from the image_urls field, schedules media requests through Scrapy’s downloader, and saves image files under the directory configured by IMAGES_STORE. After downloads complete, the pipeline populates the images field with metadata such as the original URL, the relative storage path (for example full/<sha1>.jpg), and a checksum for integrity tracking.
The pipeline processes images through Pillow and converts them to JPEG in RGB mode by default, so Pillow must be available in the runtime environment. Large image sets can consume disk and bandwidth quickly, and aggressive concurrency can trigger hotlink protection or rate limits, so keep crawl settings reasonable and store images on a volume with enough space. Image URLs should be absolute (or converted using response.urljoin()) so downloads do not fail on relative paths.
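The joining behavior can be checked outside Scrapy with the standard library, since response.urljoin(url) is equivalent to urllib.parse.urljoin(response.url, url); the base URL below is the demo gallery address used later in this guide, and the relative src value is an illustrative assumption.

```python
from urllib.parse import urljoin

# Base URL of the page being parsed (the demo gallery from this guide).
page_url = "http://app.internal.example:8000/gallery/"

# A relative src attribute as it might appear in the page HTML (made up
# for illustration).
relative_src = "img/photo-1.png"

# response.urljoin(url) delegates to urljoin(response.url, url).
absolute = urljoin(page_url, relative_src)
print(absolute)  # http://app.internal.example:8000/gallery/img/photo-1.png

# Already-absolute URLs pass through unchanged.
print(urljoin(page_url, "http://files.example.net:8000/images/gallery-1.png"))
```

Absolute URLs are returned as-is, so applying urljoin() unconditionally, as the spider below does, is safe.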
Related: How to download files with Scrapy
Related: How to enable item pipelines in Scrapy
$ vi images_demo/items.py
import scrapy


class GalleryItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
$ vi images_demo/settings.py
IMAGES_STORE = "/root/sg-work/images-store"
Files are stored under full/ inside IMAGES_STORE.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
ImagesPipeline requires Pillow; if it is missing, the pipeline cannot process or store images.
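Beyond the two required settings above, ImagesPipeline honors several optional knobs in settings.py; a sketch of the documented ones, with illustrative values:

```python
# Optional ImagesPipeline settings (values here are illustrative).
IMAGES_EXPIRES = 30        # skip re-downloading images fetched within 30 days
IMAGES_THUMBS = {          # also generate thumbnails under thumbs/<name>/
    "small": (50, 50),
    "big": (270, 270),
}
IMAGES_MIN_HEIGHT = 110    # discard images smaller than this many pixels
IMAGES_MIN_WIDTH = 110
```

Size filtering and thumbnailing both rely on Pillow, which is another reason it must be installed for this pipeline.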
$ mkdir -p /root/sg-work/images-store
$ vi images_demo/spiders/gallery.py
import scrapy

from images_demo.items import GalleryItem


class GallerySpider(scrapy.Spider):
    name = "gallery"
    start_urls = ["http://app.internal.example:8000/gallery/"]

    def parse(self, response):
        for card in response.css("figure"):
            item = GalleryItem()
            item["title"] = card.css("figcaption::text").get()
            item["image_urls"] = [
                response.urljoin(url)
                for url in card.css("img::attr(src)").getall()
            ]
            yield item
image_urls must always be a list, even when an item has only one image.
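Scrapy also accepts plain dicts as items, which makes the list requirement easy to illustrate on its own; a minimal sketch using the sample file host from this guide:

```python
# Even a single image URL must be wrapped in a list: ImagesPipeline
# iterates over item["image_urls"] and schedules one request per entry.
item = {
    "title": "Gallery Image 1",
    "image_urls": ["http://files.example.net:8000/images/gallery-1.png"],
}

# A bare string would be iterated character by character, and each
# one-character "URL" would fail request validation, so always build a list.
assert isinstance(item["image_urls"], list)
assert len(item["image_urls"]) == 1
```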
$ scrapy crawl gallery -O gallery.json
2026-01-01 09:49:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://files.example.net:8000/images/gallery-2.png> referred in <None>
2026-01-01 09:49:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://files.example.net:8000/images/gallery-1.png> referred in <None>
##### snipped #####
Stored filenames default to the SHA1 hash of the image URL as full/<sha1>.jpg.
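This naming scheme can be reproduced with the standard library; the sketch below mirrors the default behavior (SHA1 hex digest of the download URL, plus the .jpg suffix from the JPEG conversion) using one of the sample URLs from the output above:

```python
import hashlib

url = "http://files.example.net:8000/images/gallery-1.png"

# The default file path is the SHA1 hex digest of the request URL,
# stored under full/ with a .jpg extension.
digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
path = f"full/{digest}.jpg"
print(path)
```

Because the name derives from the URL alone, re-crawling the same URL overwrites (or skips) the same file rather than creating duplicates.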
$ head -n 18 gallery.json
[
{"title": "Gallery Image 2", "image_urls": ["http://files.example.net:8000/images/gallery-2.png"], "images": [{"url": "http://files.example.net:8000/images/gallery-2.png", "path": "full/ff26f717928f6093552dcc2faa024598524121cd.jpg", "checksum": "2775f338c469b19c338c4e0ea410271c", "status": "downloaded"}]},
{"title": "Gallery Image 1", "image_urls": ["http://files.example.net:8000/images/gallery-1.png"], "images": [{"url": "http://files.example.net:8000/images/gallery-1.png", "path": "full/24b14b0355f3bc22f1211a1baefd42641792246f.jpg", "checksum": "2775f338c469b19c338c4e0ea410271c", "status": "downloaded"}]}
]
$ ls /root/sg-work/images-store/full
24b14b0355f3bc22f1211a1baefd42641792246f.jpg  ff26f717928f6093552dcc2faa024598524121cd.jpg
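The checksum field in the item metadata is the MD5 hex digest of the downloaded file body, so stored files can be verified after the crawl; a small sketch (the commented path is the store layout from this guide and assumes the file exists):

```python
import hashlib
from pathlib import Path


def file_checksum(path: str) -> str:
    """MD5 hex digest of a stored file, comparable to an item's 'checksum'."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


# Example usage against the image store used in this guide:
# file_checksum(
#     "/root/sg-work/images-store/full/24b14b0355f3bc22f1211a1baefd42641792246f.jpg"
# )
```

Comparing these digests against the checksum values recorded in gallery.json confirms the stored files were not truncated or corrupted after download.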